Highlighting - catering for all query types

mark harwood Mon, 19 Oct 2009 03:30:21 -0700

I've been putting together some code to support highlighting of opaque query 
clauses (cached filters, trie range, spatial, etc.), and it shows some promise.


This is not intended as a replacement for the existing highlighter(s), which 
deal with free text. Instead it concentrates on the hard-to-highlight 
clauses and has the benefit of working in-line with the query process.
Summarisation is not a requirement here - I simply need to know if a given 
query clause matched on a result.

The approach I have come up with is to wrap query clauses with lightweight 
(processing and RAM-wise) instrumenting objects in order to record which 
clauses matched.
The recorded matches are encoded as a byte in the document score, which 
unfortunately costs some precision in the scores - more on this 
later.

The general approach for use looks like this:

        //Wrap *any* type of query object for highlight flagging and allocate a
        //flag number between 1 and 8 for the clauses of interest....
        FlagRecordingQuery frqA=new FlagRecordingQuery(
                new TermQuery(new Term("statusField","published")),1);
        FlagRecordingQuery frqB=new FlagRecordingQuery(
                new XyzLtd3rdPartyQuery("imageDataField", "unknown magic to find 'sunset'"),2);

        BooleanQuery bq=new BooleanQuery();
        bq.add(new BooleanClause(frqA,Occur.SHOULD));
        bq.add(new BooleanClause(frqB,Occur.SHOULD));

        //Parent query must be a FlagCombiningQuery to encode child match info
        //in the doc scores
        FlagCombiningQuery fcq=new FlagCombiningQuery(bq);

        //Run search
        TopDocs td = s.search(fcq,10);
        ScoreDoc[] sd = td.scoreDocs;
        for (ScoreDoc scoreDoc : sd)
        {
            float score=scoreDoc.score;

            //Check to see which flags are encoded in the score.
            if(FlagCombiningQuery.hasFlag(1, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 1 ");
            }
            if(FlagCombiningQuery.hasFlag(2, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 2 ");
            }
        }


The FlagRecordingQuery child clauses introduce themselves to the 
FlagCombiningQuery through a thread local at "rewrite" time.
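
As a rough illustration of that registration step, here is a minimal sketch that does not depend on Lucene. The names FlagRegistrySketch, RecordingClause and collectAndClear are made up for this example and are not the actual classes; the real FlagRecordingQuery would presumably do the equivalent inside its own rewrite method.

```java
import java.util.ArrayList;
import java.util.List;

public class FlagRegistrySketch {
    /** Hypothetical stand-in for a FlagRecordingQuery: only carries its flag number. */
    static class RecordingClause {
        final int flag;

        RecordingClause(int flag) {
            this.flag = flag;
        }

        /** At "rewrite" time the clause adds itself to the current thread's list. */
        RecordingClause rewrite() {
            REGISTERED.get().add(this);
            return this;
        }
    }

    /** One list per searching thread, so concurrent searches do not see each other's clauses. */
    static final ThreadLocal<List<RecordingClause>> REGISTERED =
            ThreadLocal.withInitial(ArrayList::new);

    /** Hypothetical root-side step: copy the registered clauses, then reset the thread's slot. */
    static List<RecordingClause> collectAndClear() {
        List<RecordingClause> found = new ArrayList<>(REGISTERED.get());
        REGISTERED.remove();
        return found;
    }

    public static void main(String[] args) {
        new RecordingClause(1).rewrite();
        new RecordingClause(2).rewrite();
        List<RecordingClause> clauses = collectAndClear();
        System.out.println("registered clauses: " + clauses.size());
    }
}
```

Clearing the thread local once the root has collected its children keeps clauses from one search leaking into the next search on a pooled thread.
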
The FlagCombiningQuery at the root adjusts the scores as follows:

        static final float DEFAULT_MULTIPLIER=1000f;
        float multiplier=DEFAULT_MULTIPLIER;
    ....
        public float score() throws IOException
        {
            float score = delegateScorer.score();
            byte flags=0;
            int d=doc();
            //encode all matched child clauses into a "flags" byte.
            for (FlagRecordingQuery frq : thisThreadsFlags)
            {
                if(frq.matched(d))
                {
                    byte mask=flagMasks[frq.flag-1];
                    flags=setFlag(flags, mask);
                }
            }

            //Multiply score to turn float into int with sufficient fractions
            //in score.
            int shiftedI=(int) (score*multiplier);
            //Shift int to make space for byte holding flags
            int iPlusSpaceForByte=shiftedI<<8;
            //Add match flags
            int iCombinedScoreAndFlags=iPlusSpaceForByte|flags;
            System.out.println("combined score="+iCombinedScoreAndFlags
                    +" for doc#"+doc());
            return iCombinedScoreAndFlags;
        }
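
The reading side is not shown above, so here is a minimal sketch of how the flags and the original score could be pulled back out of an encoded score. It assumes flag n (1 to 8) occupies bit n-1 of the low byte, which is a guess at what flagMasks holds, and it assumes the combined value stays below 2^24 so that a float can represent it exactly. FlagCodecSketch and originalScore are names invented for this example.

```java
public class FlagCodecSketch {
    static final float MULTIPLIER = 1000f;

    /** Mirrors the encoding in the message: scale, shift left 8 bits, OR in the flags byte. */
    static float encode(float score, int flags) {
        int shifted = (int) (score * MULTIPLIER);
        return (shifted << 8) | (flags & 0xFF);
    }

    /** Assumed layout: flag n (1..8) lives in bit n-1 of the low byte. */
    static boolean hasFlag(int flag, float encoded) {
        int combined = (int) encoded;
        return (combined & (1 << (flag - 1))) != 0;
    }

    /** Recovers the original score, truncated to the precision the multiplier allows. */
    static float originalScore(float encoded) {
        int combined = (int) encoded;
        return (combined >>> 8) / MULTIPLIER;
    }

    public static void main(String[] args) {
        float encoded = encode(1.5f, 0b0000_0011); // flags 1 and 2 set
        System.out.println("flag 1: " + hasFlag(1, encoded));
        System.out.println("flag 3: " + hasFlag(3, encoded));
        System.out.println("score:  " + originalScore(encoded));
    }
}
```

Since a float carries a 24-bit significand, it is worth checking how large the combined int can get before the low flag bits are rounded away on the conversion back to float.
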

The mechanism works but relies on original score values that:
a) Are not too big - i.e. do not lose significant digits when multiplied by 
"multiplier" and then shifted left 8 bits.
b) Are not too similar - i.e. do not differ only in very small fractions, e.g. 
all scores occurring in the range 0.1234 to 0.1235

To give an indication of the restrictions this imposes, here are the usable 
score ranges for various settings of "multiplier":

multiplier   max score   fraction precision
==========   =========   ==================
10           838860      0.x
100          83886       0.xx
1000         8388        0.xxx
10000        838         0.xxxx
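
The max score figures appear to follow from the room left in a signed 32-bit int once 8 bits are reserved for the flags. A small sketch of that arithmetic, with MaxScoreSketch and maxScore being names made up for this example:

```java
public class MaxScoreSketch {
    /** Largest whole score that still fits after reserving the low 8 bits of an int for flags. */
    static int maxScore(int multiplier) {
        int maxShifted = Integer.MAX_VALUE >> 8; // 8388607, i.e. 2^23 - 1
        return maxShifted / multiplier;
    }

    public static void main(String[] args) {
        for (int m : new int[] {10, 100, 1000, 10000}) {
            System.out.println(m + " -> " + maxScore(m));
        }
    }
}
```

This only covers the overflow limit of the int; it says nothing about how exactly the resulting value survives the conversion back to a float score.
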

I would imagine the majority of Lucene query results would still rank sensibly 
with a 1,000 or 10,000 multiplier.

However, all this potentially dangerous bit twiddling could of course be 
avoided if the Lucene search API were expanded to include docid, score AND a 
completely separate field for recording match flags.
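
To make that suggestion concrete, here is a sketch of what such a result record might look like. FlaggedScoreDoc is purely hypothetical and is not part of Lucene's API; the bit layout (flag n in bit n-1) is likewise an assumption for the example.

```java
public class FlaggedScoreDocSketch {
    /** Hypothetical result record: doc id, untouched score, and a separate flags byte. */
    static final class FlaggedScoreDoc {
        final int doc;
        final float score; // full precision, no bits borrowed for flags
        final byte flags;  // one bit per flagged clause

        FlaggedScoreDoc(int doc, float score, byte flags) {
            this.doc = doc;
            this.score = score;
            this.flags = flags;
        }

        /** Assumed layout: flag n (1..8) lives in bit n-1 of the flags byte. */
        boolean hasFlag(int flag) {
            return (flags & (1 << (flag - 1))) != 0;
        }
    }

    public static void main(String[] args) {
        FlaggedScoreDoc hit = new FlaggedScoreDoc(42, 0.1234f, (byte) 0b0000_0101);
        System.out.println("doc " + hit.doc + " score " + hit.score
                + " clause 1: " + hit.hasFlag(1) + " clause 2: " + hit.hasFlag(2));
    }
}
```
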


Thoughts?



