On Thu, Jun 4, 2009 at 1:25 AM, Marvin Humphrey <[email protected]> wrote:
> The tricky part of highlighting is selecting one or more excerpts. Unless we
> end up with a very strange excerpting algorithm that loses positional
> information, applying the actual highlighting tags at the end will be a
> straightforward, mechanical process.
>
> To select the best fragments, we will create objects representing candidates
> and assemble them in a PriorityQueue. The queue's Less_Than() method will
> favor "better" fragments. Whether Less_Than() performs real analysis on the
> fly, or whether we score the fragments in advance and have Less_Than simply
> compare fixed scores has yet to be determined, but the important point is that
> we must design our OO hierarchy and fragment ranking infrastructure to support
> as many excerpting algorithms as possible.
This sounds nice and pluggable. I get to provide my ranking
(implement Less_Than()) and the framework does the rest...
Could (or would) I score a Fragment separately from an Extract?
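As a rough sketch of that pluggability (Python, purely illustrative: `Fragment`, `best_fragments` and the `less_than` callback are hypothetical names, not the KS or Lucy API), the framework could keep the N best candidates in a bounded heap and leave the ordering entirely to the caller:

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Fragment:
    """Hypothetical candidate excerpt: a region of the field plus a fixed score."""
    offset: int
    length: int
    score: float


# The pluggable part: returns True when `a` is a worse fragment than `b`.
LessThan = Callable[[Fragment, Fragment], bool]


class _Ranked:
    """Adapter so heapq orders fragments with the caller's less_than."""

    def __init__(self, frag: Fragment, less_than: LessThan) -> None:
        self.frag = frag
        self._less_than = less_than

    def __lt__(self, other: "_Ranked") -> bool:
        return self._less_than(self.frag, other.frag)


def best_fragments(candidates: Iterable[Fragment],
                   less_than: LessThan, n: int) -> List[Fragment]:
    """Keep the n best candidates in a bounded min-heap; return best first."""
    if n <= 0:
        return []
    heap: List[_Ranked] = []
    for frag in candidates:
        item = _Ranked(frag, less_than)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif less_than(heap[0].frag, frag):
            # The new candidate beats the worst one currently kept.
            heapq.heapreplace(heap, item)
    return [r.frag for r in sorted(heap, reverse=True)]
```

Here `less_than` compares precomputed scores if you pass `lambda a, b: a.score < b.score`, but it could just as well do its analysis on the fly.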
I assume extracting and highlighting will be well decoupled (yet share
positional match details)? So that, e.g., I could skip extracting but
still do highlighting? (E.g., pull a fixed abstract for the hit, previously
stored in the index in its entirety, and run highlighting on it; also run
highlighting (not extracting) on the presented title of each doc.)
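To illustrate the decoupling I have in mind, here is a minimal sketch (Python; the `highlight` function and its `pre`/`post` parameters are made up for illustration, and no escaping of the text is attempted). It only needs the text and a list of (offset, length) pairs in code points, so it works the same on a stored abstract, a title, or a selected excerpt:

```python
from typing import Iterable, Tuple


def highlight(text: str, spans: Iterable[Tuple[int, int]],
              pre: str = "<b>", post: str = "</b>") -> str:
    """Wrap each (offset, length) span of `text` in pre/post tags.

    Offsets and lengths count code points, which is how Python indexes str.
    Overlapping spans are clipped so that tags never nest.
    """
    out = []
    cursor = 0
    for offset, length in sorted(spans):
        start = max(offset, cursor)
        end = min(offset + length, len(text))
        if end <= start:
            continue  # span is empty, out of range, or already covered
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

No excerpt selection happens here at all; the caller decides which text to pass in.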
> There will be two main sources of input to the fragment ranker.
>
> * Information from the Query about where it matched.
> * Sentence boundary information.
>
> Right now in the KS implementation, sentence boundary information is
> calculated on the fly at runtime, via Highlighter_Find_Sentences(). However,
> this seems wasteful, because sentence boundaries can be known at index-time.
> Perhaps we ought to be storing sentence boundary information in the index.
I agree: this analysis really ought to be done at indexing time &
stored away.
Like Father, I also think this should be generic (not just
"sentences"): maybe I want extracts from the abstract only, or [say]
collated by page, chapter, or section; or I want to favor sentences
early in the paragraph or page; or a fragment should not cross a
chapter/section boundary; etc.
We need some generic way to record document "spans" in the index such
that the extract scorer can consult that info.
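For concreteness, a sketch of what such generic span records might look like (Python; `DocSpan`, its `kind` labels and `within_one_span` are hypothetical, and nothing here says how the spans would actually be encoded in the index):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DocSpan:
    """Hypothetical index-time record: a labeled region of a field."""
    kind: str     # e.g. "sentence", "page", "section", "abstract"
    offset: int   # start, in code points from the top of the field
    length: int   # size, in code points

    @property
    def end(self) -> int:
        return self.offset + self.length


def within_one_span(frag_offset: int, frag_length: int,
                    spans: Iterable[DocSpan], kind: str) -> bool:
    """True if the fragment lies entirely inside a single span of `kind`."""
    frag_end = frag_offset + frag_length
    return any(
        s.kind == kind and s.offset <= frag_offset and frag_end <= s.end
        for s in spans
    )
```

An extract scorer could use a check like this to reject (or penalize) a candidate fragment that straddles a section boundary, without knowing anything about sentences specifically.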
Such metadata in the index can also help us solve the "final inch"
problem (how to take the user to the exact spot(s) in a large doc that
"matched" the query).
EG, say I have a collection of PDF reference documentation or
something. I can treat each page of each doc as a "sub-result", while
still indexing the entire PDF as one document. So you see one search
result ("group") for this large PDF, but perhaps up to 3 different
clickable pages under there. Each hit inside the extract could be
clickable, and would take you straight to the spot in the page where
the hit came from.
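A sketch of the page lookup this implies (Python; `page_starts` is an assumed sorted list of page start offsets recorded at index time, and `top_pages` simply ranks pages by hit count, which is only one of many possible policies):

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence


def page_for_offset(page_starts: Sequence[int], offset: int) -> int:
    """Map a hit offset to the zero-based index of the page containing it."""
    if not page_starts or offset < page_starts[0]:
        raise ValueError("offset precedes the first page")
    return bisect_right(page_starts, offset) - 1


def top_pages(page_starts: Sequence[int], hit_offsets: Sequence[int],
              max_pages: int = 3) -> List[int]:
    """Return up to max_pages page indexes, most hits first, ties by page order."""
    counts = Counter(page_for_offset(page_starts, o) for o in hit_offsets)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:max_pages]]
```

The PDF stays a single indexed document; the page boundaries are only consulted when presenting the hit.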
> Information from the query about where it matched can be gleaned from the
> weighted-query class. In Lucene, this class goes by the truly dreadful name
> of "Weight"; in KinoSearch, it goes by the still-dreadful-but-different name
> "Compiler". We need to use Weight/Compiler rather than Query because some
> algorithms (e.g. current KS) depend on IDF, which is only known after
> weighting the Query against a given collection of documents to produce a
> Weight/Compiler.
>
> The current KS implementation, Compiler's Highlight_Spans() method, returns a
> VArray of Span objects, each of which has an offset (measured in Unicode code
> points from the top of the field), a length (also a count of Unicode code
> points), and a floating point "weight". Compound queries such as ANDQuery and
> ORQuery simply produce a VArray which unions the Spans produced by their
> child nodes.
So with an AND/OR query we lose information about how the sub-spans are
supposed to match up.
But I agree, that limitation could be academic, and may simply not
matter in practice, e.g. if we strongly favor term diversity.
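To make the information loss concrete, a toy version of the union (Python; this `Span` mirrors the offset/length/weight fields described above, but `union_spans` is an illustrative stand-in, not the actual Highlight_Spans() code):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    offset: int    # code points from the top of the field
    length: int    # code points
    weight: float


def union_spans(children: Iterable[Iterable[Span]]) -> List[Span]:
    """Flatten the child nodes' spans into one list, ordered by position.

    Nothing in the result records which child produced which span, or
    whether the parent was an AND or an OR.
    """
    merged = [span for child in children for span in child]
    merged.sort(key=lambda s: (s.offset, s.length))
    return merged
```

The caller gets one flat list of weighted spans whether the parent was an ANDQuery or an ORQuery.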
Lucene's Query/Weight/Scorer, unfortunately, cannot produce the spans
(unless it's a SpanQuery). I think Lucene ought to simply combine
normal & span queries.
> I have an intuitive feeling that an array of weighted score spans will be
> useful in other contexts besides highlighting, and the method is pretty easy
> to grok. However, this simple algo has a flaw: it's not clear what part of
> the query produced each score span. Some algorithms will want to influence
> selection of fragments based on term diversity, so that for example, multiple
> fragments would be preferred over a single fragment if they represented
> different parts of the query.
>
> Perhaps if each Span were to include a reference to the original Query object
> which produced it? These would be primitives such as TermQuery and
> PhraseQuery rather than compound queries like ANDQuery. Would that reference
> be enough to implement a preference for term diversity in the excerpting algo?
> And might that information come in handy for other excerpting algos?
Referencing the original atomic query on each hit makes perfect sense.
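As a sketch of how that reference could drive a term-diversity preference (Python; `Span.source` is a plain string standing in for a reference to the atomic query object, and the greedy `pick_diverse` policy is just one assumption about how diversity might be favored):

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple

Fragment = Tuple[int, int]  # (offset, length), in code points


@dataclass(frozen=True)
class Span:
    offset: int
    length: int
    weight: float
    source: str  # stand-in for a reference to the TermQuery/PhraseQuery


def _inside(span: Span, frag: Fragment) -> bool:
    start, length = frag
    return start <= span.offset and span.offset + span.length <= start + length


def sources_in(frag: Fragment, spans: Sequence[Span]) -> Set[str]:
    """Which atomic queries have at least one span inside the fragment."""
    return {s.source for s in spans if _inside(s, frag)}


def weight_in(frag: Fragment, spans: Sequence[Span]) -> float:
    """Total weight of the spans that fall inside the fragment."""
    return sum(s.weight for s in spans if _inside(s, frag))


def pick_diverse(frags: Sequence[Fragment], spans: Sequence[Span],
                 n: int) -> List[Fragment]:
    """Greedily pick fragments that cover not-yet-seen query parts first,
    breaking ties by total span weight."""
    remaining = list(frags)
    chosen: List[Fragment] = []
    covered: Set[str] = set()
    while remaining and len(chosen) < n:
        best = max(remaining, key=lambda f: (len(sources_in(f, spans) - covered),
                                             weight_in(f, spans)))
        chosen.append(best)
        covered |= sources_in(best, spans)
        remaining.remove(best)
    return chosen
```

With only offset/length/weight this selection is impossible; with the per-span source, a lower-weight fragment that covers a different part of the query can win over a second fragment repeating the same term.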
I assume the output of highlighting will be fully structured objects?
Eg an Extract is one or more Fragments, and each Fragment is a series of
term chunks (concatenated). Each term chunk carries information such as
whether or not it was a hit; analysis metadata (sentence, abstract, section,
page, whatever); if it was a hit, which original query(ies) hit; etc.
(So that one can cast this output to HTML, XML, some special UI, etc.)
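Roughly this shape, as a sketch (Python; `Chunk`, `Fragment`, `Extract` and `to_html` are hypothetical names, and the metadata is reduced to a tuple of query labels to keep it short):

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    """One run of text within a fragment."""
    text: str
    is_hit: bool = False
    sources: Tuple[str, ...] = ()  # which original query(ies) hit, if any


@dataclass(frozen=True)
class Fragment:
    chunks: List[Chunk]


@dataclass(frozen=True)
class Extract:
    fragments: List[Fragment]


def to_html(extract: Extract, ellipsis: str = " ... ") -> str:
    """One possible rendering: escape all text, bold the hits,
    and join fragments with an ellipsis."""
    rendered = []
    for frag in extract.fragments:
        rendered.append("".join(
            f"<b>{escape(c.text)}</b>" if c.is_hit else escape(c.text)
            for c in frag.chunks
        ))
    return ellipsis.join(rendered)
```

An XML or custom-UI renderer would walk the same objects, so the markup decision stays out of the excerpting code.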
Mike