On Jun 3, 2009, at 10:25 PM, Marvin Humphrey wrote:
> Right now in the KS implementation, sentence boundary information is
> calculated on the fly at runtime, via Highlighter_Find_Sentences(). However,
> this seems wasteful, because sentence boundaries can be known at index-time.
> Perhaps we ought to be storing sentence boundary information in the index.

Would you extend the Analysis interface to allow for custom sentence
algorithms? Could the sentences be numbered, so the final fragment has
information about *which* sentence it came from? (I could use this for
pagination.)
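
To make the numbering idea concrete, here is a rough sketch of what "sentence boundaries stored once, then looked up by number" could look like. Everything in it is an assumption for illustration: find_sentences() and sentence_number() are hypothetical names, not part of the KS Analysis interface, and the regex is a naive stand-in for whatever sentence algorithm would actually be plugged in.

```python
import bisect
import re

# Hypothetical helpers for illustration only; not KinoSearch's API.
# Naive rule: a sentence ends at ., ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r'[.!?]+(?:\s+|$)')

def find_sentences(text):
    """Return a list of (start, end) character offsets, one per sentence."""
    bounds = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        end = match.end()
        if text[start:end].strip():
            bounds.append((start, end))
        start = end
    # Keep any trailing text that has no closing punctuation.
    if text[start:].strip():
        bounds.append((start, len(text)))
    return bounds

def sentence_number(bounds, offset):
    """Return the 0-based number of the sentence containing offset, or None."""
    starts = [s for s, _ in bounds]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0 and offset < bounds[i][1]:
        return i
    return None
```

The boundaries list is what would be computed at index time and stored; at search time a fragment's start offset maps to a sentence number with a binary search. A custom sentence algorithm would only need to replace find_sentences() and return the same list of offsets.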

> Perhaps if each Span were to include a reference to the original Query object
> which produced it? These would be primitives such as TermQuery and
> PhraseQuery rather than compound queries like ANDQuery. Would that reference
> be enough to implement a preference for term diversity in the excerpting algo?

There is one scenario I can think of where that *might* not work. If
someone searches for a list of keywords that includes the same keyword
twice (e.g., I sometimes copy and paste a sentence to find documents
with similar content), then there will be two TermQueries that are
identical but considered different. Maybe this won’t matter because
the duplicate term should have extra weight. I haven’t thought this
through.
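
The duplicate-keyword case can be sketched as follows. This is only an illustration under assumed names: the TermQuery and Span classes here are hypothetical stand-ins, not the KS classes, and "diversity" is simply the count of distinct queries that contributed spans to a candidate excerpt.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration; not the KinoSearch classes.
@dataclass(frozen=True)
class TermQuery:
    field: str
    term: str

@dataclass
class Span:
    start: int
    length: int
    query: object  # the primitive query object that produced this span

def diversity_by_identity(spans):
    """Count distinct query objects, so two identical TermQueries count twice."""
    return len({id(s.query) for s in spans})

def diversity_by_value(spans):
    """Count distinct query values, so two identical TermQueries count once."""
    return len({s.query for s in spans})

# The same keyword pasted twice yields two equal but separate TermQuery objects.
first = TermQuery("content", "fox")
second = TermQuery("content", "fox")
spans = [Span(0, 3, first), Span(10, 3, second)]
```

With these spans, counting by identity reports two distinct queries and counting by value reports one. Which of the two is preferable depends on whether the repeated keyword is supposed to carry extra weight.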

> And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and
not a clone, I think it would.
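
A small sketch of why the original-versus-clone distinction matters, again with hypothetical names (this PhraseQuery class and spans_for() are illustrative only, and spans are plain (start, length, query) tuples here): caller code that holds the query it built can only find its spans by identity if the span points at that very object.

```python
import copy

# Hypothetical stand-in for illustration; not the KinoSearch class.
class PhraseQuery:
    def __init__(self, field, terms):
        self.field = field
        self.terms = list(terms)

def spans_for(query, spans):
    """Select the (start, length, query) tuples produced by this exact object."""
    return [s for s in spans if s[2] is query]

original = PhraseQuery("content", ["quick", "fox"])
clone = copy.copy(original)

spans_from_original = [(5, 9, original)]
spans_from_clone = [(5, 9, clone)]
```

Here spans_for(original, spans_from_original) returns the span, while spans_for(original, spans_from_clone) returns an empty list, even though the clone has the same field and terms.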