On Jun 3, 2009, at 10:25 PM, Marvin Humphrey wrote:

Right now in the KS implementation, sentence boundary information is calculated on the fly at runtime, via Highlighter_Find_Sentences(). However, this seems wasteful, because sentence boundaries can be known at index-time. Perhaps we ought to be storing sentence boundary information in the index.

Would you extend the Analysis interface to allow for custom sentence algorithms? Could the sentences be numbered, so the final fragment has information about *which* sentence it came from? (I could use this for pagination.)
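
To make the numbered-sentence idea concrete, here is a minimal Python sketch of what a pluggable sentence splitter might produce at index time. Everything in it (Sentence, naive_splitter, index_sentences, sentence_for_offset) is a hypothetical name for illustration, not part of the KS API, and the regex is a deliberately naive boundary rule that a custom algorithm would replace.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Sentence(NamedTuple):
    number: int  # ordinal position of the sentence in the document, from 0
    start: int   # offset of the first character
    end: int     # offset one past the last character


# Naive rule (an assumption): a sentence ends at ., ! or ? followed by
# whitespace or by the end of the text.
_BOUNDARY = re.compile(r"[.!?]+(?:\s+|$)")


def naive_splitter(text: str) -> List[Sentence]:
    sentences: List[Sentence] = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        # Exclude the trailing whitespace consumed by the boundary match.
        end = start + len(text[start:match.end()].rstrip())
        if end > start:
            sentences.append(Sentence(len(sentences), start, end))
        start = match.end()
    # Keep any trailing text that lacks terminal punctuation.
    if text[start:].strip():
        end = start + len(text[start:].rstrip())
        sentences.append(Sentence(len(sentences), start, end))
    return sentences


def index_sentences(
    text: str,
    splitter: Callable[[str], List[Sentence]] = naive_splitter,
) -> List[Sentence]:
    # A custom algorithm is plugged in by passing a different splitter.
    return splitter(text)


def sentence_for_offset(
    sentences: List[Sentence], offset: int
) -> Optional[Sentence]:
    # Map a highlight offset back to the numbered sentence containing it.
    for sentence in sentences:
        if sentence.start <= offset < sentence.end:
            return sentence
    return None
```

With boundaries stored this way, a fragment built around a given offset could report the sentence number it came from, which is the piece pagination would need.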

Perhaps if each Span were to include a reference to the original Query object which produced it? These would be primitives such as TermQuery and PhraseQuery rather than compound queries like ANDQuery. Would that reference be enough to implement a preference for term diversity in the excerpting algo?

There is one scenario I can think of where that *might* not work. If someone searches for a list of keywords that includes the same keyword twice (e.g., I sometimes copy and paste a sentence to find documents with similar content), then there will be two TermQueries that are identical but considered different. Maybe this won’t matter because the duplicate term should have extra weight. I haven’t thought this through.
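
The duplicate-keyword case can be sketched as follows. This is a hedged Python illustration only: TermQuery and Span here are simplified stand-ins with hypothetical fields, not the KS classes. It contrasts counting distinct queries by object identity with counting them by value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TermQuery:
    """Stand-in leaf query; two instances with the same term compare equal."""
    term: str


@dataclass
class Span:
    offset: int
    length: int
    query: TermQuery  # the leaf query object that produced this span


def diversity_by_identity(spans: List[Span]) -> int:
    # Counts distinct query *objects*: duplicates in the query count twice.
    return len({id(span.query) for span in spans})


def diversity_by_value(spans: List[Span]) -> int:
    # Counts distinct query *values*: duplicate terms collapse into one.
    return len({span.query for span in spans})


# The same keyword appears twice in the search, plus one other keyword.
first_apple = TermQuery("apple")
second_apple = TermQuery("apple")
pear = TermQuery("pear")

spans = [
    Span(0, 5, first_apple),
    Span(20, 5, second_apple),
    Span(40, 4, pear),
]
```

Here diversity_by_identity(spans) is 3 while diversity_by_value(spans) is 2. Whether the identity count is the desirable one depends on whether a repeated keyword should earn extra weight, which is the open question above.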

And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and not a clone, I think it would.
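
A small Python sketch of why the clone matters, again using hypothetical stand-ins rather than KS classes: if an excerpter looks up spans by the identity of the query object, a clone of that query matches nothing.

```python
import copy
from typing import List, Tuple


class TermQuery:
    """Stand-in leaf query with default (identity-based) comparison."""

    def __init__(self, term: str) -> None:
        self.term = term


# Each span is modeled as (offset, length, originating query object).
SpanTuple = Tuple[int, int, TermQuery]


def spans_for_query(spans: List[SpanTuple], query: TermQuery) -> List[SpanTuple]:
    # Match on object identity, not on the term text.
    return [span for span in spans if span[2] is query]


original = TermQuery("apple")
clone = copy.copy(original)

spans = [(0, 5, original), (20, 5, original)]
```

Looking up with original finds both spans; looking up with clone finds none, even though clone.term is the same string.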
