Re: Excerpting algos

Father Chrysostomos Fri, 05 Jun 2009 16:43:03 -0700


On Jun 3, 2009, at 10:25 PM, Marvin Humphrey wrote:

Right now in the KS implementation, sentence boundary information is
calculated on the fly at runtime, via Highlighter_Find_Sentences().
However, this seems wasteful, because sentence boundaries can be known at
index-time. Perhaps we ought to be storing sentence boundary information
in the index.

Would you extend the Analysis interface to allow for custom sentence
algorithms? Could the sentences be numbered, so the final fragment has
information about *which* sentence it came from? (I could use this for
pagination.)
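
A rough sketch of what numbered, index-time sentence boundaries could look like. It is written in Python rather than the KS C code, and `Sentence`, `find_sentences`, `index_sentences`, and `sentence_at` are made-up names for illustration, not part of KinoSearch. The splitting rule is deliberately naive; the point is only that the splitter is pluggable and that each stored boundary carries its sentence number.

```python
import bisect
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    number: int  # 0-based position of the sentence within the field
    offset: int  # start of the sentence, in characters
    length: int  # length of the sentence, in characters


# Naive rule: a sentence runs from a non-space character up to '.', '!' or '?'
# followed by whitespace, or up to the end of the text.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def find_sentences(text):
    """Default splitter; any callable returning (start, end) pairs could replace it."""
    return [m.span() for m in _SENTENCE.finditer(text)]


def index_sentences(text, splitter=find_sentences):
    """Meant to run once at index time, with the result stored alongside the field."""
    return [Sentence(n, start, end - start)
            for n, (start, end) in enumerate(splitter(text))]


def sentence_at(sentences, offset):
    """Return the last sentence starting at or before `offset`, or None."""
    starts = [s.offset for s in sentences]
    i = bisect.bisect_right(starts, offset) - 1
    return sentences[i] if i >= 0 else None


sentences = index_sentences("One. Two two! Three?")
fragment_start = 9  # hypothetical offset of an excerpt chosen by the highlighter
print(sentence_at(sentences, fragment_start).number)  # prints 1
```

With something like this, the excerpt only needs its start offset to recover the sentence number, which is the piece pagination would need.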

Perhaps if each Span were to include a reference to the original Query
object which produced it?  These would be primitives such as TermQuery and
PhraseQuery rather than compound queries like ANDQuery. Would that
reference be enough to implement a preference for term diversity in the
excerpting algo?

There is one scenario I can think of where that *might* not work. If
someone searches for a list of keywords that includes the same keyword
twice (e.g., I sometimes copy and paste a sentence to find documents
with similar content), then there will be two TermQueries that are
identical but considered different. Maybe this won’t matter because
the duplicate term should have extra weight. I haven’t thought this
through.
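
To make the duplicate-keyword case concrete, here is a small Python sketch. The `TermQuery` and `Span` classes below are stand-ins invented for illustration, not the KS classes; the only assumption is that each span carries a reference to the query that produced it. Counting distinct queries by object identity treats the repeated keyword as two different terms, while counting by value treats it as one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:  # stand-in: compares and hashes by (field, term)
    field: str
    term: str


@dataclass(frozen=True)
class Span:  # stand-in: a highlighted region plus the query that produced it
    offset: int
    length: int
    query: object


def distinct_by_identity(spans):
    """Diversity score where every query object counts separately."""
    return len({id(s.query) for s in spans})


def distinct_by_value(spans):
    """Diversity score where equal queries are merged."""
    return len({s.query for s in spans})


# Query pasted from a sentence: "the cat saw the dog" contains "the" twice,
# so two equal but separate TermQuery objects are built for it.
the_first = TermQuery("body", "the")
the_second = TermQuery("body", "the")

# A candidate excerpt window that matched only the word "the", two times.
window = [Span(0, 3, the_first), Span(12, 3, the_second)]

print(distinct_by_identity(window))  # prints 2
print(distinct_by_value(window))     # prints 1
```

Whether 2 or 1 is the better answer here depends on whether the duplicate is meant to carry extra weight, which is the open question above.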

And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and
not a clone, I think it would.
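
The reason the clone matters, sketched in Python with a made-up `TermQuery` class and a hypothetical `came_from` helper: code that matches a span's query against the caller's own query objects by identity stops working once a copy is handed back instead.

```python
import copy


class TermQuery:  # stand-in class, not the KS one
    def __init__(self, field, term):
        self.field = field
        self.term = term


def came_from(span_query, my_queries):
    """True if span_query is one of the caller's own objects (identity test)."""
    return any(span_query is q for q in my_queries)


original = TermQuery("body", "cat")
clone = copy.copy(original)  # same field and term, but a different object

print(came_from(original, [original]))  # prints True
print(came_from(clone, [original]))     # prints False
```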
