On Jun 6, 2009, at 9:22 AM, Marvin Humphrey wrote:
On Fri, Jun 05, 2009 at 02:42:46PM -0700, Father Chrysostomos wrote:
First, a bit of good news: I've managed to fix the current KS Highlighter sentence-boundary trimming implementation without needing to start over from scratch, and without causing any problems for the KSx::Highlight::Summarizer test suite. That means we don't have to conclude this discussion and finish the implementation to unblock a KS dev release. (: For better or worse. :)
I don’t know whether you are aware: I cheated and copied & pasted find_sentence_boundaries from KS r3122 to KSx:H:S, since I was in a hurry.
Would you extend the Analysis interface to allow for custom sentence algorithms?
Since this is a tokenization task, Analyzer would be a logical place to turn. I think we'll need to make two passes over the text, one for search tokens and one for sentences.

Do we actually need to extend Analyzer, though? I think we ought to avoid giving Analyzer a Find_Sentences() method. Instead, we can just create an Analyzer instance which tokenizes at sentence boundaries. Probably we'll want to create a dedicated SentenceTokenizer subclass, which would not be publicly exposed.
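Something like this, I take it? A minimal sketch of such a subclass; the transform_text() override and the Token/Inversion constructor args are my guesses at the current dev API, not settled names:

package SentenceTokenizer;
use base qw( KinoSearch::Analysis::Analyzer );

# Tokenize at naive sentence boundaries, keeping only the offsets
# that highlighting cares about.  (API names are assumptions.)
sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = KinoSearch::Analysis::Inversion->new;
    while ( $text =~ /(\S.*?\.)(?=\s|\z)/gs ) {
        $inversion->append(
            KinoSearch::Analysis::Token->new(
                text         => $1,
                start_offset => $-[1],
                end_offset   => $+[1],
            )
        );
    }
    return $inversion;
}

1;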
I’ve just had an idea: since we have 1) words, 2) sentences, and 3) pages, why not multiple levels of vector information? Or multiple ‘sets’ (which could be orthogonal/overlapping)? Someone may want to include paragraphs or chapters, for instance. Just a thought....
Instead, we can turn TermVectorsWriter into a public HighlightWriter class and give it a Set_Sentence_Tokenizer() method. Extensibility would happen via Architecture:
package MyArchitecture;
use base qw( KinoSearch::Architecture );

sub register_highlight_writer {
    my ( $self, $seg_writer ) = @_;
    $self->SUPER::register_highlight_writer($seg_writer);
    my $hl_writer
        = $seg_writer->obtain("KinoSearch::Index::HighlightWriter");
    $hl_writer->set_sentence_tokenizer( MySentenceTokenizer->new );
}
Or maybe $hl_writer->add_tokenizer( MySentenceTokenizer->new )? We may need to distinguish between ‘offset tokenisers’ and ‘term tokenisers’.
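And presumably the custom Architecture gets hooked in through a Schema subclass; a rough sketch, assuming Schema still exposes an overridable architecture() method (that spelling is my guess):

package MySchema;
use base qw( KinoSearch::Schema );

# Hand the indexer the Architecture that registers our custom
# sentence tokenizer.  (The architecture() hook is an assumption.)
sub architecture { MyArchitecture->new }

1;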
I think this approach will work provided that it's possible to use the same sentence boundary detection algo across most or all of the languages supported by Snowball. (Does the basic algo of splitting on /\.\s+/ work for Greek?)
Yes, except for the same problem that it causes in English: ‘M. Humphrey’ becomes two sentences. (As an aside, your default tokeniser doesn’t work with Greek, which can have mid-commas, but the only two words with mid-commas [ὅ,τι and ὅ,τιδηποτε] are stop-words, so I don’t worry about it.)
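To illustrate the abbreviation problem in plain Perl (nothing KS-specific here):

use strict;
use warnings;

# The naive rule treats the period after an initial as a sentence end.
my @sentences = split /\.\s+/, 'M. Humphrey wrote the patch. It works.';
print scalar @sentences, "\n";    # 3, not 2 -- 'M' comes out as a sentence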
CJK users and others for whom our algo would fail would need to spec a custom Architecture -- though only if they want highlighting, since it's off by default. It's a bit more work for that class of user, but it prevents us from having to add clutter to the crucial core classes of Analyzer and Schema.
It will be somewhat wasteful if we use this SentenceTokenizer class to create full-fledged tokens when all we need is offsets, but I think we would handle further optimizations via natural extensions to either Analyzer or Inversion. I say "natural", because we would be merely repurposing the same offset information that Tokenizer normally feeds to Token's constructor, as opposed to glomming on a Find_Sentences() method which would apply a completely different tokenizing algorithm.
Sounds good.
Could the sentences be numbered, so the final fragment has information about *which* sentence it came from? (I could use this for pagination.)
I think that would work. The current "DocVector" class needs to mutate into "InvertedDoc" or something like that, and InvertedDoc needs to provide sentence boundary information somehow.

We often need to use iterators for scaling purposes in KS/Lucy, but huge docs are problematic for highlighting anyway, so I think we can just go with two i32_t arrays: one each for sentence_offsets and sentence_lengths. In the index, we'd probably store this information as a string of delta-encoded C32s representing offsets from the top of the field, measured in Unicode code points.

‘Delta-encoded’?
Source:

"Best. Joke. Ever."

Search-time:

$inverted_doc->get_sentence_offsets;    # [ 0, 6, 12 ]
$inverted_doc->get_sentence_lengths;    # [ 5, 5, 5 ]

In the index:

0, 5, 1, 5, 1, 5

That preserves your requested sentence numbering information through read time, accessible as the array tick in the sentence_offsets and sentence_lengths arrays.
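If I follow, the encoding step is just this (plain Perl, only to check my understanding of the scheme):

use strict;
use warnings;

my @offsets = ( 0, 6, 12 );
my @lengths = ( 5, 5, 5 );

# Interleave (offset, length) pairs, storing each offset as the delta
# from the end of the previous sentence.
my @deltas;
my $prev_end = 0;
for my $i ( 0 .. $#offsets ) {
    push @deltas, $offsets[$i] - $prev_end, $lengths[$i];
    $prev_end = $offsets[$i] + $lengths[$i];
}
print join( ', ', @deltas ), "\n";    # 0, 5, 1, 5, 1, 5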
Perhaps if each Span were to include a reference to the original Query object which produced it? These would be primitives such as TermQuery and PhraseQuery rather than compound queries like ANDQuery. Would that reference be enough to implement a preference for term diversity in the excerpting algo?
There is one scenario I can think of where that *might* not work. If someone searches for a list of keywords that includes the same keyword twice (e.g., I sometimes copy and paste a sentence to find documents with similar content), then there will be two TermQueries that are identical but considered different.
All Query classes should be implementing the Equals() method so that logically equivalent objects can be identified. Does that address your concern?
Yes.
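So the pasted-duplicate case would collapse to something like this (assuming the current field/term constructor args, and that Equals() is exposed to Perl as equals()):

use KinoSearch::Search::TermQuery;

# Two TermQueries built independently from the same pasted keyword.
my $q1 = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'joke',
);
my $q2 = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'joke',
);

# Logically equivalent, so the highlighter can fold them together.
print "duplicates\n" if $q1->equals($q2);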
We'll probably want to reference the Compiler/Weight rather than the original Query; right now in KS I don't think I have Equals() implemented for any Compiler classes, but that shouldn't be hard to finish. [1]
Maybe this won’t matter because the duplicate term should have extra weight. I haven’t thought this through.
I think the only way we'll nail the extensibility aspect of this design is if we build working implementations for multiple highlighting algorithms. Probably your Summarizer and a class which implements the term-diversity-preferring algo described by Michael Busch and Mike McCandless from LUCENE-1522 would be enough.
I would like to make Summarizer value term diversity, so we’ll be left with one. I could make it an option instead.
And might that information come in handy for other excerpting algos?
As long as the supplied Term/PhraseQuery is the original object, and not a clone, I think it would.
I think you say that because of the equivalence question, right?
Yes.
The KS Highlighter creates its own internal Compiler object using the supplied "searchable" and "query" constructor args. The DocVector/InvertedDoc has to be able to go over the network, but the score spans won't -- so each score span would always be pointing to some sub-component of that local Compiler object.

I'm not entirely satisfied with this approach. The Span class has been simple up till now -- it *could* have been sent over the network with no problem. Bloating it up with a reference to the Query/Compiler makes it both less general and less transportable.
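For reference, my mental model of that setup; "searchable" and "query" are the args you named, while the "field" arg and the surrounding variables are my guesses:

use KinoSearch::Highlight::Highlighter;

# $searcher and $query created elsewhere in the usual way.  The
# Compiler built from these args lives only on this side of the
# wire; each score span would point into it.
my $highlighter = KinoSearch::Highlight::Highlighter->new(
    searchable => $searcher,
    query      => $query,
    field      => 'content',    # assumed arg name
);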
How about $compiler->give_me_the_query_for($span)? (With a better method name, of course.) Or would that make Compiler too complex, since it would have to store a hash (or equivalent) in addition to its array of spans?

But I thought queries could be sent over the network.
Father Chrysostomos