On Jun 6, 2009, at 9:22 AM, Marvin Humphrey wrote:
On Fri, Jun 05, 2009 at 02:42:46PM -0700, Father Chrysostomos wrote:
First, a bit of good news: I've managed to fix the current KS Highlighter sentence-boundary trimming implementation without needing to start over from scratch, and without causing any problems for the KSx::Highlight::Summarizer test suite. That means we don't have to conclude this discussion and finish the implementation to unblock a KS dev release. (: For better or worse. :)
I don’t know whether you are aware: I cheated and copied & pasted find_sentence_boundaries from KS r3122 to KSx:H:S, since I was in a hurry.
Would you extend the Analysis interface to allow for custom sentence algorithms?
Since this is a tokenization task, Analyzer would be a logical place to turn. I think we'll need to make two passes over the text, one for search tokens and one for sentences.

Do we actually need to extend Analyzer, though? I think we ought to avoid giving Analyzer a Find_Sentences() method. Instead, we can just create an Analyzer instance which tokenizes at sentence boundaries. Probably we'll want to create a dedicated SentenceTokenizer subclass, which would not be publicly exposed.
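Something like this, I take it? A minimal sketch of such a subclass; the transform_text() override and the Token/Inversion constructor args are my guesses at the current dev API, not settled names:

package SentenceTokenizer;
use base qw( KinoSearch::Analysis::Analyzer );

# Tokenize at naive sentence boundaries, keeping only the offsets
# that highlighting cares about.  (API names are assumptions.)
sub transform_text {
    my ( $self, $text ) = @_;
    my $inversion = KinoSearch::Analysis::Inversion->new;
    while ( $text =~ /(\S.*?\.)(?=\s|\z)/gs ) {
        $inversion->append(
            KinoSearch::Analysis::Token->new(
                text         => $1,
                start_offset => $-[1],
                end_offset   => $+[1],
            )
        );
    }
    return $inversion;
}

1;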
I’ve just had an idea: since we have 1) words, 2) sentences, and 3) pages, why not multiple levels of vector information? Or multiple ‘sets’ (which could be orthogonal/overlapping)? Someone may want to include paragraphs or chapters, for instance. Just a thought....
Instead, we can turn TermVectorsWriter into a public HighlightWriter class and give it a Set_Sentence_Tokenizer() method. Extensibility would happen via Architecture:
package MyArchitecture;
use base qw( KinoSearch::Architecture );

sub register_highlight_writer {
    my ( $self, $seg_writer ) = @_;
    $self->SUPER::register_highlight_writer($seg_writer);
    my $hl_writer
        = $seg_writer->obtain("KinoSearch::Index::HighlightWriter");
    $hl_writer->set_sentence_tokenizer( MySentenceTokenizer->new );
}
Or maybe $hl_writer->add_tokenizer( MySentenceTokenizer->new )? We may need to distinguish between ‘offset tokenisers’ and ‘term tokenisers’.
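And presumably the custom Architecture gets hooked in through a Schema subclass; a rough sketch, assuming Schema still exposes an overridable architecture() method (that spelling is my guess):

package MySchema;
use base qw( KinoSearch::Schema );

# Hand the indexer the Architecture that registers our custom
# sentence tokenizer.  (The architecture() hook is an assumption.)
sub architecture { MyArchitecture->new }

1;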
I think this approach will work provided that it's possible to use the same sentence boundary detection algo across most or all of the languages supported by Snowball. (Does the basic algo of splitting on /\.\s+/ work for Greek?)
Yes, except for the same problem that it causes in English: ‘M. Humphrey’ becomes two sentences. (As an aside, your default tokeniser doesn’t work with Greek, which can have mid-commas, but the only two words with mid-commas [ὅ,τι and ὅ,τιδηποτε] are stop-words, so I don’t worry about it.)
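To illustrate the abbreviation problem in plain Perl (nothing KS-specific here):

use strict;
use warnings;

# The naive rule treats the period after an initial as a sentence end.
my @sentences = split /\.\s+/, 'M. Humphrey wrote the patch. It works.';
print scalar @sentences, "\n";    # 3, not 2 -- 'M' comes out as a sentence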
CJK users and others for whom our algo would fail would need to spec a custom Architecture -- though only if they want highlighting, since it's off by default. It's a bit more work for that class of user, but it prevents us from having to add clutter to the crucial core classes of Analyzer and Schema.
It will be somewhat wasteful if we use this SentenceTokenizer class to create full-fledged tokens when all we need is offsets, but I think we would handle further optimizations via natural extensions to either Analyzer or Inversion. I say "natural", because we would be merely repurposing the same offset information that Tokenizer normally feeds to Token's constructor, as opposed to glomming on a Find_Sentences() method which would apply a completely different tokenizing algorithm.
Sounds good.
Could the sentences be numbered, so the final fragment has information about *which* sentence it came from? (I could use this for pagination.)
I think that would work. The current "DocVector" class needs to mutate into "InvertedDoc" or something like that, and InvertedDoc needs to provide sentence boundary information somehow.

We often need to use iterators for scaling purposes in KS/Lucy, but huge docs are problematic for highlighting anyway, so I think we can just go with two i32_t arrays: one each for sentence_offsets and sentence_lengths. In the index, we'd probably store this information as a string of delta-encoded C32s representing offsets from the top of the field, measured in Unicode code points.

‘Delta-encoded’?
Source:

"Best. Joke. Ever."

Search-time:

$inverted_doc->get_sentence_offsets;    # [ 0, 6, 12 ]
$inverted_doc->get_sentence_lengths;    # [ 5, 5, 5 ]

In the index:

0, 5, 1, 5, 1, 5

That preserves your requested sentence numbering information through read time, accessible as the array tick in the sentence_offsets and sentence_lengths arrays.
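If I follow, the encoding step is just this (plain Perl, only to check my understanding of the scheme):

use strict;
use warnings;

my @offsets = ( 0, 6, 12 );
my @lengths = ( 5, 5, 5 );

# Interleave (offset, length) pairs, storing each offset as the delta
# from the end of the previous sentence.
my @deltas;
my $prev_end = 0;
for my $i ( 0 .. $#offsets ) {
    push @deltas, $offsets[$i] - $prev_end, $lengths[$i];
    $prev_end = $offsets[$i] + $lengths[$i];
}
print join( ', ', @deltas ), "\n";    # 0, 5, 1, 5, 1, 5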
Perhaps if each Span were to include a reference to the original Query object which produced it? These would be primitives such as TermQuery and PhraseQuery rather than compound queries like ANDQuery. Would that reference be enough to implement a preference for term diversity in the excerpting algo?
There is one scenario I can think of where that *might* not work. If someone searches for a list of keywords that includes the same keyword twice (e.g., I sometimes copy and paste a sentence to find documents with similar content), then there will be two TermQueries that are identical but considered different.
All Query classes should be implementing the Equals() method so that logically equivalent objects can be identified. Does that address your concern?
Yes.
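So the pasted-duplicate case would collapse to something like this (assuming the current field/term constructor args, and that Equals() is exposed to Perl as equals()):

use KinoSearch::Search::TermQuery;

# Two TermQueries built independently from the same pasted keyword.
my $q1 = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'joke',
);
my $q2 = KinoSearch::Search::TermQuery->new(
    field => 'content',
    term  => 'joke',
);

# Logically equivalent, so the highlighter can fold them together.
print "duplicates\n" if $q1->equals($q2);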
We'll probably want to reference the Compiler/Weight rather than the original Query; right now in KS I don't think I have Equals() implemented for any Compiler classes, but that shouldn't be hard to finish. [1]
Maybe this won’t matter because the duplicate term should have extra weight. I haven’t thought this through.
I think the only way we'll nail the extensibility aspect of this design is if we build working implementations for multiple highlighting algorithms. Probably your Summarizer and a class which implements the term-diversity-preferring algo described by Michael Busch and Mike McCandless from LUCENE-1522 would be enough.
I would like to make Summarizer value term diversity, so we’ll be left with one. I could make it an option instead.
And might that information come in handy for other excerpting algos?
As long as the supplied Term/PhraseQuery is the original object, and not a clone, I think it would.
I think you say that because of the equivalence question, right?
Yes.
The KS Highlighter creates its own internal Compiler object using the supplied "searchable" and "query" constructor args. The DocVector/InvertedDoc has to be able to go over the network, but the score spans won't -- so each score span would always be pointing to some sub-component of that local Compiler object.

I'm not entirely satisfied with this approach. The Span class has been simple up till now -- it *could* have been sent over the network with no problem. Bloating it up with a reference to the Query/Compiler makes it both less general and less transportable.
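For reference, my mental model of that setup; "searchable" and "query" are the args you named, while the "field" arg and the surrounding variables are my guesses:

use KinoSearch::Highlight::Highlighter;

# $searcher and $query created elsewhere in the usual way.  The
# Compiler built from these args lives only on this side of the
# wire; each score span would point into it.
my $highlighter = KinoSearch::Highlight::Highlighter->new(
    searchable => $searcher,
    query      => $query,
    field      => 'content',    # assumed arg name
);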
How about $compiler->give_me_the_query_for($span)? (With a better method name, of course.) Or would that make Compiler too complex, since it would have to store a hash (or equivalent) in addition to its array of spans?

But I thought queries could be sent over the network.
Father Chrysostomos