On 19/01/2012 03:28, Marvin Humphrey wrote:
> Phase 3 can be implemented several different ways.  It *could* reuse the
> original tokenization algo on its own, but that would produce sub-standard
> results because Lucy's tokenization algos are generally concerned with
> words rather than sentences, and excerpts chosen on word boundaries alone
> don't look very good.
You're right. I was only talking about Phase 3.
>> Such an approach wouldn't depend on the analyzer at all and it wouldn't
>> introduce additional coupling of Lucy's components.
> Not sure what I'm missing, but I don't understand the "coupling" concern.
> It seems to me as though it would be desirable code re-use to wrap our
> sentence boundary detection mechanism within a battle-tested design like
> Analyzer, rather than do something ad-hoc.
The analyzers are designed to split a whole string into tokens.  In the
highlighter we only need to find a single boundary near a certain position
in a string, so the analyzer interface isn't an ideal fit for the
highlighter.  The performance hit of running a tokenizer over the whole
substring shouldn't be a problem, but I'd still like to consider
alternatives.
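
To sketch what I mean, the highlighter really only needs a small utility
along these lines.  This is a minimal sketch, not a proposal for the actual
code: the name is made up, and the '.'/'!'/'?' test is just a stand-in for
proper Unicode sentence-break rules.

    #include <stddef.h>

    /* Hypothetical utility: find the offset where the sentence containing
     * `offset` starts, scanning backwards from `offset` instead of
     * tokenizing the whole field. */
    static size_t
    S_find_sentence_start(const char *text, size_t len, size_t offset) {
        if (offset > len) { offset = len; }
        while (offset > 0) {
            char c = text[offset - 1];
            if (c == '.' || c == '!' || c == '?') {
                /* Step past whitespace following the terminator. */
                while (offset < len && text[offset] == ' ') { offset++; }
                return offset;
            }
            offset--;
        }
        return 0;  /* No boundary found -- fall back to start of field. */
    }

The point is only that such a helper scans near the requested position
rather than producing a full token stream for the whole field.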
> I'm actually very excited about getting all that sentence boundary
> detection stuff out of Highlighter.c, which will become much easier to
> grok and maintain as a result.  Separation of concerns FTW!
We could also move the boundary detection to a string utility class.
>> Of course, it would mean implementing a separate Unicode-capable
>> word-breaking algorithm for the highlighter.  But this shouldn't be very
>> hard, as we could reuse parts of the StandardTokenizer.
> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
> It looks much better if you trim excerpts at sentence boundaries, and
> word-break algos don't get you those.
I would keep the sentence boundary detection, of course. I'm only
talking about the word breaking part.
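
Roughly, I picture the two fitting together like this.  Again just a sketch
with made-up names: S_find_sentence_start is the helper sketched above, and
the space-scanning stand-in below is where the word-break rules reused from
StandardTokenizer would actually go.

    /* Stand-in word-break finder: scan forward to the next space.  The
     * real version would reuse the Unicode word-break rules from
     * StandardTokenizer. */
    static size_t
    S_find_word_break(const char *text, size_t len, size_t offset) {
        while (offset < len && text[offset] != ' ') { offset++; }
        while (offset < len && text[offset] == ' ') { offset++; }
        return offset;
    }

    /* Prefer a sentence boundary for the excerpt start; if the sentence
     * would make the excerpt too long, back off to a word break so we at
     * least never cut a word in half. */
    static size_t
    S_choose_excerpt_start(const char *text, size_t len, size_t hit_offset,
                           size_t max_excerpt_len) {
        size_t start = S_find_sentence_start(text, len, hit_offset);
        if (hit_offset - start > max_excerpt_len) {
            start = S_find_word_break(text, len,
                                      hit_offset - max_excerpt_len);
        }
        return start;
    }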
Nick