On 19/01/2012 03:28, Marvin Humphrey wrote:
Phase 3 can be implemented several different ways.  It *could* reuse the
original tokenization algo on its own, but that would produce sub-standard
results because Lucy's tokenization algos are generally concerned with words
rather than sentences, and excerpts chosen on word boundaries alone don't look
very good.

You're right. I was only talking about Phase 3.

Such an approach wouldn't depend on the analyzer at all, and it wouldn't
introduce additional coupling between Lucy's components.

Not sure what I'm missing, but I don't understand the "coupling" concern.  It
seems to me as though it would be desirable code re-use to wrap our sentence
boundary detection mechanism within a battle-tested design like Analyzer,
rather than do something ad-hoc.

The analyzers are designed to split a whole string into tokens. In the highlighter, we only need to find a single boundary near a certain position in a string, so the analyzer interface isn't an ideal fit. The performance hit of running a tokenizer over the whole substring shouldn't be a problem, but I'd still like to consider alternatives.
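
To make that concrete, here's roughly the kind of localized scan I have in mind. This is a throwaway sketch: the function name is invented, it only looks at ASCII punctuation, and a real version would apply proper Unicode sentence-break rules -- but it shows how little of the string actually needs to be examined compared to tokenizing the whole thing:

    #include <stddef.h>
    #include <ctype.h>

    /* Sketch only: scan backward from `offset` and return the position of
     * the first character after the nearest sentence terminator, or 0 if
     * no terminator is found before the start of the text. */
    static size_t
    find_sentence_start(const char *text, size_t len, size_t offset) {
        if (offset > len) { offset = len; }
        for (size_t i = offset; i > 0; i--) {
            char c = text[i - 1];
            if ((c == '.' || c == '!' || c == '?')
                && i < len && isspace((unsigned char)text[i])) {
                size_t start = i;
                /* Step past the whitespace that follows the terminator. */
                while (start < len && isspace((unsigned char)text[start])) {
                    start++;
                }
                return start;
            }
        }
        return 0;
    }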

I'm actually very excited about getting all that sentence boundary detection
stuff out of Highlighter.c, which will become much easier to grok and maintain
as a result.  Separation of concerns FTW!

We could also move the boundary detection to a string utility class.
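
Purely as an illustration of the shape such a utility could take -- the "StrUtil" name and these signatures are made up, not existing Lucy API:

    #include <stddef.h>

    /* Hypothetical utility: both take UTF-8 text plus a byte offset and
     * return the nearest sentence boundary at or before / at or after
     * that offset. */
    size_t
    StrUtil_sentence_start(const char *utf8, size_t len, size_t offset);

    size_t
    StrUtil_sentence_end(const char *utf8, size_t len, size_t offset);

The Highlighter would then only depend on this small utility rather than instantiating an Analyzer.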

Of course, it would mean implementing a separate Unicode-capable word-breaking
algorithm for the highlighter. But that shouldn't be very hard, as we could
reuse parts of the StandardTokenizer.

IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
It looks much better if you trim excerpts at sentence boundaries, and
word-break algos don't get you those.

I would keep the sentence boundary detection, of course. I'm only talking about the word-breaking part.
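
For that word-break fallback, something along these lines would be enough. Again, just a sketch with invented names: the real classification would come from the same Unicode word-break tables the StandardTokenizer already carries, not the ASCII stand-in below:

    #include <stdbool.h>
    #include <stdint.h>

    /* ASCII stand-in for the word-break classification that the
     * StandardTokenizer's Unicode tables would supply. */
    static bool
    is_word_char(int32_t cp) {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9');
    }

    /* A position between two code points is a usable break point when the
     * characters on either side aren't both word characters.  The full
     * UAX #29 rules also cover mid-word punctuation, extenders, etc. */
    static bool
    is_word_boundary(int32_t before, int32_t after) {
        return !(is_word_char(before) && is_word_char(after));
    }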

Nick
