On 19/01/2012 03:28, Marvin Humphrey wrote:
> Phase 3 can be implemented several different ways.  It *could* reuse the
> original tokenization algo on its own, but that would produce sub-standard
> results because Lucy's tokenization algos are generally concerned with
> words rather than sentences, and excerpts chosen on word boundaries alone
> don't look very good.
You're right. I was only talking about Phase 3.
>> Such an approach wouldn't depend on the analyzer at all and it wouldn't
>> introduce additional coupling of Lucy's components.
> Not sure what I'm missing, but I don't understand the "coupling" concern.
> It seems to me as though it would be desirable code re-use to wrap our
> sentence boundary detection mechanism within a battle-tested design like
> Analyzer, rather than do something ad-hoc.
The analyzers are designed to split a whole string into tokens.  In the
highlighter we only need to find a single boundary near a certain position
in a string, so the analyzer interface isn't an ideal fit for the
highlighter.  The performance hit of running a tokenizer over the whole
substring shouldn't be a problem, but I'd still like to consider
alternatives.
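
To sketch what I mean, the highlighter really only needs a small utility
along these lines.  This is a minimal sketch, not a proposal for the actual
code: the name is made up, and the '.'/'!'/'?' test is just a stand-in for
proper Unicode sentence-break rules.

    #include <stddef.h>

    /* Hypothetical utility: find the offset where the sentence containing
     * `offset` starts, scanning backwards from `offset` instead of
     * tokenizing the whole field. */
    static size_t
    S_find_sentence_start(const char *text, size_t len, size_t offset) {
        if (offset > len) { offset = len; }
        while (offset > 0) {
            char c = text[offset - 1];
            if (c == '.' || c == '!' || c == '?') {
                /* Step past whitespace following the terminator. */
                while (offset < len && text[offset] == ' ') { offset++; }
                return offset;
            }
            offset--;
        }
        return 0;  /* No boundary found -- fall back to start of field. */
    }

The point is only that such a helper scans near the requested position
rather than producing a full token stream for the whole field.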
> I'm actually very excited about getting all that sentence boundary
> detection stuff out of Highlighter.c, which will become much easier to
> grok and maintain as a result.  Separation of concerns FTW!
We could also move the boundary detection to a string utility class.
>> Of course, it would mean implementing a separate Unicode-capable
>> word-breaking algorithm for the highlighter.  But this shouldn't be very
>> hard, as we could reuse parts of the StandardTokenizer.
> IMO, a word-breaking algo doesn't suffice for choosing excerpt boundaries.
> It looks much better if you trim excerpts at sentence boundaries, and
> word-break algos don't get you those.
I would keep the sentence boundary detection, of course. I'm only
talking about the word breaking part.
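
Roughly, I picture the two fitting together like this.  Again just a sketch
with made-up names: S_find_sentence_start is the helper sketched above, and
the space-scanning stand-in below is where the word-break rules reused from
StandardTokenizer would actually go.

    /* Stand-in word-break finder: scan forward to the next space.  The
     * real version would reuse the Unicode word-break rules from
     * StandardTokenizer. */
    static size_t
    S_find_word_break(const char *text, size_t len, size_t offset) {
        while (offset < len && text[offset] != ' ') { offset++; }
        while (offset < len && text[offset] == ' ') { offset++; }
        return offset;
    }

    /* Prefer a sentence boundary for the excerpt start; if the sentence
     * would make the excerpt too long, back off to a word break so we at
     * least never cut a word in half. */
    static size_t
    S_choose_excerpt_start(const char *text, size_t len, size_t hit_offset,
                           size_t max_excerpt_len) {
        size_t start = S_find_sentence_start(text, len, hit_offset);
        if (hit_offset - start > max_excerpt_len) {
            start = S_find_word_break(text, len,
                                      hit_offset - max_excerpt_len);
        }
        return start;
    }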
Nick