On Thu, Jan 19, 2012 at 11:43:59AM +0100, Nick Wellnhofer wrote:

>> Not sure what I'm missing, but I don't understand the "coupling" concern.
>> It seems to me as though it would be desirable code re-use to wrap our
>> sentence boundary detection mechanism within a battle-tested design like
>> Analyzer, rather than do something ad-hoc.
>
> The analyzers are designed to split a whole string into tokens. In the
> highlighter we only need to find a single boundary near a certain
> position in a string. So the analyzer interface isn't an ideal fit for
> the highlighter. The performance hit of running a tokenizer over the
> whole substring shouldn't be a problem but I'd still like to consider
> alternatives.
It's rare that we need to optimize for performance. Most of the time we
should be optimizing for maintainability. I'm advocating using Analyzer
because we have several of them, and because the parallelism between
StandardTokenizer and a StandardSentenceTokenizer based on UAX #29 would
lower the cost of maintaining them. However, that's only one way to
optimize for maintainability, and it's not necessarily the best available
stratagem. It may be that low-level code leveraging an Analyzer is
verbose... or not... we'd just have to try.

>> I'm actually very excited about getting all that sentence boundary
>> detection stuff out of Highlighter.c, which will become much easier to
>> grok and maintain as a result. Separation of concerns FTW!
>
> We could also move the boundary detection to a string utility class.

I suspect that at some point we will want to expose sentence boundary
detection via a public API, because people who subclass Highlighter may
want to use it. Father Chrysostomos did when he wrote
KSx::Highlight::Summarizer. (The old KinoSearch Highlighter exposed a
find_sentences() method at one point. It was a victim of the C rewrite;
Highlighter was one of the harder modules to port.)

It seems to me that publishing UAX #29 sentence boundary detection via an
Analyzer is a conservative API extension, since it's so closely related to
the UAX #29 word boundary detection we expose via StandardTokenizer. So
that explains what I was thinking. But of course refactoring sentence
boundary detection into a string utility function also achieves the end of
cleaning up Highlighter.c just as effectively, and might be more elegant
-- who knows? Until we actually expose this capability via a public API,
either approach should work fine.

>>> Of course, it would mean to implement a separate Unicode-capable word
>>> breaking algorithm for the highlighter. But this shouldn't be very
>>> hard as we could reuse parts of the StandardTokenizer.
>>
>> IMO, a word-breaking algo doesn't suffice for choosing excerpt
>> boundaries. It looks much better if you trim excerpts at sentence
>> boundaries, and word-break algos don't get you those.
>
> I would keep the sentence boundary detection, of course. I'm only
> talking about the word breaking part.

Groovy, sounds like we're on the same page about that then. :)

Marvin Humphrey
