On Jan 15, 2010, at 9:48 AM, Drew Farris wrote:

> On Fri, Jan 15, 2010 at 9:30 AM, Grant Ingersoll <[email protected]> wrote:
>>
>> Yeah, I've even found using Java's BreakIterator (there's one for sentences
>> and it is supposedly Locale-aware) plus some simple edge-case modifications
>> does quite well. I've got an implementation/demo in Taming Text, I think,
>> but may also have one lying around somewhere else.
>>
>> The only tricky thing is you have to buffer the tokens in Lucene, which is
>> slightly annoying with the new incrementToken API, but not horrible. Then,
>> once you find the break, just output a special token. Maybe also
>> consider increasing the position increment.
>
> In this case it sounds like it might be useful to do sentence chunking
> prior to even getting the Analyzer involved. The BreakIterator returns
> offsets which can be used in a substring call to create a StringReader
> which then gets passed to the Analyzer. substring operates on the
> char[] of the original string, so the only overhead would be the
> allocation of the StringReaders.
>
> E.g., something like:
>
> int start = breakIterator.first();
> for (int end = breakIterator.next(); end != BreakIterator.DONE;
>      start = end, end = breakIterator.next()) {
>   StringReader r = new StringReader(input.substring(start, end));
>   a.tokenStream(null, r);
>   // ..generate and collect n-grams here..
>   r.close();
> }
>
> On second thought, however, it would probably be more convenient if I
> packaged the sentence boundary detection into the Analyzer itself so
> that the behavior can be easily changed by the end user. This would
> include the way in which I use the ShingleFilter to generate n-grams,
> which is currently external to the Analyzer that gets plugged in.
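[For reference, the loop above fleshed out into a self-contained sketch. Note that BreakIterator has no start()/end() methods; the idiomatic pattern is first()/next() with DONE as the sentinel. The class name and the trim/empty check are my additions, not from the thread.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Split input into sentences using the locale-aware sentence instance.
    // Each returned String could be wrapped in a StringReader and handed
    // to an Analyzer, as Drew describes.
    public static List<String> split(String input, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(input);
        List<String> sentences = new ArrayList<>();
        int start = bi.first();
        // next() returns the offset of the next boundary, or DONE at the end.
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            String sentence = input.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```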
Yeah, I think it makes sense to have a SentenceTokenFilter (as well as a
ParagraphTokenFilter). In fact, this would be a welcome contribution to
Lucene as a new package under the Analyzers, in an
"o.a.l.analysis.boundary" package (to include other boundary detection
techniques, such as paragraph, etc.). Define a common set of constants
that indicate the boundary, and then we can have different
implementations. If you really wanted to go nuts, you could create a
SpanBoundaryQuery class that took in other clauses along w/ the boundary
token and did a SpanNearQuery within boundaries. Of course, I don't want
to distract you from contributing to Mahout, so...

> Any idea what sort of edge cases I need to look for when using BreakIterator?

Buy the book :-)... Just kidding. The first thing that jumps to mind is
that it doesn't handle abbreviations very well. I seem to recall needing
fewer than 10 or so rules to do a pretty decent job. Never did formal
testing on it, though.

> At this point, I'm thinking it is probably worth trying to get
> something self-contained implemented for this relatively
> straightforward need as opposed to pulling in something like OpenNLP
> or Gate.

Right, although it is slightly ironic that we are using a rule-based
system for a machine learning project.

-Grant
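[One possible shape for the handful of abbreviation rules Grant mentions: post-filter the raw BreakIterator boundaries and merge any break that immediately follows a known abbreviation. The class name and the abbreviation list are illustrative guesses, not from the thread or from Taming Text.]

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AbbreviationAwareSplitter {
    // Illustrative list; a real filter would likely carry a few more entries.
    private static final Set<String> ABBREVIATIONS =
        Set.of("Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc.", "vs.");

    public static List<String> split(String input, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(input);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            pending.append(input, start, end);
            String candidate = pending.toString().trim();
            // Rule: suppress a break that lands right after an abbreviation,
            // carrying the fragment forward into the next iteration.
            if (!endsWithAbbreviation(candidate) && !candidate.isEmpty()) {
                sentences.add(candidate);
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            sentences.add(pending.toString().trim());
        }
        return sentences;
    }

    private static boolean endsWithAbbreviation(String s) {
        for (String abbr : ABBREVIATIONS) {
            if (s.endsWith(abbr)) return true;
        }
        return false;
    }
}
```

This stays correct whether or not the underlying BreakIterator actually breaks after "Dr.": if it does, the fragment is merged; if it does not, the candidate already contains the full sentence.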
