On Fri, Jan 15, 2010 at 10:00 AM, Grant Ingersoll <[email protected]> wrote:
>
> Yeah, I think it makes sense to have a SentenceTokenFilter (as well as
> a ParagraphTokenFilter). In fact, this would be a welcome contribution to
> Lucene as a new package under the Analyzers in an "o.a.l.analysis.boundary"
> package (to include other boundary detection techniques, such as paragraph,
> etc.). Define a common set of constants that indicate the boundary, and then
> we can have different implementations. If you really wanted to go nuts, you
> could create a SpanBoundaryQuery class that took in other clauses along with
> the boundary token and did a SpanNearQuery within boundaries. Of course, I
> don't want to distract you from contributing to Mahout, so...
OK, thanks for the pointer and for roughing out an approach. I'll look into a
SentenceTokenFilter and see where that takes me.

>> Any idea what sort of edge cases I need to look for when using
>> BreakIterator?
>
> Buy the book :-)... Just kidding. The first thing that jumps to mind is
> that it doesn't handle abbreviations very well. I seem to recall needing
> fewer than 10 or so rules to do a pretty decent job. Never did formal
> testing on it, though.

OK :-) I've found that abbreviations, various identifiers, etc. are the
typical cases where these things fall flat. I'll see how it performs versus
writing something from scratch and see what I can come up with.

> Right, although it's just slightly ironic that we are using a rule-based
> system for a machine learning project.

Heh, indeed, but it seems entirely appropriate in this case. Of course, now
I need to go read about statistical approaches to sentence boundary
detection.

Drew
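For anyone following the thread: a minimal sketch (not Lucene code, just plain
JDK) of splitting text into sentences with java.text.BreakIterator, which is
the starting point being discussed. The class and method names here are
illustrative, not part of any proposed Lucene API. Abbreviations like "Dr."
are exactly where the locale rules can mis-fire, as Grant notes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Lucene: split text into sentences
// using the JDK's locale-aware sentence BreakIterator.
public class SentenceSplitDemo {

    // Returns the sentences of `text` as detected by BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk boundary to boundary; DONE signals the end of the text.
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Clean punctuation splits reliably...
        for (String s : sentences("Hello world. How are you? Fine.")) {
            System.out.println(s);
        }
        // ...but abbreviations are the weak spot: depending on the JDK's
        // locale data, "Dr." may or may not be treated as a sentence end.
        for (String s : sentences("Dr. Smith arrived. He sat down.")) {
            System.out.println(s);
        }
    }
}
```

A TokenFilter version would do the same walk over the token stream's
underlying text and emit a boundary token (or set an attribute) at each
break, rather than producing substrings.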
