On Fri, Jan 15, 2010 at 10:00 AM, Grant Ingersoll <[email protected]> wrote:
>
> Yeah, I think it makes sense to have a SentenceTokenFilter (as well as 
> ParagraphTokenFilter).  In fact, this would be a welcome contribution to 
> Lucene as a new package under the Analyzers in a "o.a.l.analysis.boundary" 
> package (to include other boundary detection techniques, such as paragraph, 
> etc.)  Define a common set of constants that indicate the boundary and then 
> we can have different implementations.  If you really wanted to go nuts, you 
> could create a SpanBoundaryQuery classes that took in other clauses along w/ 
> the boundary token and did a SpanNearQuery within boundaries.  Of course, I 
> don't want to distract you from contributing to Mahout, so...

Ok, thanks for the pointer and roughing out an approach. I'll look
into a SentenceTokenFilter and see where that takes me.
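For my own notes, here's the rough shape of the boundary-token idea in plain Java (the class name, method, and `_SB_` marker are all my own invention for illustration, not actual Lucene API): split on sentence boundaries with the JDK's BreakIterator and emit a synthetic boundary token between sentences, which is what a SpanNearQuery could later use to stay inside one sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only -- not real Lucene API. The real thing would be a
// TokenFilter, but the core move is the same: interleave a shared boundary
// constant into the token stream at each sentence break.
public class BoundaryTokens {
    // A common constant marking the boundary, per Grant's suggestion.
    static final String SENT_BOUNDARY = "_SB_";

    static List<String> tokensWithBoundaries(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator sentIt = BreakIterator.getSentenceInstance(Locale.US);
        sentIt.setText(text);
        int start = sentIt.first();
        for (int end = sentIt.next(); end != BreakIterator.DONE;
                start = end, end = sentIt.next()) {
            // Naive whitespace tokenization within each sentence, just for the sketch.
            for (String tok : text.substring(start, end).trim().split("\\s+")) {
                if (!tok.isEmpty()) out.add(tok);
            }
            out.add(SENT_BOUNDARY);  // mark where the sentence ended
        }
        return out;
    }
}
```

A span query could then be rejected (or heavily penalized) whenever its span crosses a `_SB_` position.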

>>
>> Any idea what sort of edge cases I need to look for when using BreakIterator?
>
> Buy the book  :-)...  Just kidding, it doesn't handle abbreviations very 
> well, is the first thing that jumps to mind.  I seem to recall needing less 
> than 10 or so rules to do a pretty decent job.   Never did formal testing on 
> it, though.

Ok, OK :-)

I've found abbreviations, various identifiers, etc. are a typical case
where these things fall flat. I'll see how it performs versus writing
something from scratch and see what I can come up with.
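To make the abbreviation problem concrete, here's a small demo (class name mine, just the stock JDK BreakIterator): a period followed by a capitalized word typically triggers a sentence break, so "Dr. Smith" tends to get split in the wrong place.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AbbrevDemo {
    // Split text into sentences using the JDK's default BreakIterator rules.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // "Dr." is an abbreviation, not a sentence end, but since the next
        // word is capitalized the default rules typically break after it,
        // yielding three pieces where there should be two.
        for (String s : sentences("Dr. Smith arrived late. He apologized.")) {
            System.out.println(s);
        }
    }
}
```

Handling this well presumably means layering a small abbreviation list (or the handful of rules Grant mentions) on top of the default iterator.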

> Right, although just slightly ironic that we are using a rule-based system 
> for a machine learning project.

Heh, indeed, but it seems entirely appropriate in this case. Of
course, now I need to go read about statistical approaches to sentence
boundary detection.

Drew
