On Jan 15, 2010, at 9:48 AM, Drew Farris wrote:

> On Fri, Jan 15, 2010 at 9:30 AM, Grant Ingersoll <[email protected]> wrote:
>> 
>> Yeah, I've even found that using Java's BreakIterator (there's one for 
>> Sentences and it is supposedly Locale aware) plus some simple edge case 
>> modifications works quite well.  I've got an implementation/demo in Taming 
>> Text, I think, but may also have one lying around somewhere else.
>> 
>> Only tricky thing is you have to buffer the tokens in Lucene, which is 
>> slightly annoying with the new incrementToken API, but not horrible.  Then, 
>> once you find the break, just output a special token.  Maybe also 
>> consider increasing the position increment.
> 
> In this case it sounds like it might be useful to do sentence chunking
> prior to even getting the Analyzer involved. The BreakIterator returns
> offsets which can be used in a substring call to create a StringReader
> which then gets passed to the Analyzer. substring operates on the
> char[] of the original string, so the only overhead would be the
> allocation of the StringReaders.
> 
> E.g., something like
> 
> BreakIterator bi = BreakIterator.getSentenceInstance();
> bi.setText(input);
> int start = bi.first();
> for (int end = bi.next(); end != BreakIterator.DONE;
>      start = end, end = bi.next()) {
>   StringReader r = new StringReader(input.substring(start, end));
>   TokenStream ts = a.tokenStream(null, r);
>   [..generate and collect ngrams here..]
>   r.close();
> }
> 
> On second thought, however, it would probably be more convenient if I
> packaged the sentence boundary detection into the analyzer itself so
> that the behavior can be easily changed by the end user. This would
> include the way in which I use the ShingleFilter to generate n-grams,
> which is currently external to the Analyzer that gets plugged in.

Yeah, I think it makes sense to have a SentenceTokenFilter (as well as a 
ParagraphTokenFilter).  In fact, this would be a welcome contribution to Lucene 
as a new "o.a.l.analysis.boundary" package under the Analyzers (to include 
other boundary detection techniques, such as paragraph, etc.)  Define a common 
set of constants that indicate the boundary and then we can have different 
implementations.  If you really wanted to go nuts, you could create a 
SpanBoundaryQuery class that took in other clauses along w/ the boundary token 
and did a SpanNearQuery within boundaries.  Of course, I don't want to distract 
you from contributing to Mahout, so... 
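
For what it's worth, here's the rough shape of what I mean by a 
SentenceTokenFilter.  This is only a sketch against the newer attribute-based 
TokenStream API (older releases would use TermAttribute rather than 
CharTermAttribute), and the class name, the "_SENT_" marker text, and the idea 
of handing in precomputed BreakIterator offsets are all placeholders, not a 
real design:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/** Sketch: emits a marker token whenever the next real token starts past a
 *  precomputed sentence-boundary offset (e.g. from a BreakIterator pass). */
public final class SentenceBoundaryFilter extends TokenFilter {

  public static final String BOUNDARY = "_SENT_";  // placeholder marker text

  private final int[] boundaries;        // sentence end offsets, ascending
  private int nextBoundary = 0;
  private AttributeSource.State pending; // real token buffered behind the marker

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  public SentenceBoundaryFilter(TokenStream in, int[] boundaries) {
    super(in);
    this.boundaries = boundaries;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {            // second half: emit the buffered token
      restoreState(pending);
      pending = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    if (nextBoundary < boundaries.length
        && offsetAtt.startOffset() >= boundaries[nextBoundary]) {
      nextBoundary++;
      pending = captureState();       // buffer the real token for the next call
      termAtt.setEmpty().append(BOUNDARY);
      posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + 1);
      return true;                    // emit the marker first
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    nextBoundary = 0;
    pending = null;
  }
}

The ShingleFilter can then sit on top of this, and anything that wants 
sentence-bounded n-grams just throws away shingles that contain the marker.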

> 
> Any idea what sort of edge cases I need to look for when using BreakIterator?

Buy the book  :-)...  Just kidding.  The first thing that jumps to mind is that 
it doesn't handle abbreviations very well.  I seem to recall needing fewer than 
10 or so rules to do a pretty decent job.   Never did formal testing on it, though.
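
To make the abbreviation problem concrete, the kind of rule I mean looks 
roughly like the following.  The abbreviation list and the merge heuristic are 
purely illustrative (and deliberately tiny), not the rules I actually used:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {

  // Example-only abbreviation list; a real one needs quite a few more entries.
  private static final List<String> ABBREVS =
      Arrays.asList("Dr.", "Mr.", "Mrs.", "e.g.", "i.e.");

  public static List<String> sentences(String text) {
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(text);
    List<String> result = new ArrayList<String>();
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE;
         start = end, end = bi.next()) {
      String candidate = text.substring(start, end);
      // Rule: if the previous span ended with a known abbreviation, the
      // iterator split too early, so glue this span back onto it.
      if (!result.isEmpty() && endsWithAbbrev(result.get(result.size() - 1))) {
        result.set(result.size() - 1, result.get(result.size() - 1) + candidate);
      } else {
        result.add(candidate);
      }
    }
    return result;
  }

  private static boolean endsWithAbbrev(String s) {
    String t = s.trim();
    for (String a : ABBREVS) {
      if (t.endsWith(a)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Depending on the JDK, the raw iterator will likely break after "Dr.";
    // the merge rule above stitches that span back together.
    for (String s : sentences("Dr. Smith arrived at 9 a.m. He left early.")) {
      System.out.println("[" + s + "]");
    }
  }
}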


> 
> At this point, I'm thinking it is probably worth trying to get
> something self-contained implemented for this relatively
> straightforward need as opposed to pulling in something like OpenNLP
> or Gate.

Right, although it's just slightly ironic that we are using a rule-based system 
for a machine learning project.

-Grant
