On Jan 31, 2011, at 4:35 PM, Jörn Kottmann wrote:

> On 1/31/11 10:25 PM, Grant Ingersoll wrote:
>> Yep.  Tom, Drew and I have a lot of it working already (sentence detection, 
>> NER, others).  I think POS tagging will be useful too.
>> 
> Sounds good. I can help out if you need to port things to 1.5 APIs.
> 
>> On a related note, one of the things we talked about is whether there is any 
>> interest in patches that could make it easier to use Lucene's token stream 
>> (especially the new AttributeSource stuff) instead of re-inventing the wheel 
>> here.
> 
> Not sure what this is about; can you point us to a link or tell us more? Or 
> should we wait until the first code is released, so we can look at it?

From what I can see of the OpenNLP tokenization process, it is fairly basic and 
not all that flexible, and it deals with String arrays instead of a stream, which 
presumably means OpenNLP has a hard time with really large documents.  
(I realize, of course, that we would likely have to buffer tokens, since we can't 
always deal with a single token at a time, but that wouldn't be hard to implement.)
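
To make the streaming idea concrete, here is a rough sketch of consuming tokens 
incrementally while keeping only a small buffered window (this assumes a recent 
Lucene API; exact constructors and package names vary across versions):

import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WindowedConsumer {

  // Consume tokens incrementally, keeping only a small context window in
  // memory instead of materializing the whole document as a String[].
  public static void consume(Reader input, int windowSize) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(input);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

    Deque<String> window = new ArrayDeque<>(windowSize);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      if (window.size() == windowSize) {
        window.removeFirst();          // drop the oldest buffered token
      }
      window.addLast(term.toString()); // buffer the newest token
      // ... hand the current window to whatever needs token context ...
    }
    tokenizer.end();
    tokenizer.close();
  }
}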

Lucene's analysis chain, on the other hand, is quite flexible and pluggable, and 
it supports a wide range of tokenizers and token filters that let one:
a) avoid reinventing the wheel when it comes to hooking in stemmers, etc.
b) mix and match different ways of doing things.
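
For instance, composing a chain is just wrapping filters around a tokenizer, and 
swapping in a different stemmer (or adding a stop filter) is one more wrapper.  A 
sketch, again assuming a recent Lucene API:

import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ChainExample {

  // Tokenizer -> lowercasing -> stemming, built by simple composition.
  public static TokenStream buildChain(Reader input) {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(input);
    TokenStream chain = new LowerCaseFilter(tokenizer);
    return new PorterStemFilter(chain);
  }
}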

I'd add that Lucene also has a pretty rich attribute layer for tokens, which 
may also come in handy in the future for feature creation.  For instance, I 
wonder if the Parse object could just natively work on Lucene's Tokens (really 
called AttributeSources).
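
For example, each token already carries a bundle of attributes (surface form, 
character offsets, token type, etc.) that could feed feature generation.  A small 
sketch against a recent Lucene API:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class AttributeDump {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream ts =
        analyzer.tokenStream("body", new StringReader("Dr. Smith visited Boston in 2011."));

    // Each of these is a view onto the current token's state.
    CharTermAttribute term  = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
    TypeAttribute type      = ts.addAttribute(TypeAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      System.out.printf("%s [%d,%d] %s%n",
          term.toString(), offsets.startOffset(), offsets.endOffset(), type.type());
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}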

> 
>>  Plus, especially with Lucene trunk, things all work off of bytes instead of 
>> chars, so they are a lot faster.
> This one I hear a lot, but truthfully, optimizing the APIs to get content in
> does not speed up our implementation; it is especially the feature generation
> that is slow currently.

Have you profiled it?  I'd be curious to know, especially on large docs.  In 
similar applications, I've seen poor use of String and String concatenation 
become a huge bottleneck.  I'll try to take a deeper look next week.
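
To be clear about the pattern I mean (this is the general anti-pattern, not a 
claim about OpenNLP's actual code):

public class ConcatExample {

  // Repeated String concatenation in a loop re-copies the whole buffer each
  // iteration (quadratic in the number of tokens).
  static String slow(String[] tokens) {
    String s = "";
    for (String t : tokens) {
      s += t + " ";                 // allocates a new String every iteration
    }
    return s;
  }

  // A single StringBuilder grows amortized linearly instead.
  static String fast(String[] tokens) {
    StringBuilder sb = new StringBuilder();
    for (String t : tokens) {
      sb.append(t).append(' ');     // reuses one growing buffer
    }
    return sb.toString();
  }
}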

Short term, it might be interesting to implement the Lucene analysis stuff as 
an OpenNLP Tokenizer.
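
Something along these lines, perhaps (just a sketch I haven't tried; it assumes 
OpenNLP's 1.5 Tokenizer interface and a recent Lucene tokenizer):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Expose a Lucene tokenizer behind OpenNLP's Tokenizer interface, so
// downstream OpenNLP components see the usual String[] / Span[] while
// Lucene does the actual tokenization.
public class LuceneBackedTokenizer implements Tokenizer {

  @Override
  public Span[] tokenizePos(String text) {
    List<Span> spans = new ArrayList<>();
    try (StandardTokenizer tokenizer = new StandardTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        spans.add(new Span(offsets.startOffset(), offsets.endOffset()));
      }
      tokenizer.end();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return spans.toArray(new Span[0]);
  }

  @Override
  public String[] tokenize(String text) {
    Span[] spans = tokenizePos(text);
    String[] tokens = new String[spans.length];
    for (int i = 0; i < spans.length; i++) {
      tokens[i] = text.substring(spans[i].getStart(), spans[i].getEnd());
    }
    return tokens;
  }
}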


> 
> I also have practical questions.
> How do you deal with different encodings then? Do users have to pass
> in everything as UTF-8 or convert everything to one encoding?
> How do you implement features like string to lowercase?

Have a closer look at the analysis stuff.  It's all taken care of.
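
Roughly: the encoding question is answered at the Reader (bytes are decoded to 
chars before analysis ever sees them), and lowercasing is just another filter in 
the chain.  A sketch against a recent Lucene API:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class EncodingExample {

  public static TokenStream open(String path, String encoding) throws Exception {
    // The source encoding is handled here, when bytes become chars.
    Reader reader = new InputStreamReader(new FileInputStream(path), Charset.forName(encoding));

    // Lowercasing is a TokenFilter in the chain, not a preprocessing step.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(reader);
    return new LowerCaseFilter(tokenizer);
  }
}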

-Grant
