On Jan 31, 2011, at 4:35 PM, Jörn Kottmann wrote:

> On 1/31/11 10:25 PM, Grant Ingersoll wrote:
>> Yep. Tom, Drew and I have a lot of it working already (sentence detection,
>> NER, others). I think POS tagging will be useful too.
>>
> Sounds good. I can help out if you need to port things to 1.5 APIs.
>
>> On a related note, one of the things we talked about is if there is any
>> interest in patches that can make it easier to use Lucene's token stream
>> (especially the new AttributeSource stuff) instead of re-inventing the wheel
>> here.
>
> Not sure what this is about, can you point us to a link or tell more? Or
> should we wait until the first code is released, so we can look at it?
From what I can see of the OpenNLP tokenization process, it is fairly basic,
not all that flexible, and deals with String arrays instead of a stream, which
presumably means OpenNLP has a hard time dealing with really large documents.
(I realize, of course, we likely would have to buffer tokens since we can't
always deal with a single token at a time, but that wouldn't be that hard to
implement.) Lucene's analysis chain, on the other hand, is quite flexible and
pluggable and has support for a wide range of tokenizers and token filters
that allow one to:

a) not reinvent the wheel when it comes to hooking in stemmers, etc., and
b) mix and match different ways of doing things.

I'd add that Lucene also has a pretty rich attribute layer for tokens, which
may also come in handy in the future for feature creation. For instance, I
wonder if the Parse object could just natively work on Lucene's Tokens (really
called AttributeSources).

>
>> Plus, especially with Lucene trunk, things all work off of bytes instead of
>> chars, so they are a lot faster.
>
> This one I hear a lot, but truthfully optimizing the APIs to get content in
> does not speed up our implementation; especially the feature generation
> is slow currently.

Have you profiled it? I'd be curious to know, especially on large docs. In
similar applications, I've seen poor use of String and String concatenation be
a huge bottleneck. I'll try to take a deeper look next week. Short term, it
might be interesting to implement the Lucene analysis stuff as an OpenNLP
Tokenizer.

>
> I also have practical questions.
> How do you deal with different encodings then? Do users have to pass
> in everything as UTF-8 or convert everything to one encoding?
> How do you implement features like string to lowercase?

Have a look at the analysis stuff more. It's all taken care of.

-Grant
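
P.S. To make the "Lucene analysis as an OpenNLP Tokenizer" idea a bit more
concrete, here is a rough, untested sketch of an adapter that runs whatever
Analyzer you hand it and exposes OpenNLP's Tokenizer interface. The class
name, the dummy field name, and the exception handling are just placeholders
I made up, and the attribute API details may differ a bit depending on which
Lucene version you're on. The nice part is that because the term text comes
out of the analysis chain, things like lowercasing fall out for free:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

/** Sketch: run a Lucene analysis chain behind OpenNLP's Tokenizer interface. */
public class LuceneTokenizerAdapter implements Tokenizer {

  private final Analyzer analyzer;

  public LuceneTokenizerAdapter(Analyzer analyzer) {
    this.analyzer = analyzer;
  }

  public String[] tokenize(String text) {
    List<String> tokens = new ArrayList<String>();
    try {
      // field name is arbitrary here; we only care about the token stream
      TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // term text already reflects whatever filters (lowercasing, etc.) are in the chain
        tokens.add(term.toString());
      }
      stream.end();
      stream.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return tokens.toArray(new String[tokens.size()]);
  }

  public Span[] tokenizePos(String text) {
    List<Span> spans = new ArrayList<Span>();
    try {
      TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
      OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // character offsets let OpenNLP components map tokens back to the source text
        spans.add(new Span(offsets.startOffset(), offsets.endOffset()));
      }
      stream.end();
      stream.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return spans.toArray(new Span[spans.size()]);
  }
}

Then something like new LuceneTokenizerAdapter(someAnalyzer) could be dropped
in anywhere OpenNLP expects a Tokenizer, and the choice of tokenizer, filters,
stemmers, etc. becomes a matter of configuring the Analyzer.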
