On Jan 5, 2011, at 12:22 PM, Jörn Kottmann wrote:

> On 1/5/11 4:44 PM, Olivier Grisel wrote:
>> 2011/1/5 Jason Baldridge<[email protected]>:
>>> This looks great, and it aligns with my own recent interest in large scale
>>> NLP with Hadoop, including working with Wikipedia. I'll look at it more
>>> closely later, but in principle I would be interested in having this brought
>>> into the OpenNLP project in some way!
>> Thanks for your interest. Don't hesitate to fork the repo on github to
>> experiment with your own design ideas. OpenNLP methods often handle
>> String[][] and Span[] data-structures where span start and end index
>> either refer to char positions or token indices. It might be
>> interesting to make some generic wrappers for those data-structures from
>> / to pig tuples by taking care of not reallocating memory when not
>> necessary.
>>
>
> Making OpenNLP faster is always nice, I believe we should one day
> go away from String and use CharSequence instead, because that usually

In Lucene, we recently converted to an all byte-based representation, and the speedup has been pretty significant.

-Grant
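
For illustration only, here is a minimal sketch of the CharSequence idea discussed above: exposing a char-offset span as a view over the original text instead of allocating a new String per span. The names `SpanViewSketch`, `CharSpan`, and `view` are hypothetical and are not part of OpenNLP or Lucene; the only library call used is the standard `java.nio.CharBuffer.wrap(CharSequence, int, int)`. It does not show the byte-based approach mentioned for Lucene.

```java
import java.nio.CharBuffer;

public class SpanViewSketch {

    // Hypothetical stand-in for a char-offset span; not OpenNLP's own Span class.
    static final class CharSpan {
        final int start; // inclusive char offset
        final int end;   // exclusive char offset

        CharSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns a read-only view over text[start, end) without copying characters.
    static CharSequence view(CharSequence text, CharSpan span) {
        return CharBuffer.wrap(text, span.start, span.end);
    }

    public static void main(String[] args) {
        String text = "Pierre Vinken joined the board";
        CharSpan[] tokens = { new CharSpan(0, 6), new CharSpan(7, 13) };
        for (CharSpan s : tokens) {
            // Each view shares the backing text; nothing is reallocated here.
            System.out.println(view(text, s));
        }
    }
}
```

Whether this is actually faster than `String.substring` depends on the JVM version and on how the downstream code consumes the spans, so treat it as a starting point for measurement rather than a recommendation.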
