Re: pignlproc: a new tool to build NER models from Wikipedia / DBpedia dumps

Grant Ingersoll Sat, 22 Jan 2011 05:06:15 -0800

On Jan 5, 2011, at 12:22 PM, Jörn Kottmann wrote:

> On 1/5/11 4:44 PM, Olivier Grisel wrote:
>> 2011/1/5 Jason Baldridge<[email protected]>:
>>> This looks great, and it aligns with my own recent interest in large scale
>>> NLP with Hadoop, including working with Wikipedia. I'll look at it more
>>> closely later, but in principle I would be interested in having this brought
>>> into the OpenNLP project in some way!
>> Thanks for your interest. Don't hesitate to fork the repo on github to
>> experiment with your own design ideas. OpenNLP methods often handle
>> String[][] and Span[] data-structures where span start and end index
>> either refer to char positions or token indices. It might be
>> interesting to make some generic wrappers for those data-structures from
>> / to pig tuples by taking care of not reallocating memory when not
>> necessary.
>> 
> 
> Making OpenNLP faster is always nice. I believe we should one day
> move away from String and use CharSequence instead, because that usually


In Lucene, we recently converted everything to be byte-based, and the
speedup has been pretty significant.

-Grant
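
For readers following the thread, here is a minimal sketch of the idea discussed above: keeping spans as char offsets over a CharSequence and mapping token-index spans onto them without allocating a String per token. It is not OpenNLP, Lucene, or pignlproc code. The class SpanSketch, the CharSpan type, and the helpers tokenize, toCharSpan, and view are hypothetical names made up for illustration; the only library call relied on is java.nio.CharBuffer.wrap(CharSequence, int, int), which returns a read-only view rather than a copy.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {

    /** Hypothetical stand-in for a span: half-open [start, end) char offsets. */
    static final class CharSpan {
        final int start;
        final int end;

        CharSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Whitespace tokenizer that records char offsets only; no substrings are created. */
    static List<CharSpan> tokenize(CharSequence text) {
        List<CharSpan> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;
            } else if (ws && start >= 0) {
                spans.add(new CharSpan(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new CharSpan(start, text.length()));
        }
        return spans;
    }

    /** Converts a half-open token-index range into the char span that covers it. */
    static CharSpan toCharSpan(List<CharSpan> tokens, int firstToken, int endToken) {
        return new CharSpan(tokens.get(firstToken).start, tokens.get(endToken - 1).end);
    }

    /** Read-only view over the span; CharBuffer.wrap does not copy the chars. */
    static CharSequence view(CharSequence text, CharSpan span) {
        return CharBuffer.wrap(text, span.start, span.end);
    }

    public static void main(String[] args) {
        CharSequence text = "Olivier Grisel wrote pignlproc";
        List<CharSpan> tokens = tokenize(text);
        CharSpan name = toCharSpan(tokens, 0, 2); // tokens 0 and 1
        System.out.println(view(text, name));     // prints "Olivier Grisel"
    }
}
```

Whether a view like this actually beats plain String handling depends on the workload and the JVM, so it would need to be measured rather than assumed.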
