2011/1/5 Jason Baldridge <[email protected]>:
> This looks great, and it aligns with my own recent interest in large scale
> NLP with Hadoop, including working with Wikipedia. I'll look at it more
> closely later, but in principle I would be interested in having this brought
> into the OpenNLP project in some way!
Thanks for your interest. Don't hesitate to fork the repo on GitHub to experiment with your own design ideas.

OpenNLP methods often handle String[][] and Span[] data structures, where a span's start and end indices refer either to char positions or to token indices. It might be interesting to write generic wrappers that convert those data structures from / to Pig tuples, taking care not to reallocate memory unnecessarily.

Mining a medium / large scale corpus in an almost interactive way with the Pig shell (grunt) is a great way to quickly test ideas and prototypes and to tap into the unreasonable effectiveness of data.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
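P.S. To make the Span[] wrapper idea a bit more concrete, here is a rough sketch in plain Java. It is only an illustration under assumptions: SimpleSpan is a hypothetical stand-in for opennlp.tools.util.Span (assumed here to have an inclusive start and an exclusive end), and the "tuple" is a plain int[] pair rather than a real Pig Tuple, so that the snippet runs without Pig or OpenNLP on the classpath. It shows a read-only view that avoids copying the whole array up front, and a conversion from token indices to char positions.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical stand-in for opennlp.tools.util.Span.
// Assumption: start is inclusive, end is exclusive.
final class SimpleSpan {
    final int start;
    final int end;

    SimpleSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SpanTupleSketch {

    // Read-only view over a Span array: each element is exposed as a
    // two-field "tuple" {start, end}. The backing array is not copied;
    // a small pair is only built when an element is actually read.
    // In a real wrapper the pair would be a Pig tuple instead of int[].
    static List<int[]> asTupleView(final SimpleSpan[] spans) {
        return new AbstractList<int[]>() {
            @Override
            public int[] get(int i) {
                return new int[] {spans[i].start, spans[i].end};
            }

            @Override
            public int size() {
                return spans.length;
            }
        };
    }

    // Convert a span expressed in token indices into a span expressed in
    // char positions, given the char span of every token in the sentence.
    static SimpleSpan tokenToCharSpan(SimpleSpan tokenSpan, SimpleSpan[] tokenCharSpans) {
        if (tokenSpan.end <= tokenSpan.start) {
            throw new IllegalArgumentException("empty token span");
        }
        int charStart = tokenCharSpans[tokenSpan.start].start;
        int charEnd = tokenCharSpans[tokenSpan.end - 1].end;
        return new SimpleSpan(charStart, charEnd);
    }

    public static void main(String[] args) {
        String text = "Jason lives in Austin";
        // Char spans of the four whitespace-separated tokens.
        SimpleSpan[] tokens = {
            new SimpleSpan(0, 5), new SimpleSpan(6, 11),
            new SimpleSpan(12, 14), new SimpleSpan(15, 21)
        };

        // A name covering only the last token, as a token-index span.
        SimpleSpan name = new SimpleSpan(3, 4);
        SimpleSpan chars = tokenToCharSpan(name, tokens);
        System.out.println(text.substring(chars.start, chars.end));

        for (int[] pair : asTupleView(tokens)) {
            System.out.println(pair[0] + "," + pair[1]);
        }
    }
}
```

A real implementation would return Pig tuples from the view and would probably need a flag on the wrapper saying whether the indices are char positions or token indices, since the Span type alone does not tell you.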
