On Jan 13, 2011, at 10:55 AM, Jörn Kottmann wrote:

> On 1/11/11 2:21 PM, Olivier Grisel wrote:
>> 2011/1/4 Olivier Grisel<[email protected]>:
>>> I plan to give more details in a blog post soon (tm).
>> Here it is:
>>
>> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
>>
>> It gives a bit more context and some additional results and clues for
>> improvements and potential new usages.
>>
> Now I read this post too, sounds very interesting.
>
> What is the biggest training file for the name finder you can generate with
> this method?
>
> I think we need MapReduce training support for OpenNLP. Actually that is
> already on my todo list, but currently I am still busy with the Apache
> migration and the next release.
> Anyway I hope we can get that done at least partially for the name finder
> this year.
>
One of the things that I mentioned earlier is that it might make sense to just build on Mahout for this stuff. We'd love to do MaxEnt, but we also have a lot of other classifiers (Bayes, SGD, Random Forests). To me, if OpenNLP were abstracted a little bit from the classification algorithm, that would make it easier for people to plug in and try out their own, including the Pig stuff Olivier is suggesting.

-Grant
