I've been using OpenNLP for a few years, and I find the best results occur when the models are generated from samples of the data they will be run against, which is one of the reasons I like the Maxent approach. I am not sure that attempting to provide models will bear much fruit, other than that users will no longer be afraid of the licensing issues associated with using them in commercial systems. I do strongly think we should provide a model-building framework (one that calls the training API) and a default implementation.

Coincidentally, I have been building such a framework and implementation over the last few months. It creates models by seeding an iterative process with known entities and then iterating through a set of supplied sentences: create annotations, write them, create a Maxent model, load the model, create more annotations based on the results (there is a validation object involved), and so on. With this method I was able to create an NER model for people's names against a 200K-sentence corpus that returns acceptable results, just by starting with a list of five highly unambiguous names. I will propose the framework in more detail in the coming days and supply my implementation if everyone is interested.

As for the initial question, I would like to see OpenNLP provide a framework for rapidly/semi-automatically building models out of user data, and also for performing entity resolution across documents, in order to assign a probability to whether the "Bob" in one document is the same as the "Bob" in another.

MG
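
To make the seeded, iterative process described above a little more concrete, here is a rough sketch of the loop. It is only an illustration under loud assumptions, not the framework mentioned in this message: it uses no OpenNLP classes, the "model" is a hypothetical stand-in (the set of words seen immediately before a known name) rather than a Maxent model, and the validation step is a simple capitalization check. All class and method names are made up for the example.

```java
import java.util.*;

class BootstrapSketch {
    // Stand-in for "train a model": collect the words that directly precede known names.
    // (Assumption: a real implementation would train a Maxent name finder here.)
    static Set<String> train(List<String[]> sentences, Set<String> names) {
        Set<String> contexts = new HashSet<>();
        for (String[] toks : sentences)
            for (int i = 1; i < toks.length; i++)
                if (names.contains(toks[i])) contexts.add(toks[i - 1]);
        return contexts;
    }

    // Stand-in for the validation object: accept only capitalized tokens.
    static boolean isValid(String tok) {
        return !tok.isEmpty() && Character.isUpperCase(tok.charAt(0));
    }

    // Seed with known names, then alternate "train" and "annotate" until nothing new is found.
    static Set<String> bootstrap(List<String[]> sentences, Set<String> seeds, int maxRounds) {
        Set<String> names = new TreeSet<>(seeds);
        for (int round = 0; round < maxRounds; round++) {
            Set<String> model = train(sentences, names);
            Set<String> found = new HashSet<>();
            for (String[] toks : sentences)
                for (int i = 1; i < toks.length; i++)
                    if (model.contains(toks[i - 1]) && isValid(toks[i])) found.add(toks[i]);
            if (!names.addAll(found)) break; // no new names this round: stop iterating
        }
        return names;
    }

    public static void main(String[] args) {
        List<String[]> corpus = List.of(
            "yesterday Mr. Smith met Mr. Nguyen".split(" "),
            "later Dr. Nguyen called Dr. Patel".split(" "),
            "the dog barked loudly".split(" "));
        // Starting from the single seed "Smith", prints [Nguyen, Patel, Smith]
        System.out.println(bootstrap(corpus, Set.of("Smith"), 10));
    }
}
```

In this toy corpus the seed "Smith" teaches the stand-in model the context "Mr.", which finds "Nguyen"; "Nguyen" in turn teaches it "Dr.", which finds "Patel". That chaining is the point of iterating rather than training once.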

On Tue, Oct 1, 2013 at 11:01 AM, Michael Schmitz <[email protected]> wrote:

> Hi, I've used OpenNLP for a few years--in particular the chunker, POS
> tagger, and tokenizer. We're grateful for a high performance library
> with an Apache license, but one of our greatest complaints is the
> quality of the models. Yes--we're aware we can train our own--but
> most people are looking for something that is good enough out of the
> box (we aim for this with our products). I'm not surprised that
> volunteer engineers don't want to spend their time annotating data ;-)
>
> I'm curious what other people see as the biggest shortcomings for
> OpenNLP or the most important next steps for OpenNLP. I may have an
> opportunity to contribute to the project and I'm trying to figure out
> where the community thinks the biggest impact could be made.
>
> Peace.
> Michael Schmitz
>
