On 4/27/11 9:04 PM, Chris Collins wrote:
> 1) I can understand that you cannot distribute the original training set for
> English etc., perhaps because of distribution rights. Knowing where, or at
> least the flavor of where, the original corpus came from would be nice.
> Knowing what type of people labeled the data, how many of them there were,
> and how much of it was labeled would be useful in determining if we are off.
This is actually on my to-do list. We need to create a wiki page or something
similar to document the training data the English models have been trained on.
All the other models are mostly trained on public data.
> 2) What are the planned models? Are there any existing open source projects
> that want help on these exercises?
There are no plans from my side. If you know of a public corpus you would like
to train OpenNLP on, we are happy to add native support for it, like we already
did for a couple of corpora.
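
To give a rough idea of what "native support" means: a corpus format is hooked
in via a reader that turns the raw data into OpenNLP sample objects, which the
trainer then consumes through the ObjectStream interface. Below is a rough,
untested sketch of such an adapter for the name finder; the one-sentence-per-line
"word/TAG" input format and the class name are made up for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.Span;

// Sketch of a corpus adapter: reads one sentence per line in a made-up
// "word/TAG" format (TAG is "O" for non-name tokens) and emits NameSample
// objects. Supporting a new corpus mostly means writing a reader like this.
public class MyCorpusNameSampleStream implements ObjectStream<NameSample> {

  private final BufferedReader in;

  public MyCorpusNameSampleStream(BufferedReader in) {
    this.in = in;
  }

  public NameSample read() throws IOException {
    String line = in.readLine();
    if (line == null) {
      return null; // end of the corpus
    }

    List<String> tokens = new ArrayList<String>();
    List<Span> names = new ArrayList<Span>();

    String[] parts = line.split("\\s+");
    for (int i = 0; i < parts.length; i++) {
      int sep = parts[i].lastIndexOf('/');
      tokens.add(parts[i].substring(0, sep));
      String tag = parts[i].substring(sep + 1);
      if (!"O".equals(tag)) {
        // in this made-up format every tagged token is a one-token name
        names.add(new Span(i, i + 1, tag));
      }
    }

    return new NameSample(tokens.toArray(new String[tokens.size()]),
        names.toArray(new Span[names.size()]), false);
  }

  public void reset() throws IOException, UnsupportedOperationException {
    // a real implementation would re-open the underlying corpus here
    throw new UnsupportedOperationException();
  }

  public void close() throws IOException {
    in.close();
  }
}

Once a stream like this exists, the same training code works for the new corpus
as for any other.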
> 3) I see that with 1.5 there seems to be better support for taking training
> sets from other file formats. What are the motivations? Is it so that ONLP
> can take advantage of existing training sets that will help with 2), or is it
> generally to help the community interoperate better?
From my side, the main motivation was to have data sets people can test OpenNLP
on; if someone wants to contribute something, they can now at least test the
modification. Another motivation is that the more languages and corpora we
support, the more people are interested in working on and with OpenNLP.
BTW, we had a discussion here about starting a corpus project based on Wikinews
(and also Wikipedia) content; maybe you would be interested in helping with that.
Jörn