On 4/27/11 9:04 PM, Chris Collins wrote:
1) I can understand that you cannot distribute the original training set for English 
etc., perhaps because of distribution rights. Knowing where, or at least the flavor of 
where, the original corpus came from would be nice. What type of people, and how many, 
were used in labeling the data, and how much of it? That would be useful in determining 
if we are off.

This is actually on my to-do list. We need to create a wiki page or something similar to document the training data the English models have been trained on. All the other models are mostly trained on public data.

2) What are the planned models? Are there any existing open source projects 
that want help on these exercises?

There are no plans from my side. If you know of a public corpus you would like to train OpenNLP on, we are happy to add native support for it, like we did for a couple of corpora already.

3) I see that with 1.5 there seems to be better support for taking training 
sets from other file formats.  What are the motivations?  Is it so that ONLP 
can take advantage of existing training sets that will help with 2), or is it 
generally to help the community interoperate better?

From my side the main motivation was to have data sets people can test OpenNLP on; if someone
wants to contribute something, he can now at least test the modification.
Another motivation is that the more languages and corpora we support, the more people are interested
in working on and with OpenNLP.

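To give an idea of what such format support looks like: a corpus format is essentially
plugged in as an ObjectStream over the corresponding sample type. Below is a rough sketch
for a made-up one-token-per-line, tab-separated named entity corpus; the class name and
format are hypothetical, and the exact OpenNLP constructor and method signatures may
differ between releases.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.Span;

// Hypothetical adapter: reads "token<TAB>label" lines with a blank line
// between sentences and emits one NameSample per sentence.
public class MyCorpusNameSampleStream implements ObjectStream<NameSample> {

    private final BufferedReader reader;

    public MyCorpusNameSampleStream(BufferedReader reader) {
        this.reader = reader;
    }

    public NameSample read() throws IOException {
        List<String> tokens = new ArrayList<String>();
        List<Span> names = new ArrayList<Span>();

        int nameStart = -1;
        String nameType = null;
        String line;

        // Collect tokens until the sentence-separating blank line (or EOF).
        while ((line = reader.readLine()) != null && line.length() > 0) {
            String[] parts = line.split("\t");
            String label = parts.length > 1 ? parts[1] : "O";

            if (!"O".equals(label)) {
                if (nameStart < 0) {         // a name span starts here
                    nameStart = tokens.size();
                    nameType = label;
                }
                // Simplification: directly adjacent names of different
                // types are not split into separate spans.
            } else if (nameStart >= 0) {     // a name span just ended
                names.add(new Span(nameStart, tokens.size(), nameType));
                nameStart = -1;
            }
            tokens.add(parts[0]);
        }

        if (nameStart >= 0) {                // name runs until sentence end
            names.add(new Span(nameStart, tokens.size(), nameType));
        }

        if (tokens.isEmpty()) {
            // null signals the end of the stream; skip stray blank lines
            return line == null ? null : read();
        }

        return new NameSample(tokens.toArray(new String[tokens.size()]),
            names.toArray(new Span[names.size()]), false);
    }

    public void reset() throws IOException, UnsupportedOperationException {
        throw new UnsupportedOperationException();
    }

    public void close() throws IOException {
        reader.close();
    }
}

A stream like this can then be handed to the name finder training code, or wrapped by
the command line tooling once native support for the format is added.
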
BTW, we had a discussion here about starting a Wikinews (and also Wikipedia) content-based corpus project;
maybe you would be interested in helping with that.

Jörn

