On 04/16/2012 01:13 PM, Jeyendran Balakrishnan wrote:
The github project for distributing model files sounds like a great idea.

It would also be very useful to get an authoritative list (with name,
description, and especially URL) of the training data files used to generate
each of the trained models.
Especially for models trained using OpenNLP training data, it is not clear
where the training data files are available.
By making the training data files available, OpenNLP can enable users to
augment them by adding their own training samples and retrain on the augmented
data set.
Retraining would help significantly either in improving accuracy in
different problem domains (e.g., blog articles compared to newspaper
articles, etc.) or covering corner cases missed by the original training
data. Having the original training data will help immeasurably since it will
be much more manageable for users to merely add their own training samples,
compared to generating and annotating all the original training samples.

Any thoughts on this?


Well, we agree, but we cannot publish copyright-protected training data
such as MUC 6/7, ACE, etc. That's why we currently focus mostly on sharing
the code which is necessary to work with these data sets.
Data sets which can be distributed in some way under a restrictive (not AL-compatible)
license are published in the github project.

What we have to do in the end is to start a community labeling project on
texts which can be licensed under an Open Source license.

We started to work on the tooling for the community labeling project, but are progressing very slowly, because we do not have enough resources to write all the tooling.
I am using the existing stuff for work-related projects and am able
to contribute bug fixes and improvements back.

Jörn
