On 1/19/11 10:47 PM, [email protected] wrote:
How were those models created in the first place? I mean, is there a script to
create all the models if we have the necessary corpora?
It would be nice to have more details about each model, like its accuracy
and F1-score, and info about how it was trained: number of iterations,
cutoff. Maybe the script should collect this data while training and prepare
an information page.
How the models are created depends. For 1.5.0 there is a script
which can be used to train most of them on the historically used
training data, but that data is really not open. That is why we need
to find new sources of training data which could at least be reproduced
by others after buying a corpus or signing contracts. The formats package
is a step in this direction, but it will still take more time.
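
For illustration, training one of the models by hand looks roughly like
this with the 1.5.0 name finder API. Just a sketch, the file names and
the iterations/cutoff values below are placeholders, not what the script
actually uses:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainNameFinder {
        public static void main(String[] args) throws Exception {
            // Training data in the name sample format, one sentence per line
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream("train.txt"), "UTF-8")));

            // 100 iterations and a cutoff of 5 are the usual defaults
            TokenNameFinderModel model = NameFinderME.train("en", "person",
                samples, Collections.<String, Object>emptyMap(), 100, 5);

            FileOutputStream out = new FileOutputStream("en-ner-person.bin");
            model.serialize(out);
            out.close();
        }
    }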
Preferable would be to have completely open data, maybe based
on wikinews articles, as I described in the mail over at the user
mailing list.
You are right, we need to put together a list which at least says on
which data the models have been trained. Some of our historic training
data is also hand-corrected or extended with additional data, so it is
not possible to re-create that training data from the corpus it was
derived from.
That one might surprise you: the models actually contain a properties
file with certain meta information about the training, namely the cutoff,
the number of iterations, a hash sum of the training event stream, the
OpenNLP version (for compatibility checks), and, depending on the
component, some other things.
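
A model file is just a zip package, so you can look at that properties
file yourself with nothing but the JDK. A minimal sketch, the model
file name is again a placeholder:

    import java.util.Properties;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class PrintManifest {
        public static void main(String[] args) throws Exception {
            // An OpenNLP model is a zip file; the training meta
            // information lives in the manifest.properties entry
            ZipFile zip = new ZipFile("en-ner-person.bin");
            ZipEntry entry = zip.getEntry("manifest.properties");

            Properties manifest = new Properties();
            manifest.load(zip.getInputStream(entry));
            zip.close();

            // Prints e.g. cutoff, iterations, event hash, version
            manifest.list(System.out);
        }
    }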
Having open data is also critical for regression testing; if it depends on
the closed data we cannot really have an open development process.
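
That also touches the accuracy/F1 question above: with open test data
anyone could re-run the evaluation themselves instead of trusting numbers
on a page. For the name finder that could look roughly like this, again
just a sketch with placeholder file names:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderEvaluator;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvaluateNameFinder {
        public static void main(String[] args) throws Exception {
            TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-person.bin"));

            // Held-out data in the same name sample format as the training data
            ObjectStream<NameSample> testSamples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream("test.txt"), "UTF-8")));

            TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(testSamples);

            // Prints precision, recall and F-measure on the test set
            System.out.println(evaluator.getFMeasure());
        }
    }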
Jörn