On 1/19/11 10:47 PM, [email protected] wrote:
How were those models created in the first place? I mean, is there a script to
create all the models if we have the necessary corpora?
It would be nice to have more details about each model, like its accuracy
and F1-score, and info about how it was trained: number of iterations,
cutoff. Maybe the script should collect this data while training and prepare
an information page.
How the models are created depends. For 1.5.0 there is a script
which can be used to train most of them on the historically used
training data, but that data is really not open. That is why we need
to find new sources of training data which could at least be reproduced
by others after buying a corpus or signing contracts. The formats package
is a step in this direction, but it will still take more time.
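
For illustration, training one of the models by hand looks roughly like
this with the 1.5.0 name finder API. Just a sketch, the file names and
the iterations/cutoff values below are placeholders, not what the script
actually uses:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.util.Collections;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainNameFinder {
        public static void main(String[] args) throws Exception {
            // Training data in the name sample format, one sentence per line
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream("train.txt"), "UTF-8")));

            // 100 iterations and a cutoff of 5 are the usual defaults
            TokenNameFinderModel model = NameFinderME.train("en", "person",
                samples, Collections.<String, Object>emptyMap(), 100, 5);

            FileOutputStream out = new FileOutputStream("en-ner-person.bin");
            model.serialize(out);
            out.close();
        }
    }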
Preferable would be to have completely open data, maybe based
on wikinews articles, as I described in the mail over at the user
mailing list.
You are right, we need to put together a list which at least says on
which data the models have been trained. Some of our historic training
data is also hand-corrected or extended with additional data, so it is
not possible to re-create that training data from the corpus it was
derived from.
That one might surprise you: the models actually contain a properties
file with certain meta information about the training, namely the cutoff,
the number of iterations, a hash sum of the training event stream, the
OpenNLP version (for compatibility checks), and, depending on the
component, some other things.
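
A model file is just a zip package, so you can look at that properties
file yourself with nothing but the JDK. A minimal sketch, the model
file name is again a placeholder:

    import java.util.Properties;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class PrintManifest {
        public static void main(String[] args) throws Exception {
            // An OpenNLP model is a zip file; the training meta
            // information lives in the manifest.properties entry
            ZipFile zip = new ZipFile("en-ner-person.bin");
            ZipEntry entry = zip.getEntry("manifest.properties");

            Properties manifest = new Properties();
            manifest.load(zip.getInputStream(entry));
            zip.close();

            // Prints e.g. cutoff, iterations, event hash, version
            manifest.list(System.out);
        }
    }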
Having open data is also critical for regression testing; if it depends on
the closed data we cannot really have an open development process.
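
That also touches the accuracy/F1 question above: with open test data
anyone could re-run the evaluation themselves instead of trusting numbers
on a page. For the name finder that could look roughly like this, again
just a sketch with placeholder file names:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderEvaluator;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvaluateNameFinder {
        public static void main(String[] args) throws Exception {
            TokenNameFinderModel model = new TokenNameFinderModel(
                new FileInputStream("en-ner-person.bin"));

            // Held-out data in the same name sample format as the training data
            ObjectStream<NameSample> testSamples = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(
                    new FileInputStream("test.txt"), "UTF-8")));

            TokenNameFinderEvaluator evaluator =
                new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(testSamples);

            // Prints precision, recall and F-measure on the test set
            System.out.println(evaluator.getFMeasure());
        }
    }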
Jörn