Hi all,

I have created a script [1] to train OpenNLP models from Universal
Dependencies [2] data to give OpenNLP models that can be distributed under
the Apache license,

The script automates the training of tokenizer, sentence, and POS models
for English, Dutch, French, German, and Italian. (The NameFinder does not
currently support the input annotation format so those models will come
later.) While decent performance is always beneficial, the primary purpose
of this task is to provide working OpenNLP models the project can
distribute. Having these models will help reduce the barrier to entry for
users new to OpenNLP.

Once voted and approved, the trained models will be pushed to Subversion
alongside the current OpenNLP language detection model. From there, the
models can be made available for download on the OpenNLP website and
programmatically through OPENNLP-1318 [3]. The script to train the models
and instructions will be added to the OpenNLP repository.

To use the script:

1. Download and extract UD.
2. Download and extract OpenNLP.
3. Create a directory to store the trained models.
3. Modify the ud-train.sh script to set the path to those three directories.
4. Execute the ud-train.sh script.

The training log, evaluation output, and model files will be saved to the
$OUTPUT_MODELS directory. Models and the output files I trained using this
script can be viewed on Dropbox [4].

Before calling a vote to release the models, I would like to see if there
is any feedback on the process or direction. If you have any comments
please feel free.

Thanks,
Jeff

[1] https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
[2] https://universaldependencies.org/
[3] https://issues.apache.org/jira/browse/OPENNLP-1318
[4]
https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0

Reply via email to