Hi all,

I have created a script [1] that trains OpenNLP models from Universal Dependencies [2] data, so that the resulting models can be distributed under the Apache license.
The script automates the training of tokenizer, sentence detector, and POS models for English, Dutch, French, German, and Italian. (The NameFinder does not currently support the input annotation format, so those models will come later.)

While decent performance is always beneficial, the primary purpose of this task is to provide working OpenNLP models that the project can distribute. Having these models will help lower the barrier to entry for users new to OpenNLP.

Once voted on and approved, the trained models will be pushed to Subversion alongside the current OpenNLP language detection model. From there, the models can be made available for download on the OpenNLP website and programmatically through OPENNLP-1318 [3]. The script to train the models, along with instructions, will be added to the OpenNLP repository.

To use the script:

1. Download and extract UD.
2. Download and extract OpenNLP.
3. Create a directory to store the trained models.
4. Modify the ud-train.sh script to set the paths to those three directories.
5. Execute the ud-train.sh script.

The training log, evaluation output, and model files will be saved to the $OUTPUT_MODELS directory. The models and output files I trained using this script can be viewed on Dropbox [4].

Before calling a vote to release the models, I would like to see if there is any feedback on the process or direction. If you have any comments, please feel free to share them.

Thanks,
Jeff

[1] https://github.com/jzonthemtn/opennlp/blob/ud-models/scripts/ud-train.sh
[2] https://universaldependencies.org/
[3] https://issues.apache.org/jira/browse/OPENNLP-1318
[4] https://www.dropbox.com/sh/p8focuz0qwvw84b/AAC6GqO8mqZn_xkAqHZsVAsoa?dl=0
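
P.S. The setup steps above depend on three directories being in place before ud-train.sh runs. A minimal pre-flight check could look like the sketch below. It is an illustration only and not part of ud-train.sh: the variable names UD_HOME and OPENNLP_HOME, the default paths, and the check_dirs helper are all assumptions. Only $OUTPUT_MODELS is named in this email.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the three directories used in training.
# UD_HOME and OPENNLP_HOME are assumed names; ud-train.sh may use others.
UD_HOME="${UD_HOME:-$HOME/ud-treebanks}"              # extracted UD data
OPENNLP_HOME="${OPENNLP_HOME:-$HOME/apache-opennlp}"  # extracted OpenNLP
OUTPUT_MODELS="${OUTPUT_MODELS:-$HOME/ud-models}"     # trained models go here

# Return 0 if every argument is an existing directory, 1 otherwise.
# Each missing directory is reported on stderr.
check_dirs() {
  status=0
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "Missing directory: $dir" >&2
      status=1
    fi
  done
  return "$status"
}
```

With the same paths that were set inside ud-train.sh, something like `check_dirs "$UD_HOME" "$OPENNLP_HOME" "$OUTPUT_MODELS" && ./ud-train.sh` would start training only when all three directories exist.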