Hi all,
Every once in a while somebody here asks about OpenNLP models.
The typical answer then is that there are the models on SourceForge
but that people should rather train their own. Then often somebody
mentions that support for this-or-that corpus format in OpenNLP
would be cool in order to train on this-or-that dataset. And that
is where it normally ends.
So I thought, why not add the ability to train OpenNLP models to DKPro Core?
DKPro Core is an open-source collection of components for Apache UIMA
integrating many NLP tools including OpenNLP into a uniform toolkit.
DKPro Core already offers readers for many corpus formats.
We have also started adding a dataset API to conveniently access
different standard corpora/datasets that are publicly/freely
available on the net.
And finally, we added support for the OpenNLP training tools for
tokenizer, sentence splitter, POS tagger, chunker, and name finder.
This makes it easy to train new OpenNLP models on many datasets
in just a few lines of code, e.g. [1]:
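
    // createReaderDescription / createEngineDescription are static imports from
    // uimaFIT's CollectionReaderFactory / AnalysisEngineFactory.

    // Load the GUM corpus via the dataset API; "cache" is the local cache folder.
    DatasetLoader loader = new DatasetLoader(new File("cache"));
    Dataset ds = loader.loadEnglishGUMCorpus();

    // Read the training files of the dataset with the CoNLL 2006 reader.
    CollectionReaderDescription trainReader = createReaderDescription(
            Conll2006Reader.class,
            Conll2006Reader.PARAM_PATTERNS, ds.getTrainingFiles(),
            Conll2006Reader.PARAM_LANGUAGE, ds.getLanguage());

    // Train a POS tagger model and write it to model.bin in targetFolder
    // (targetFolder is a java.io.File defined elsewhere).
    AnalysisEngineDescription trainer = createEngineDescription(
            OpenNlpPosTaggerTrainer.class,
            OpenNlpPosTaggerTrainer.PARAM_TARGET_LOCATION, new File(targetFolder, "model.bin"),
            OpenNlpPosTaggerTrainer.PARAM_LANGUAGE, ds.getLanguage());

    SimplePipeline.runPipeline(trainReader, trainer);
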
Large parts of DKPro Core are ASL-licensed, but it is not an Apache project.
I hope this will be useful to people.
Happy for any feedback! :)
Cheers,
-- Richard
[1] https://github.com/dkpro/dkpro-core/blob/master/dkpro-core-opennlp-asl/src/test/java/de/tudarmstadt/ukp/dkpro/core/opennlp/OpenNlpPosTaggerTrainerTest.java