> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com> wrote:
>
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.
Great!

> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence detector, I imagine you could make use of the source side of
our parallel corpus, which has thousands to millions of sentences, one per
line.

For tokenization (and normalization), we don't typically train models but
instead use a set of manually developed heuristics, which may or may not be
language-specific. See

https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl

How much training data do you generally need for each task?

> Jörn
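For illustration, here is a minimal sketch of what "manually developed heuristics" means in this context. These are NOT the actual rules from tokenize.pl (which is Perl and far more thorough); they just show the general rule-based approach of splitting punctuation off words while protecting special tokens:

```python
import re

def tokenize(line):
    """Toy rule-based tokenizer: a few hand-written splitting heuristics.

    Illustrative only; real heuristic tokenizers like tokenize.pl carry
    many more rules (abbreviations, numbers, language-specific cases).
    """
    # Protect URLs so the punctuation rules below don't split them apart.
    # (A real tokenizer would also handle trailing punctuation on URLs.)
    urls = re.findall(r'https?://\S+', line)
    for i, url in enumerate(urls):
        line = line.replace(url, f'__URL{i}__')
    # Separate common punctuation from adjacent words.
    line = re.sub(r'([.,!?;:()"])', r' \1 ', line)
    # Collapse the extra whitespace introduced above.
    line = re.sub(r'\s+', ' ', line).strip()
    # Restore the protected URLs.
    for i, url in enumerate(urls):
        line = line.replace(f'__URL{i}__', url)
    return line.split(' ')

print(tokenize('Download it from http://example.com today, please.'))
# → ['Download', 'it', 'from', 'http://example.com', 'today', ',', 'please', '.']
```

The appeal of this style is that each rule is inspectable and needs no training data; the cost is that every new language or edge case means another hand-written rule.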