> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <[email protected]> wrote:
>
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.
Great!
> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?
For the sentence-splitter, I imagine you could make use of the source side of
our parallel corpus, which has thousands to millions of sentences, one per line.
For tokenization (and normalization), we don't typically train models but
instead use a set of manually developed heuristics, which may or may not be
language-specific. See
https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
How much training data do you generally need for each task?
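For reference, here is a rough sketch of how the source side of our corpus could be fed to OpenNLP's SentenceDetectorTrainer, since it already expects one sentence per line. File names (corpus.en, en-sent.train) are placeholders, and the training step assumes the opennlp CLI is on the PATH:

```shell
# corpus.en: source side of the parallel corpus, one sentence per line.
# Drop blank lines and exact duplicates so the trainer sees clean input.
grep -v '^[[:space:]]*$' corpus.en | sort -u > en-sent.train

# Then train the sentence detector (commented out; requires OpenNLP installed):
# opennlp SentenceDetectorTrainer -model en-sent.bin -lang en \
#   -data en-sent.train -encoding UTF-8
```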
>
> Jörn
>