> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> 
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.

Great!


> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence-splitter, I imagine you could make use of the source side of 
our parallel corpus, which has thousands to millions of sentences, one per line.
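
If that one-sentence-per-line format works directly, training should be 
something like the following (a rough sketch against the OpenNLP 1.6-era API; 
"corpus.en" and "en-sent.bin" are made-up file names):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            // corpus.en: source side of the parallel corpus, one sentence
            // per line; blank lines (if any) mark document boundaries
            ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus.en")),
                StandardCharsets.UTF_8);
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            SentenceModel model = SentenceDetectorME.train("en", samples,
                new SentenceDetectorFactory("en", true, null, null),
                TrainingParameters.defaultParams());

            try (OutputStream out = new FileOutputStream("en-sent.bin")) {
                model.serialize(out);
            }
        }
    }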

For tokenization (and normalization), we don't typically train models but 
instead use a set of manually developed heuristics, which may or may not be 
language-specific. See

        https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
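
To give a flavor of the kind of rules in that script, here is a hypothetical, 
greatly simplified Java rendering (two rules only, not the actual Perl logic):

    import java.util.regex.Pattern;

    public class HeuristicTokenizer {
        // illustrative rules: split punctuation off, normalize whitespace
        private static final Pattern PUNCT  = Pattern.compile("([.,!?:;()\\[\\]\"])");
        private static final Pattern SPACES = Pattern.compile("\\s+");

        public static String tokenize(String line) {
            // surround punctuation with spaces, then collapse whitespace
            String spaced = PUNCT.matcher(line).replaceAll(" $1 ");
            return SPACES.matcher(spaced).replaceAll(" ").trim();
        }

        public static void main(String[] args) {
            // prints: Hello , world ! ( a test . )
            System.out.println(tokenize("Hello, world! (a test.)"));
        }
    }

If OpenNLP wants a statistical tokenizer, I imagine the output of these 
heuristics could bootstrap its training data, which might keep the amount of 
hand annotation small.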

How much training data do you generally need for each task?


> 
> Jörn
