Hi colleagues,

I want to train my own models (possibly with a modified set of features) for word tokenization and sentence detection. My question is: how much training data is required for a reliable model?
I have been experimenting with training a word tokenizer model on 3 million lines of an artificially generated ("fake") corpus. The training procedure consumes a lot of space and memory, and the resulting model is also large, so I would like to optimize this by training on less data.
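For context, below is a rough sketch of the kind of training run I have in mind. It assumes the OpenNLP TokenizerME API (my toolkit is not fixed yet), and the file names, the language code "en", and the factory settings are placeholders for illustration only, not my actual setup.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // Placeholder path: one sentence per line, with token boundaries
        // marked by <SPLIT> as expected by TokenSampleStream.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("tok-train.txt"));

        try (ObjectStream<String> lines =
                     new PlainTextByLineStream(in, StandardCharsets.UTF_8);
             ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {

            // Placeholder factory settings: English, no abbreviation dictionary,
            // alphanumeric optimization off.
            TokenizerFactory factory = new TokenizerFactory("en", null, false, null);

            // Train with the default (maxent) training parameters.
            TokenizerModel model =
                    TokenizerME.train(samples, factory, TrainingParameters.defaultParams());

            // Write the trained model to disk (placeholder file name).
            try (OutputStream out =
                         new BufferedOutputStream(new FileOutputStream("tok-model.bin"))) {
                model.serialize(out);
            }
        }
    }
}

My plan would be to rerun something like this on progressively smaller slices of the corpus and compare model size and accuracy, but I would appreciate any guidance on how much data is typically enough.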
Kind regards,
Nikolai KROT