Hi colleagues,

I want to train my own models (possibly on a modified set of features) for
word tokenization and sentence detection. My question is: how much training
data is required for a reliable model?

I have been experimenting with training a word tokenizer model on 3 million
lines of an artificially generated (fake) corpus. The training procedure is
space- and memory-intensive, and the resulting model is also large, so I
would like to reduce its size by training on less data.
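
For reference, this is roughly how I plan to subsample the corpus into
progressively larger training sets and see where model quality stops
improving (file names and sample sizes below are just placeholders, not the
actual setup):

import random

# Reservoir-sample n lines from the full corpus so the subset stays
# representative without loading all 3 million lines into memory.
def sample_lines(src_path, dst_path, n, seed=42):
    random.seed(seed)
    reservoir = []
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i < n:
                reservoir.append(line)
            else:
                j = random.randint(0, i)
                if j < n:
                    reservoir[j] = line
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(reservoir)

# Build training sets of increasing size; a model would then be trained on
# each one and evaluated on a held-out set to find where accuracy plateaus.
for n in (50_000, 100_000, 250_000, 500_000, 1_000_000):
    sample_lines("corpus_full.txt", f"train_{n}.txt", n)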

Kind regards,
Nikolai KROT
