Hi colleagues,

I want to train my own models (possibly with a modified set of features) for word tokenization and sentence detection. My question is: how much training data is required for a reliable model?
I have been experimenting with training a word tokenizer model on 3 million lines of an artificially generated ("fake") corpus. The training procedure consumes a lot of space and memory, and the resulting model is also large, so I would like to optimize this by training on less data.
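For context, below is a rough sketch of the kind of training run I have in mind. It assumes the OpenNLP TokenizerME API (my toolkit is not fixed yet), and the file names, the language code "en", and the factory settings are placeholders for illustration only, not my actual setup.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // Placeholder path: one sentence per line, with token boundaries
        // marked by <SPLIT> as expected by TokenSampleStream.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("tok-train.txt"));

        try (ObjectStream<String> lines =
                     new PlainTextByLineStream(in, StandardCharsets.UTF_8);
             ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {

            // Placeholder factory settings: English, no abbreviation dictionary,
            // alphanumeric optimization off.
            TokenizerFactory factory = new TokenizerFactory("en", null, false, null);

            // Train with the default (maxent) training parameters.
            TokenizerModel model =
                    TokenizerME.train(samples, factory, TrainingParameters.defaultParams());

            // Write the trained model to disk (placeholder file name).
            try (OutputStream out =
                         new BufferedOutputStream(new FileOutputStream("tok-model.bin"))) {
                model.serialize(out);
            }
        }
    }
}

My plan would be to rerun something like this on progressively smaller slices of the corpus and compare model size and accuracy, but I would appreciate any guidance on how much data is typically enough.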
Kind regards,
Nikolai KROT