Hello,

You can get good results with something in the range of 10k sentences. You should not use fake/generated data for training, since that usually gives bad results.
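In case a concrete starting point helps, here is a minimal sketch of what the training call looks like with the OpenNLP Java API, assuming that is the toolkit you are using. The file names tok.train and en-token.bin and the class name are placeholders; the training file is expected to contain one sentence per line, with <SPLIT> marking token boundaries that are not already separated by whitespace.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TokenizerTrainingSketch {

    public static void main(String[] args) throws Exception {
        // "tok.train" is a placeholder: one sentence per line, with <SPLIT>
        // marking token boundaries not separated by whitespace.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("tok.train")),
                StandardCharsets.UTF_8);

        try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
            // Default training parameters; "en" is the language code of the data.
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());

            // Write the model out so it can later be loaded into a TokenizerME.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-token.bin"))) {
                model.serialize(out);
            }
        }
    }
}

The TokenizerTrainer command line tool does the same thing if you prefer not to write code, and sentence detector training works analogously with one sentence per line as input.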
For what kind of domain do you train the models? Which languages?

Jörn

On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <[email protected]> wrote:
> Hi colleagues,
>
> I want to train my own models (possibly on a modified set of features) for
> word tokenization and sentence detection. My question is how much training
> data is required for a reliable model.
>
> I have been experimenting with training a word tokenizer model on 3 mln
> lines of a fake corpus. The training procedure is space- and memory-consuming,
> the resulting model is also large, and I would like to optimize it by giving
> it less data.
>
> Kind regards,
> Nikolai KROT
