Hello,

You can get good results with something in the range of 10k sentences.
You should not train on fake/generated data, since that usually
gives poor results.
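
For reference, a training run through the API looks roughly like the
sketch below (written against the 1.8 Java API; the file names and the
"en" language code are placeholders). The input file is expected to
hold one sentence per line, with <SPLIT> tags marking token boundaries
that are not already separated by whitespace:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {

    public static void main(String[] args) throws Exception {
        // "train.tok" is a placeholder for your annotated training file
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("train.tok"));

        try (ObjectStream<TokenSample> samples = new TokenSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8))) {

            // "en" and the default factory settings are placeholders;
            // set an abbreviation dictionary etc. for your language
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());

            // write the trained model to disk
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-token.bin"))) {
                model.serialize(out);
            }
        }
    }
}

Raising the feature cutoff in the TrainingParameters prunes rare
features, which tends to shrink the resulting model.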

What kind of domain are you training the models for? Which languages?

Jörn

On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <[email protected]> wrote:
> Hi colleagues,
>
> I want to train my own models (possibly on a modified set of features) for
> word tokenization and sentence detection. My question is how much training
> data is required for a reliable model.
>
> I have been experimenting with training a word tokenizer model on 3 million
> lines of a generated (fake) corpus. The training procedure is disk- and
> memory-intensive, and the resulting model is also large, so I would like to
> optimize it by using less data.
>
> Kind regards,
> Nikolai KROT
