Hi Jörn,

Thank you. I am interested in the financial and legal domains, in German.
Can you tell me where the number 10k comes from? 10k sounds very appealing.

So far I have done a training run on a corpus that was possibly generated with
a different tokenizer. Interestingly, on my short test set (comprising around
30 cases where the stock OpenNLP model fails) the re-trained sentence splitter
works better than the conventional de-sent model. Such a test is of course
unfair :) I still need to compare the two models on a large volume of data to
see the real picture.
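
In case it is useful, here is a minimal sketch of how such a large-scale
comparison could be scripted with the OpenNLP 1.8 Java API. The file names
(de-sentences.train, de-sentences.eval) and the output model name are just
placeholders for my own data; the format is the usual one sentence per line.

    // A sketch, not a definitive setup: train a custom German sentence model
    // and evaluate it on a held-out file using the OpenNLP 1.8 Java API.
    // "de-sentences.train" and "de-sentences.eval" are placeholder file names.
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class SentenceModelComparison {

        public static void main(String[] args) throws Exception {
            // Training data: one sentence per line, empty line between documents.
            SentenceModel model = SentenceDetectorME.train(
                    "de",
                    samples("de-sentences.train"),
                    new SentenceDetectorFactory("de", true, null, null),
                    TrainingParameters.defaultParams());

            try (OutputStream out = new FileOutputStream("de-sent-custom.bin")) {
                model.serialize(out);
            }

            // Evaluate on a large held-out set to get a fairer picture than
            // a handful of hand-picked failure cases.
            SentenceDetectorEvaluator evaluator =
                    new SentenceDetectorEvaluator(new SentenceDetectorME(model));
            evaluator.evaluate(samples("de-sentences.eval"));
            System.out.println(evaluator.getFMeasure());
        }

        private static ObjectStream<SentenceSample> samples(String path) throws Exception {
            return new SentenceSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File(path)),
                            StandardCharsets.UTF_8));
        }
    }

Running the same evaluation stream against the stock de-sent.bin model should
then give directly comparable precision/recall numbers.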

Best regards,
Nikolai


On Mon, Sep 25, 2017 at 4:13 PM, Joern Kottmann <[email protected]> wrote:

> Hello,
>
> You can get good results with something in the range of 10k sentences.
> You should not use fake/generated data for training since that usually
> gives bad results.
>
> For what kind of domain do you train the models? Which languages?
>
> Jörn
>
> On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <[email protected]> wrote:
> > Hi colleagues,
> >
> > I want to train my own models (possibly on a modified set of features)
> > for word tokenization and sentence detection. My question is how much
> > training data is required for a reliable model.
> >
> > I have been experimenting with training a word tokenizer model on 3 mln
> > lines of a fake corpus. The training procedure is space- and
> > memory-consuming, the resulting model is also large, and I would like to
> > optimize it by giving it less data.
> >
> > Kind regards,
> > Nikolai KROT
>
