Hi Jörn,

Thank you. I am interested in the financial and legal domains, German language. Can you tell me where the number 10k comes from? 10k sounds very appealing.
So far I have trained on a corpus that was possibly generated with a different tokenizer. Interestingly, on my short test set (comprising around 30 cases where the opennlp model fails), the re-trained sentence splitter works better than the conventional de-sent model. Such testing is of course unfair :) I still have to compare the two models on a large volume of data to see the real picture.

Best regards,
Nikolai

On Mon, Sep 25, 2017 at 4:13 PM, Joern Kottmann <[email protected]> wrote:
> Hello,
>
> you can get good results with something in the range of 10k sentences.
> You should not use fake/generated data for training, since that usually
> gives bad results.
>
> For what kind of domain do you train the models? Which languages?
>
> Jörn
>
> On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <[email protected]> wrote:
> > Hi colleagues,
> >
> > I want to train my own models (possibly with a modified set of features)
> > for word tokenization and sentence detection. My question is how much
> > training data is required for a reliable model.
> >
> > I have been experimenting with training a word tokenizer model on 3 mln
> > lines of fake corpus. The training procedure is space- and memory-consuming,
> > and the resulting model is also large; I would like to optimize it by
> > giving it less data.
> >
> > Kind regards,
> > Nikolai KROT
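The fair comparison of the two sentence splitters could be done at the boundary level: collect the character offsets where each model ends a sentence, compare them against gold-standard offsets, and report precision/recall/F1. A minimal scoring sketch (all names and offsets below are illustrative, not tied to any particular OpenNLP output format):

```python
def boundary_scores(gold, predicted):
    """Compare two collections of sentence-boundary offsets
    (character positions where a sentence ends) and return
    (precision, recall, f1)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries both agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative usage with made-up offsets:
gold_offsets = [15, 42, 88, 130]
model_offsets = [15, 42, 90, 130]  # one boundary off by two characters
p, r, f1 = boundary_scores(gold_offsets, model_offsets)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.75 R=0.75 F1=0.75
```

Running both the re-trained model and the stock de-sent model through the same scorer on a large held-out set would remove the bias of a test set hand-picked from opennlp failure cases.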
