2011/10/3 Em <[email protected]>:
> Hello list,
>
> I am currently trying to create a person-model for a specific domain for
> testing purposes.
> While the general suggestion is to have around 10k-15k sentences, I
> retrain and reevaluate the outcome of my trainingdata while tagging new
> sentences.
>
> At the moment I am under 1k sentences. However I asked myself whether it
> makes sense to include sentences without persons or not.
> While playing around there was no clear conclusion to draw: Precision
> almost always increased when I included sentences without persons while
> *sometimes* recall dropped a little bit.
>
> Is there a general direction for tagging training data?
>
> Btw.: This is the first time I am preparing training data. I never saw a
> complete training-dataset before.

The general rule of thumb is to have a training set that looks as much as possible like the data you will be applying your model to. So if you will encounter sentences without names in your production data, include a similar ratio in your training set.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
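
To illustrate the ratio rule above, here is a minimal sketch (not part of the original thread) that subsamples person-free sentences so they make up roughly a chosen fraction of the training set. It assumes OpenNLP-style inline markup (`<START:person> ... <END>`), which the thread does not state explicitly; the function names `has_person` and `match_negative_ratio` and the parameter `target_negative_fraction` are hypothetical, and the target fraction would have to be estimated from a sample of your own production data.

```python
import random


def has_person(sentence):
    # Assumption: annotations use OpenNLP-style inline tags.
    # Adjust this check for a different annotation format.
    return "<START:person>" in sentence


def match_negative_ratio(sentences, target_negative_fraction, seed=0):
    """Keep all sentences with persons and subsample the person-free ones
    so that they form roughly the requested fraction of the result."""
    if not 0.0 <= target_negative_fraction < 1.0:
        raise ValueError("target_negative_fraction must be in [0, 1)")

    positives = [s for s in sentences if has_person(s)]
    negatives = [s for s in sentences if not has_person(s)]

    # Solve n_neg / (n_pos + n_neg) = f for n_neg.
    f = target_negative_fraction
    wanted = round(f * len(positives) / (1.0 - f))
    # Cannot keep more person-free sentences than were tagged.
    wanted = min(wanted, len(negatives))

    rng = random.Random(seed)
    result = positives + rng.sample(negatives, wanted)
    rng.shuffle(result)
    return result
```

For example, if about half of the production sentences contain no person, calling `match_negative_ratio(tagged_sentences, 0.5)` keeps as many person-free sentences as sentences with persons, provided enough of the former have been tagged. If too few are available, the function keeps all of them and the resulting fraction stays below the target.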
