On a slightly different note, what you're trying to do is NOT unreasonable... I'm thinking of the wider family of probabilistic classifiers, where 'training' essentially boils down to gathering a bunch of probabilities. I can't think of any technical reason that would prohibit you from training an already-trained model on some extra data, other than the implementation of a given system. In other words, if all you've got is a bunch of probabilities, why not be able to add to them in the future? Nothing stops you from doing that but the implementation specifics.

To make things more concrete, consider an HMM POS-tagger. When you 'train' it, all you're doing is 'observing' which tag appears more frequently before the tag you're currently looking at. From those frequencies you build a probability matrix, which you consult later on in order to make predictions. Now, consider this... let's say you chose to represent your matrix as a HashMap, so the entire model is a HashMap. There is absolutely no problem retraining that whenever you get some more data, without losing the original training observations (it does assume you keep raw counts and normalise them into probabilities at prediction time, since adding already-normalised probabilities wouldn't be valid). Of course, now you're gonna say that the tag-set might be different, but what if the first time you trained you had access to the first half of the Brown corpus, and after a couple of months you managed to find the other half? The tag-set doesn't change within the same corpus, so you can retrain your HMM without introducing noise... you just 'merge-with +' the maps at the end and voila! You should have the exact same model as you would have had if you'd trained on the entire Brown corpus.
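To illustrate the idea, here's a minimal Python sketch (not an actual OpenNLP or production implementation); it assumes the model is just a map of raw tag-bigram counts, with normalisation deferred to prediction time, so that merging two trained models with '+' reproduces training on the combined corpus:

```python
from collections import Counter, defaultdict

def count_transitions(tagged_sentences):
    """Gather tag-bigram counts from sentences of (word, tag) pairs.
    The returned map is the whole 'model': counts[prev_tag][cur_tag]."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sentence]  # sentence-start marker
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    return counts

def merge_counts(model_a, model_b):
    """The 'merge-with +' step: add the two count maps together."""
    merged = defaultdict(Counter)
    for model in (model_a, model_b):
        for prev, counter in model.items():
            merged[prev].update(counter)  # Counter.update adds counts
    return merged

# Training on two halves and merging gives the same counts as
# training once on the whole corpus:
half1 = [[("the", "DT"), ("dog", "NN")]]
half2 = [[("dogs", "NNS"), ("bark", "VBP")]]
merged = merge_counts(count_transitions(half1), count_transitions(half2))
assert merged == count_transitions(half1 + half2)
```

The key design choice is storing counts rather than probabilities: counts are additive, probabilities are not.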

Having said all that, I'll admit that I've never encountered any machine-learning implementation that allows you to do this, and I'm wondering why... It's easy to implement and provides a ton of flexibility. I've proven to myself that it can be done by implementing what I described in the previous paragraph (an HMM POS-tagger), and it works, but generally speaking libraries and frameworks don't allow it...

I'm saying all that so you don't go thinking that what you tried is just plain wrong... Well, the particular way you tried it is, but your underlying thought is perfectly valid.

Hope that helps :)

Jim



On 03/09/13 17:42, Jim - FooBar(); wrote:
On 03/09/13 17:25, Danica Damljanovic wrote:
I was trying to find the original opennlp corpora used for training, but
could not get anything apart from the binary model...

Anyone has any idea on whether it is possible to get this and how?

If I'm not mistaken, the original corpora cannot be redistributed due to licensing issues... However, don't take my word for it; someone with the appropriate authority (someone from the dev team) should answer this...

Also, if I remember correctly, you can get a pretty decent sentence-detection model with fewer than 100 sentences, whereas for the rest of the components (Tokenizer, POSTagger, NER, etc.) you need thousands of sentences!

Jim
