On a slightly different note, what you're trying to do is NOT
unreasonable... I'm thinking about the wider topic of
'probabilistic-classifiers' where 'training' essentially boils down to
gathering a bunch of probabilities. I can't think of any technical
reason that would prevent you from training an already-trained model on
some extra data, other than the implementation of a given system. In
other words, if all you've got is a bunch of probabilities, why not be
able to add to them in the future? Nothing can stop you from doing that
but the implementation specifics.
To make things more concrete, consider an HMM POS-tagger. When you
'train' it, all you're doing is 'observing' which tag appears more
frequently before the tag you're currently looking at. From those
frequency counts you build a probability matrix, which you consult later
on in order to make predictions. Now, consider this... let's say you
choose to represent those counts as a HashMap, so the entire model is a
HashMap... there is absolutely no problem retraining it whenever you get
some more data, without losing the original training counts. Of course
now you're going to say that the tag-set might be different, but what if
the first time you trained you only had access to the first half of the
Brown corpus, and after a couple of months you managed to find the other
half? The tag-set doesn't change within the same corpus, so you can
retrain your HMM without introducing noise... you just 'merge-with +'
the count maps at the end and voila! You end up with exactly the same
model you would have had if you'd trained on the entire Brown corpus in
one go.
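Just to make that last merging step concrete, here's a minimal Clojure
sketch of the 'merge-with +' idea (the tag names, counts, and the
nested-map layout are all made up for illustration, not taken from any
real corpus):

;; Transition counts from the two halves of the corpus, as nested maps
;; of {previous-tag {current-tag count}} (hypothetical numbers).
(def counts-first-half  {"DT" {"NN" 120, "JJ" 30}})
(def counts-second-half {"DT" {"NN" 95,  "JJ" 41}, "NN" {"VB" 60}})

;; Summing the two count maps gives exactly what one pass over the whole
;; corpus would have produced; normalise to probabilities afterwards.
(def merged-counts
  (merge-with #(merge-with + %1 %2) counts-first-half counts-second-half))
;; => {"DT" {"NN" 215, "JJ" 71}, "NN" {"VB" 60}}

The only requirement is that the model keeps the raw counts around (or
can recover them), since it's the counts that add up cleanly, not the
normalised probabilities.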
Having said all that, I'll admit that I've never encountered any
machine-learning implementation that allows you to do this, and I'm
wondering why... It's easy to implement and provides a ton of
flexibility. I've proven to myself that it can be done by implementing
what I described in the previous paragraph (an HMM POS-tagger), and it
works, but generally speaking libraries and frameworks don't allow it...
I'm saying all that so you don't go thinking that what you tried is just
plain wrong... Well, it is the way you tried it, but your underlying
thought is perfectly valid.
Hope that helps :)
Jim
On 03/09/13 17:42, Jim - FooBar(); wrote:
On 03/09/13 17:25, Danica Damljanovic wrote:
I was trying to find the original opennlp corpora used for training, but
could not get anything apart from the binary model...
Does anyone have any idea whether it is possible to get this, and how?
If I'm not mistaken the original corpora cannot be re-distributed due
to licensing issues... However, don't take my word for it - someone
with the appropriate authority should answer this (someone from the
dev-team)...
Also, if I remember correctly, you can get a pretty decent
sentence-detecting model with fewer than 100 sentences, whereas for the
rest of the components (Tokenizer, POSTagger, NER, etc.) you need
thousands of sentences!
Jim