On 4/22/2012 6:18 AM, Jim - FooBar(); wrote:
> On 21/04/12 23:30, James Kosin wrote:
>> On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
>>> On 13/02/12 23:07, Michael Collins wrote:
>>>> Does opennlp provide a way to create the *.train file based on a body
>>>> of text which I provide, or is the *.train file created another way?
>>> Apart from the sentence detector there is no way to automatically
>>> create training data for the other tasks (POS, NER, etc.); these are often
>>> language- and domain-dependent. For the sentence detector, however, it is
>>> easy to create your own private training data (as Jorn said) targeted
>>> especially at your problem domain, assuming of course that the
>>> pre-trained model is not good enough for you... I find it's pretty
>>> good! :)
>>>
>>> Jim
>> Also, unlike a lot of the other models, the sentence detector can
>> actually be trained and works quite well with just a few sentences to
>> train on. ~20-30 does really well.
>>
>> James
>>
>
> Wow!!! I did not know that!!! I thought the sentence detector needed
> thousands of sentences just like the other models! Thanks James...
>
> Jim

Jim,
The sentence detector is probably the simplest model; the next simplest would be the tokenizer. The sentence detector only needs to be trained to recognize the end of a sentence, which in most cases is a '.' or other terminating punctuation. I even trained it on a few sentences containing abbreviations with a '.' in them as well. Of course, in my case, with so few sample sentences, I had to use the training parameter that changes the cutoff to 1 instead of the default of 5.

The tokenizer, though, is trained to do more than just split off punctuation, so it will require a bit more data. The harder models, like the POS tagger, NameFinder, etc., require large volumes of data to be trained reliably.

James
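For anyone who wants to try this, here is a rough sketch of training a custom sentence model from a small training file (plain text, one sentence per line, blank lines between documents) while lowering the cutoff to 1. It assumes the OpenNLP 1.5.x-era Java API; the file names and language code are placeholders, so adjust them for your own data.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SmallSentDetectTrainer {
    public static void main(String[] args) throws Exception {
        // "my-sent.train" is a placeholder: one sentence per line,
        // with empty lines marking document boundaries.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("my-sent.train"), Charset.forName("UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // Lower the feature cutoff from the default of 5 to 1 so that a
        // training set of only ~20-30 sentences still yields a usable model.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "1");

        SentenceModel model;
        try {
            // "en" and the useTokenEnd/abbreviation-dictionary arguments are
            // just the common defaults here.
            model = SentenceDetectorME.train("en", samples, true, null, params);
        } finally {
            samples.close();
        }

        // Write the model out so it can be loaded later by SentenceDetectorME.
        OutputStream out = new FileOutputStream("en-sent-custom.bin");
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}

The resulting en-sent-custom.bin can then be loaded with new SentenceModel(...) and passed to SentenceDetectorME just like the pre-trained model.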
