On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
> On 13/02/12 23:07, Michael Collins wrote:
>> Does opennlp provide a way to create the *.train file based on a body
>> of text which I provide, or is the *.train file created another way.
> Apart from the sentence detector there is no way to automatically
> create training data for other tasks (POS,NER etc)...these are often
> language and domain dependant. For the sentence detector however it is
> easy to create your own private training data (as Jorn said) targeted
> especially for your problem domain. assuming of course that the
> pre-trained model is not good enough for you...i find it's pretty
> good! :)
>
> Jim

The training data is based on a corpus of text that has already been
annotated for POS, names, or other purposes. Usually the annotation is
done by hand, or it is generated and then rechecked by humans to verify
accuracy.

Unfortunately for most users, the corpora are usually built on
copyrighted text, meaning they cannot be freely distributed. Most
providers release either only the annotation data, which has to be
merged with the original text (i.e., you have to run scripts that take
multiple files and merge them with the data to get the final corpus),
or only small samples of a corpus. Either way, the copyright usually
prohibits commercial usage, or any usage other than research.

We do have projects we want to start in order to build our own corpus
from freely available text, one that we can distribute freely for any
purpose based on OpenNLP. This is also why our models are currently on
SourceForge only: they are distributed under licenses that are not
Apache friendly.

James
