Chris,

The tools are part of the source code. The heart is the MAXENT (maximum
entropy) model, which is trained on the data. Most of the trainers now
have CLI interfaces and usually work on the raw training data. Where the
raw training data is in a format the trainers can't consume directly,
converters have been built, also part of the source.
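To give a concrete picture, a training run with one of the 1.5
command-line tools looks roughly like the sketch below (tool name and
flags are from my memory of the 1.5 docs, so double-check against the
list you get from running opennlp with no arguments); a parser-specific
sketch is at the bottom of this message:

    $ opennlp TokenizerTrainer -encoding UTF-8 -lang en \
          -data en-token.train -model en-token.bin

Here en-token.train is raw training data (one sentence per line, with
token boundaries marked by <SPLIT> tags where whitespace doesn't already
separate them), and the trained model is written to en-token.bin.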
The group is currently trying to start a push to find freely available
corpora (or training data). Most of the training data we currently have
is copyrighted and cannot be released in its raw form. The models are
fine to release, because they don't contain any of the original text.
Unfortunately, this means additional training is not possible without
the entire training set. Even if you had it, most of the training runs
take hours, since the sets contain many, many samples. Another
unfortunate thing is that most of the texts are news articles rather
than material drawn from other sources.

James

On 2/15/2011 10:37 AM, Chris Spencer wrote:
> I suspected this might be the case. What about the tools used to
> generate the model? Are those freely available or part of OpenNLP?
>
> I tried searching through OpenNLP's codebase, but I'm still new to it,
> so I'm not really sure what I'm looking for.
>
> Regards,
> Chris
>
> On Mon, Feb 14, 2011 at 5:58 PM, James Kosin <[email protected]> wrote:
>> Chris,
>>
>> Unfortunately, most, if not all, of the training data is not FREE or
>> openly available due to copyright. If you would like to start a group
>> to collect non-copyrighted text and parse it by hand, you are more
>> than welcome and encouraged to do so.
>> Jorn or Jason may have a more complete set of training data and could
>> help if you pass on your samples.
>>
>> James
>>
>> On 2/13/2011 11:03 PM, Chris Spencer wrote:
>>> Where would we download the source data and tools used to generate
>>> the pretrained models available at
>>> http://opennlp.sourceforge.net/models-1.5/, specifically for the
>>> English Treebank Parser?
>>>
>>> I have a large corpus of hand-corrected sentence/parse-tree pairs,
>>> as well as an extended lexicon, and I'd like to incorporate these
>>> into the training data and retrain a new parser better fitted for
>>> my domain.
>>>
>>> Regards,
>>> Chris
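For the treebank parser specifically, the same pattern should apply; a
rough sketch, again from memory of the 1.5 CLI help (double-check the
flags against the opennlp ParserTrainer usage output), assuming
train.parse holds one Penn Treebank-style bracketed parse per line and
en_head_rules is the English head-rules file from the source tree:

    $ opennlp ParserTrainer -encoding UTF-8 -lang en -parserType CHUNKING \
          -head-rules en_head_rules -data train.parse -model en-parser-chunking.bin

Hand-corrected parse trees in that format should drop straight into a
run like this.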
