Chris,

The tools are part of the source code. The heart is the MAXENT (maximum
entropy) model, which is trained on the data. Most of the trainers now
have CLI interfaces and usually work on the raw training data. Where the
raw training data is in a format the trainers can't consume directly,
converters have been built, also part of the source.
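To give a concrete picture, a training run with one of the 1.5
command-line tools looks roughly like the sketch below (tool name and
flags are from my memory of the 1.5 docs, so double-check against the
list you get from running opennlp with no arguments); a parser-specific
sketch is at the bottom of this message:

    $ opennlp TokenizerTrainer -encoding UTF-8 -lang en \
          -data en-token.train -model en-token.bin

Here en-token.train is raw training data (one sentence per line, with
token boundaries marked by <SPLIT> tags where whitespace doesn't already
separate them), and the trained model is written to en-token.bin.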
The group is currently trying to start a push to find freely available
corpora (or training data). Most of the training data we currently have
is copyrighted and cannot be released in its raw form. The models are
fine to release, because they don't contain any of the original text.
Unfortunately, this means additional training is not possible without
the entire training set. Even if you had it, most of the training runs
take hours, since the sets contain many, many samples. Another
unfortunate thing is that most of the texts are news articles rather
than material drawn from other sources.

James

On 2/15/2011 10:37 AM, Chris Spencer wrote:
> I suspected this might be the case. What about the tools used to
> generate the model? Are those freely available or part of OpenNLP?
>
> I tried searching through OpenNLP's codebase, but I'm still new to it,
> so I'm not really sure what I'm looking for.
>
> Regards,
> Chris
>
> On Mon, Feb 14, 2011 at 5:58 PM, James Kosin <[email protected]> wrote:
>> Chris,
>>
>> Unfortunately, most, if not all, of the training data is not FREE or
>> openly available due to copyright. If you would like to start a group
>> to collect non-copyrighted text and parse it by hand, you are more
>> than welcome and encouraged to do so.
>> Jorn or Jason may have a more complete set of training data and could
>> help if you pass on your samples.
>>
>> James
>>
>> On 2/13/2011 11:03 PM, Chris Spencer wrote:
>>> Where would we download the source data and tools used to generate
>>> the pretrained models available at
>>> http://opennlp.sourceforge.net/models-1.5/, specifically for the
>>> English Treebank Parser?
>>>
>>> I have a large corpus of hand-corrected sentence/parse-tree pairs,
>>> as well as an extended lexicon, and I'd like to incorporate these
>>> into the training data and retrain a new parser better fitted for
>>> my domain.
>>>
>>> Regards,
>>> Chris
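For the treebank parser specifically, the same pattern should apply; a
rough sketch, again from memory of the 1.5 CLI help (double-check the
flags against the opennlp ParserTrainer usage output), assuming
train.parse holds one Penn Treebank-style bracketed parse per line and
en_head_rules is the English head-rules file from the source tree:

    $ opennlp ParserTrainer -encoding UTF-8 -lang en -parserType CHUNKING \
          -head-rules en_head_rules -data train.parse -model en-parser-chunking.bin

Hand-corrected parse trees in that format should drop straight into a
run like this.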
