Well, thanks Jörn. This settles it for me. Let me see how both models can be used in tandem. If I observe anything non-trivial, I'll share it. -a
On Wed, Apr 2, 2014 at 4:25 PM, Jörn Kottmann <[email protected]> wrote:
> Hello,
>
> the training data for the tokenizer is not Open Source and can't be
> released due to copyright restrictions.
>
> For best performance you should create your own training data based on
> social media texts.
>
> Jörn
>
>
> On 03/31/2014 09:08 PM, Stuart Robinson wrote:
>
>> I've tried using the tokenizer model for English provided by OpenNLP:
>>
>> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>>
>> It's listed here, where it's described as "Trained on opennlp training
>> data":
>>
>> http://opennlp.sourceforge.net/models-1.5/
>>
>> It works pretty well, but I'm working on some social media text that has
>> some non-standard punctuation. For example, it's not uncommon for words
>> to be separated by a series of punctuation characters, like so:
>>
>> oooh,,,,go away fever and flu
>>
>> I want to train up a new model using text like this but don't want to
>> start entirely from scratch. Is the training data for this model
>> available from OpenNLP? If so, I could experiment with supplementing its
>> training data. It seems like sharing training data, and not just trained
>> models, could be a great service.
>>
>> Thanks,
>> Stuart Robinson
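For anyone picking this thread up later: below is a minimal sketch of what "create your own training data" looks like in code against the OpenNLP 1.5.x API. The file names (social-media-tokens.train, en-token-social.bin) are placeholders, and the training file is assumed to use OpenNLP's standard tokenizer format: one sentence per line, with <SPLIT> marking token boundaries that are not already whitespace.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainSocialMediaTokenizer {

    public static void main(String[] args) throws Exception {
        // Training file (placeholder name): one sentence per line, with
        // <SPLIT> marking token boundaries not separated by whitespace, e.g.
        //   oooh<SPLIT>,,,,<SPLIT>go away fever and flu
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(
                        new FileInputStream("social-media-tokens.train"), "UTF-8"));

        // Parse each line into a TokenSample (text plus token spans).
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Train a maxent tokenizer model for English; the boolean enables
        // the alphanumeric optimization.
        TokenizerModel model = TokenizerME.train("en", samples, true);

        // Serialize the model so it can be loaded later like en-token.bin.
        OutputStream out = new FileOutputStream("en-token-social.bin");
        model.serialize(out);
        out.close();
        samples.close();
    }
}
```

The same training can also be run from the bundled command-line tool, e.g. `opennlp TokenizerTrainer -lang en -encoding UTF-8 -data social-media-tokens.train -model en-token-social.bin` (flag names as of 1.5.x). Either way, social-media sentences like the fever/flu example above can be mixed into the training file to teach the tokenizer about runs of punctuation.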
