Hello,

The training data for the tokenizer is not open source and can't be released due
to copyright restrictions.

For best performance, you should create your own training data based on social media text.
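
The 1.5 training format is one sentence per line, with <SPLIT> marking token
boundaries that aren't already indicated by whitespace, so your example would
be annotated as:

oooh<SPLIT>,,,,<SPLIT>go away fever and flu

Here is a minimal training sketch, assuming OpenNLP 1.5.x on the classpath
(the file names en-token-social.train and en-token-social.bin are just
placeholders):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSocialMediaTokenizer {

    public static void main(String[] args) throws IOException {
        // One sentence per line; <SPLIT> marks token boundaries that are
        // not already whitespace (placeholder file name).
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("en-token-social.train"), "UTF-8");
        ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

        TokenizerModel model;
        try {
            // true enables the alphanumeric optimization, which skips
            // boundary detection inside runs of alphanumeric characters.
            model = TokenizerME.train("en", sampleStream, true,
                    TrainingParameters.defaultParams());
        } finally {
            sampleStream.close();
        }

        // Serialize the model so it can be loaded the same way as en-token.bin.
        OutputStream modelOut = new BufferedOutputStream(
                new FileOutputStream("en-token-social.bin"));
        try {
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
    }
}

The same training can be run from the command line with the bundled
TokenizerTrainer tool, e.g.:

bin/opennlp TokenizerTrainer -model en-token-social.bin -lang en -data en-token-social.train -encoding UTF-8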

Jörn

On 03/31/2014 09:08 PM, Stuart Robinson wrote:
I've tried using the tokenizer model for English provided by OpenNLP:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

It's listed here, where it's described as "Trained on opennlp training
data":

http://opennlp.sourceforge.net/models-1.5/

It works pretty well, but I'm working on some social media text that has
non-standard punctuation. For example, it's not uncommon for words to
be separated by a series of punctuation characters, like so:

oooh,,,,go away fever and flu

I want to train up a new model using text like this but don't want to start
entirely from scratch. Is the training data for this model available from
OpenNLP? If so, I could experiment with supplementing its training data. It
seems like sharing training data, and not just trained models, could be a
great service.

Thanks,
Stuart Robinson

