I've tried using the tokenizer model for English provided by OpenNLP:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

It's listed here, where it's described as "Trained on opennlp training
data":

http://opennlp.sourceforge.net/models-1.5/

It works pretty well, but I'm working on social media text with
non-standard punctuation. For example, it's not uncommon for words to
be separated by a run of punctuation characters, like so:

oooh,,,,go away fever and flu
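
For reference, applying that model with the stock 1.5 API looks
something like this (I'm not doing anything unusual on my end):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            // Load the pretrained English model shipped by OpenNLP.
            InputStream modelIn = new FileInputStream("en-token.bin");
            TokenizerModel model = new TokenizerModel(modelIn);
            modelIn.close();

            // Tokenize a line of social media text.
            TokenizerME tokenizer = new TokenizerME(model);
            String[] tokens =
                tokenizer.tokenize("oooh,,,,go away fever and flu");
            for (String t : tokens)
                System.out.println(t);
        }
    }

Input like that is where the model stumbles.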

I want to train a new model on text like this, but I don't want to
start entirely from scratch. Is the training data for this model
available from OpenNLP? If so, I could experiment with supplementing it
with annotated social media text. It seems like sharing training data,
and not just trained models, would be a great service.
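
To be concrete, here's a rough sketch of the retraining I have in mind,
based on the 1.5 TokenizerME API. The file names are just placeholders,
and the training format is one sentence per line with <SPLIT> marking
token boundaries that aren't already separated by whitespace:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainTokenizer {
        public static void main(String[] args) throws Exception {
            // en-token.train is a placeholder: one sentence per line,
            // with <SPLIT> marking token boundaries that lack
            // whitespace, e.g.
            //   oooh<SPLIT>,,,,<SPLIT>go away fever and flu
            ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("en-token.train"), "UTF-8");
            ObjectStream<TokenSample> samples =
                new TokenSampleStream(lines);

            TokenizerModel model = TokenizerME.train(
                "en", samples, true, TrainingParameters.defaultParams());
            samples.close();

            // Write the new model out under a placeholder name.
            OutputStream out =
                new FileOutputStream("en-token-social.bin");
            model.serialize(out);
            out.close();
        }
    }

If the original training data were available, I could simply
concatenate it with my annotated social media sentences before this
step.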

Thanks,
Stuart Robinson
