I've tried using the tokenizer model for English provided by OpenNLP: http://opennlp.sourceforge.net/models-1.5/en-token.bin
It's listed here, described as "Trained on opennlp training data": http://opennlp.sourceforge.net/models-1.5/

It works pretty well, but I'm working with social media text that uses some non-standard punctuation. For example, it's not uncommon for words to be separated by a series of punctuation characters, like so:

    oooh,,,,go away fever and flu

I want to train a new model on text like this, but I don't want to start entirely from scratch. Is the training data for this model available from OpenNLP? If so, I could experiment with supplementing it with my own data. It seems like sharing training data, and not just trained models, could be a great service.
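For what it's worth, here's a rough sketch of how I'd expect to retrain once I have data in OpenNLP's tokenizer training format (one sentence per line, whitespace-separated tokens, with <SPLIT> marking token boundaries that aren't whitespace). The file name social-media.train and the class name are my own placeholders, and I'm going from the 1.5 API docs, so treat this as untested:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class RetrainTokenizer {
        public static void main(String[] args) throws Exception {
            // social-media.train is a hypothetical file in the tokenizer
            // training format; a line for my example above would look like:
            //   oooh<SPLIT>,<SPLIT>,<SPLIT>,<SPLIT>,go away fever and flu
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(
                            new FileInputStream("social-media.train"), "UTF-8"));
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // "en" is the language code; the final flag controls the
            // alphanumeric optimization (never split inside a run of
            // alphanumeric characters), which I'd guess should be off
            // for noisy social media text
            TokenizerModel model = TokenizerME.train("en", samples, false);

            // Write the new model out in the same .bin format as en-token.bin
            OutputStream out = new FileOutputStream("en-token-social.bin");
            model.serialize(out);
            out.close();
            samples.close();
        }
    }

If I'm reading the docs right, the opennlp TokenizerTrainer command-line tool does the same thing without writing any code. Either way, I'd still need training data to start from, hence my question.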
Thanks,
Stuart Robinson