+1 This question is unanswered for me as well. It would be a great help to get it answered.
-aditya

On Apr 1, 2014 12:38 AM, "Stuart Robinson" <[email protected]> wrote:

> I've tried using the tokenizer model for English provided by OpenNLP:
>
> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>
> It's listed here, where it's described as "Trained on opennlp training
> data":
>
> http://opennlp.sourceforge.net/models-1.5/
>
> It works pretty well, but I'm working on some social media text that has
> some non-standard punctuation. For example, it's not uncommon for words to
> be separated by a series of punctuation characters, like so:
>
> oooh,,,,go away fever and flu
>
> I want to train a new model using text like this but don't want to start
> entirely from scratch. Is the training data for this model available from
> OpenNLP? If so, I could experiment with supplementing its training data. It
> seems like sharing training data, and not just trained models, could be a
> great service.
>
> Thanks,
> Stuart Robinson
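For anyone who lands on this thread: whether or not the original training data is released, you can train a custom tokenizer from your own annotated text. Below is a minimal sketch using the OpenNLP 1.5.x command-line trainer; it assumes OpenNLP is installed and on the PATH, and the file names (`tokenizer.train`, `en-token-custom.bin`) are hypothetical.

```shell
# tokenizer.train format: one sentence per line. Whitespace already implies
# a token boundary; the <SPLIT> tag marks boundaries that occur WITHOUT
# whitespace. E.g., to make the punctuation run its own token:
#
#   oooh<SPLIT>,,,,<SPLIT>go away fever and flu

# Train a new tokenizer model from the annotated data:
opennlp TokenizerTrainer -lang en -encoding UTF-8 \
  -data tokenizer.train -model en-token-custom.bin
```

The resulting `en-token-custom.bin` can then be loaded the same way as the distributed `en-token.bin`. Note this trains from scratch on your data; combining it with the original model's training material would still require that data to be published, which is exactly the open question above.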
