Well, thanks Jörn. This settles it for me. Let me see how both models can be used in tandem. If I observe anything non-trivial, I'll share it. -a
On Wed, Apr 2, 2014 at 4:25 PM, Jörn Kottmann <[email protected]> wrote:
> Hello,
>
> the training data for the tokenizer is not Open Source and can't be
> released due to copyright restrictions.
>
> For best performance you should create your own training data based on
> social media texts.
>
> Jörn
>
>
> On 03/31/2014 09:08 PM, Stuart Robinson wrote:
>
>> I've tried using the tokenizer model for English provided by OpenNLP:
>>
>> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>>
>> It's listed here, where it's described as "Trained on opennlp training
>> data":
>>
>> http://opennlp.sourceforge.net/models-1.5/
>>
>> It works pretty well, but I'm working on some social media text that has
>> some non-standard punctuation. For example, it's not uncommon for words
>> to be separated by a series of punctuation characters, like so:
>>
>> oooh,,,,go away fever and flu
>>
>> I want to train up a new model using text like this but don't want to
>> start entirely from scratch. Is the training data for this model
>> available from OpenNLP? If so, I could experiment with supplementing its
>> training data. It seems like sharing training data, and not just trained
>> models, could be a great service.
>>
>> Thanks,
>> Stuart Robinson
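For anyone picking this thread up later: below is a minimal sketch of what "create your own training data" looks like in code against the OpenNLP 1.5.x API. The file names (social-media-tokens.train, en-token-social.bin) are placeholders, and the training file is assumed to use OpenNLP's standard tokenizer format: one sentence per line, with <SPLIT> marking token boundaries that are not already whitespace.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainSocialMediaTokenizer {

    public static void main(String[] args) throws Exception {
        // Training file (placeholder name): one sentence per line, with
        // <SPLIT> marking token boundaries not separated by whitespace, e.g.
        //   oooh<SPLIT>,,,,<SPLIT>go away fever and flu
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(
                        new FileInputStream("social-media-tokens.train"), "UTF-8"));

        // Parse each line into a TokenSample (text plus token spans).
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Train a maxent tokenizer model for English; the boolean enables
        // the alphanumeric optimization.
        TokenizerModel model = TokenizerME.train("en", samples, true);

        // Serialize the model so it can be loaded later like en-token.bin.
        OutputStream out = new FileOutputStream("en-token-social.bin");
        model.serialize(out);
        out.close();
        samples.close();
    }
}
```

The same training can also be run from the bundled command-line tool, e.g. `opennlp TokenizerTrainer -lang en -encoding UTF-8 -data social-media-tokens.train -model en-token-social.bin` (flag names as of 1.5.x). Either way, social-media sentences like the fever/flu example above can be mixed into the training file to teach the tokenizer about runs of punctuation.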
