On 4/9/2011 6:18 PM, Olivier Grisel wrote:
> Hi,
>
> I would like to know where it is recommended to use SimpleTokenizer or
> a machine learning based model for tokenizing the input of
> NameFinderME models (for English person, place and organization models
> for instance).
>
> I suppose it is best to use the same tokenizer as the one used when
> training the NameFinder models in the first place, but I did not find
> any reference on the model downloading page at:
>
> http://opennlp.sourceforge.net/models-1.5/
>
> Cheers,
>

Hi Olivier,

First, that page lists all the models, both English and others.

Usually what happens is this: you take the raw text and run it through the Sentence detector models, which separate the text into sentences that are more easily parsed. Then each sentence is run through the Tokenizer, which breaks the sentence up into tokens (small pieces), usually words, and moves punctuation away from the words. Next, you use the Name finder models to parse the tokenized text.

Most of the models take the tokenized text and produce the required output. There is no model that is trained on any one set of data... at least I don't believe so.

Good luck,
James Kosin
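
To make the order of the steps above concrete, here is a minimal sketch of the same three-stage flow (raw text, then sentences, then tokens, then name spans) in plain Java. It does not use OpenNLP at all: the class PipelineSketch, its methods, and the knownNameTokens set are hypothetical stand-ins invented for illustration. In a real setup, OpenNLP's SentenceDetectorME, a tokenizer such as SimpleTokenizer or TokenizerME, and NameFinderME would take the place of the three methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in only: none of this is OpenNLP code.
public class PipelineSketch {

    // Stage 1: split raw text into sentences.
    // Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Stage 2: break a sentence into tokens, moving punctuation away from words.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : sentence.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c)); // punctuation becomes its own token
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stage 3: find names in the tokenized sentence.
    // Stand-in for a trained model: a run of tokens from a known set is one name.
    // Each result is {start, end} in token indices, end exclusive.
    static List<int[]> findNames(List<String> tokens, Set<String> knownNameTokens) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (knownNameTokens.contains(tokens.get(i))) {
                int start = i;
                while (i < tokens.size() && knownNameTokens.contains(tokens.get(i))) {
                    i++;
                }
                spans.add(new int[] {start, i});
            } else {
                i++;
            }
        }
        return spans;
    }

    // Runs the three stages in order and returns the matched names as strings.
    static List<String> extractNames(String text, Set<String> knownNameTokens) {
        List<String> names = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            List<String> tokens = tokenize(sentence);
            for (int[] span : findNames(tokens, knownNameTokens)) {
                names.add(String.join(" ", tokens.subList(span[0], span[1])));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("Olivier", "Grisel", "James", "Kosin"));
        String text = "Olivier Grisel wrote to the list. James Kosin replied.";
        for (String name : extractNames(text, known)) {
            System.out.println(name);
        }
    }
}
```

The point of the sketch is only the data flow: the name finder never sees raw text, it sees the token array produced by the previous stage, and its spans are expressed in token indices. That is why the choice of tokenizer matters for the name finder's input.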
