Yes Jim, you need to train, and that is the right format. Thank you. An abbreviation dictionary can increase effectiveness when dealing with abbreviations, but you still need the model.
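As a side note for anyone following along: the <SPLIT> training format Jim quotes below encodes token boundaries inline. Whitespace always separates tokens, and <SPLIT> marks an extra boundary where no whitespace appears in the original text. A minimal sketch of that reading (plain Python just to illustrate the format; this helper is not part of OpenNLP):

```python
def parse_split_format(line):
    """Recover the token sequence from a <SPLIT>-annotated training line.

    Whitespace separates tokens; <SPLIT> marks an additional token
    boundary with no surrounding whitespace (e.g. before punctuation).
    """
    tokens = []
    for chunk in line.split():
        tokens.extend(t for t in chunk.split("<SPLIT>") if t)
    return tokens

print(parse_split_format("Pierre Vinken<SPLIT>, 61 years old<SPLIT>."))
# -> ['Pierre', 'Vinken', ',', '61', 'years', 'old', '.']
```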
Just a note. Often you don't need to convert to the OpenNLP format yourself; you can use the formatters instead. I will explain how to use them in 1.5.2-incubating. This process was improved in trunk and will be a lot easier in the next release.

The tool to use is the *SentenceDetectorConverter*:

$ bin/opennlp SentenceDetectorConverter
Usage: opennlp SentenceDetectorConverter format ...

For now you need to know the available formats yourself. They are *conllx*, *pos*, and *namefinder* (this has already been improved, and the next release will list them for you).

For example, to create the Sentence Detector training data from conllx:

$ bin/opennlp SentenceDetectorConverter conllx
Usage: opennlp SentenceDetectorConverter conllx -encoding charsetName -data sampleData -detokenizer dictionary

Arguments description:
        -encoding charsetName
        -data sampleData
        -detokenizer dictionary

You will need a detokenizer dictionary. There is one for English here:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml?view=co

William

On Tue, Apr 10, 2012 at 10:05 AM, [email protected] <[email protected]> wrote:

> I checked the English models from the download page. They were not trained
> using an abbreviation dictionary. If they were, you would be able to see it
> if you extracted the model like a zip file. So we don't have a basic English
> abbreviation dictionary for you to start with; you will need to create
> yours from scratch.
>
> To create your own abbreviation dictionary, use the *DictionaryBuilder* tool:
>
> $ bin/opennlp DictionaryBuilder
> Usage: opennlp DictionaryBuilder -inputFile in -outputFile out [-encoding charsetName]
>
> Arguments description:
>         -inputFile in
>                 Plain file with one entry per line
>         -outputFile out
>                 The dictionary file.
>         -encoding charsetName
>                 specifies the encoding which should be used for reading and
>                 writing text. If not specified the system default will be used.
>
> The output looks like this:
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/abb.xml?view=markup
>
> On Tue, Apr 10, 2012 at 6:31 AM, Jim - FooBar(); <[email protected]> wrote:
>
>> To train models of any type you need training data... The pretrained
>> English tokenizer was trained on the CoNLL shared task, if I remember
>> correctly... Maybe one of the developers can shed some light on
>> this... Anyway, I don't think you need a dictionary, but rather training
>> data of the following form:
>>
>> Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
>> nonexecutive director Nov. 29<SPLIT>.
>> Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
>> group<SPLIT>.
>> Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
>> Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British
>> industrial conglomerate<SPLIT>.
>>
>> Hope that helps,
>>
>> Jim
>>
>> P.S.: Did you mean an abbreviation dictionary? Well, you can't really
>> train a model using an abbreviation dictionary...
>>
>> On 10/04/12 09:02, Joan Codina wrote:
>>
>>> I sent this some days ago, but I got no answer :-(( :
>>>
>>> To train a tokenizer I can use a dictionary, but
>>> where is the dictionary used to train the current English model? And
>>> where can I find information about the dictionary format, so I can,
>>> at least, generate my own?
>>>
>>> Thanks,
>>> Joan Codina
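To answer Joan's last question concretely: the dictionary is plain XML, as in the abb.xml sample linked above. A minimal Python sketch of building one from a list of entries (the element and attribute names here are inferred from that sample file, not from the OpenNLP source, so double-check against what DictionaryBuilder actually emits in your version):

```python
import xml.etree.ElementTree as ET

def build_abbrev_dict(entries):
    """Build an OpenNLP-style dictionary XML string from plain entries.

    Mirrors what the DictionaryBuilder tool produces from a file with one
    entry per line (structure assumed from the abb.xml test resource).
    """
    root = ET.Element("dictionary", {"case_sensitive": "false"})
    for abbrev in entries:
        entry = ET.SubElement(root, "entry")
        token = ET.SubElement(entry, "token")
        token.text = abbrev
    return ET.tostring(root, encoding="unicode")

print(build_abbrev_dict(["Mr.", "Mrs.", "Dr.", "etc."]))
```

In practice you would just feed the plain one-entry-per-line file to `bin/opennlp DictionaryBuilder` as William describes; this sketch only shows what the resulting file looks like.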
