To train models of any type you need training data... The pretrained
English tokenizer was trained on the CoNLL shared task data, if I
remember correctly... Maybe one of the developers can shed some light on
this... Anyway, I don't think you need a dictionary, but rather training
data of the following form:
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a
nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing
group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated
Gold Fields PLC<SPLIT>, was named a nonexecutive director of this
British industrial conglomerate<SPLIT>.
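For concreteness, here is a small Python sketch (not part of the original
mail; the function name and the whitespace check are my own) of how such
<SPLIT>-annotated lines could be produced from a raw sentence plus its
token list: wherever two consecutive tokens are not separated by
whitespace in the raw text, the separator becomes <SPLIT>, otherwise a
plain space.

from typing import List

def to_split_format(raw: str, tokens: List[str]) -> str:
    """Join tokens with ' ' or '<SPLIT>' depending on whether the raw
    sentence has whitespace between two consecutive tokens."""
    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = raw.index(tok, pos)   # locate this token in the raw text
        if i > 0:
            gap = raw[pos:start]      # text between the previous token and this one
            # an empty gap means the two tokens were glued together in the raw text
            out.append(" " if gap else "<SPLIT>")
        out.append(tok)
        pos = start + len(tok)
    return "".join(out)

if __name__ == "__main__":
    raw = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."
    tokens = ["Mr.", "Vinken", "is", "chairman", "of", "Elsevier", "N.V.",
              ",", "the", "Dutch", "publishing", "group", "."]
    print(to_split_format(raw, tokens))
    # Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.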
Hope that helps,
Jim
P.S.: Did you mean an abbreviation dictionary? Well, you can't really
train a model using an abbreviation dictionary...
On 10/04/12 09:02, Joan Codina wrote:
I sent this a few days ago, but got no answer :-(( :
To train a tokenizer I can use a dictionary, but where is the dictionary
that was used to train the current English model? And where can I find
information about the dictionary format, so I can at least generate my
own?
thanks
Joan Codina