To train models of any type you need training data. The pretrained English tokenizer was trained on the CoNLL shared task data, if I remember correctly. Maybe one of the developers can shed some light on this. Anyway, I don't think you need a dictionary, but rather training data of the following form:

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>. Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>. Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British industrial conglomerate<SPLIT>.
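
As a rough illustration of how this format reads: whitespace already separates tokens, and <SPLIT> marks a token boundary where there is no whitespace (for example, between a word and the comma that follows it). The Python sketch below recovers the token list from one annotated line. It is only meant to show the idea, not how the tokenizer trainer itself parses the data, and the function name is made up for this example.

```python
SPLIT_TAG = "<SPLIT>"

def tokens_from_annotated_line(line):
    """Return the tokens encoded in one line of <SPLIT>-annotated text.

    Hypothetical helper for illustration only: whitespace is treated as
    a token boundary, and <SPLIT> as a boundary without whitespace.
    """
    tokens = []
    for chunk in line.split():
        # "Vinken<SPLIT>," becomes ["Vinken", ","]; drop empty pieces.
        tokens.extend(part for part in chunk.split(SPLIT_TAG) if part)
    return tokens

if __name__ == "__main__":
    sample = "Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board"
    print(tokens_from_annotated_line(sample))
```

Note that "Mr." and "N.V." in the sample carry no <SPLIT> before the period, which is how the training data tells the model that those periods belong to the token.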

Hope that helps,

Jim

P.S.: Did you mean an abbreviation dictionary? Well, you can't really train a model using an abbreviation dictionary...

On 10/04/12 09:02, Joan Codina wrote:

I sent this some days ago, but I got no answer :-(( :

To train a tokenizer I can use a dictionary, but
where is the dictionary used to train the current English model? And
where can I find information about the dictionary format, so I can at least generate my own?

thanks
Joan Codina

