Re: abbreviation dictionary format

Joan Codina Wed, 11 Apr 2012 00:17:13 -0700

OK,
I will try it,
but doesn't this introduce a bias, since the de-tokenizer has only a few rules?

Is there no way to incrementally train an existing model, or just add a dictionary of abbreviations to an existing model?


Joan

On 10/04/12 16:51, Jörn Kottmann wrote:

On 04/10/2012 04:44 PM, Joan Codina wrote:
But to train the system I only found this file, which is small:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
It contains only 121 sentences. I don't know if this is enough or whether there are other annotated training sets.
No, that is not enough. Get a training data set for the language you need. Most of the data sets referenced in the Corpora section can be used to train the tokenizer. These corpora are already tokenized
and can be de-tokenized into training data for the tokenizer.

Jörn
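
To make the de-tokenization step concrete, here is a minimal sketch of how an already tokenized sentence might be turned into a tokenizer training line. It assumes the OpenNLP training format of one sentence per line, with tokens separated either by whitespace or by a <SPLIT> tag. The rule sets ATTACH_LEFT and ATTACH_RIGHT and the function name are hypothetical, chosen only for illustration; this is not OpenNLP's own detokenizer.

```python
SPLIT = "<SPLIT>"

# Hypothetical rule sets (assumptions, not taken from OpenNLP):
# tokens that attach to the previous token, and tokens that attach to the next.
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "%"}
ATTACH_RIGHT = {"(", "[", "$"}


def detokenize_to_training_line(tokens):
    """Join tokens into one training line.

    A <SPLIT> tag marks a token boundary with no whitespace in the
    de-tokenized text; every other boundary is a plain space.
    """
    parts = []
    glue_next = False  # True when the previous token attaches to the next one
    for tok in tokens:
        if not parts:
            parts.append(tok)
        elif tok in ATTACH_LEFT or glue_next:
            parts.append(SPLIT + tok)
        else:
            parts.append(" " + tok)
        glue_next = tok in ATTACH_RIGHT
    return "".join(parts)


if __name__ == "__main__":
    sentence = ["Mr.", "Smith", "paid", "(", "about", ")", "$", "5", "."]
    print(detokenize_to_training_line(sentence))
```

Since the rule sets alone decide where <SPLIT> appears, the resulting model can only learn the conventions those rules encode, so it is worth checking them against real text in the target language.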


--

Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra

