Thanks
I know I need to train a model with the <space> splits, but if I can add a list of domain abbreviations, I hope I will be able to solve some of the problems I have with tokenization. I will also expand the training set a bit with some sentences I find conflictive.
But to train the system I only found that file... which is small.
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
which contains only 121 sentences. I don't know whether this is enough, or whether other annotated training sets exist.
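For what it's worth, the format of that token.train file is one sentence per line: whitespace marks ordinary token boundaries, and a <SPLIT> tag marks a boundary with no whitespace (e.g. between a word and attached punctuation), so the file can be extended by hand with conflictive sentences. A minimal sketch of how one such line decodes into tokens (parse_tokenizer_line is just an illustrative helper, not part of OpenNLP):

```python
# Sketch: decode one line of OpenNLP tokenizer training data.
# Whitespace is an implicit split; <SPLIT> marks a split with no whitespace.

def parse_tokenizer_line(line: str) -> list[str]:
    """Return the token sequence encoded by one training line."""
    tokens = []
    for chunk in line.split():  # whitespace-separated pieces
        # each piece may carry <SPLIT> tags gluing several tokens together
        tokens.extend(t for t in chunk.split("<SPLIT>") if t)
    return tokens

line = "Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch group<SPLIT>."
print(parse_tokenizer_line(line))
# ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'group', '.']
```

So adding your own sentences is mostly a matter of inserting <SPLIT> wherever the tokenizer should split without whitespace, and leaving abbreviations like "N.V." unsplit.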


Joan



On 10/04/12 15:20, Jim - FooBar(); wrote:
On 10/04/12 14:18, Jörn Kottmann wrote:
On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:

But you still cannot "train" anything (maxent/perceptron) on the dictionary alone, can you? One needs training data for that, yes?

The dictionary is used to produce additional features on top of our standard feature set. Therefore you still need training data to train our statistical tokenizer, even though the feature
generation can use a dictionary to produce features.

Jörn

aha ok, that makes sense...

Jim

--

Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra
_______________________________________________________________________________

Before printing this e-mail, think about whether it is really necessary; if it is, remember that printing double-sided saves 25% of the paper, and the trees will thank you.
_______________________________________________________________________________

/The information in this electronic message is confidential, personal, and non-transferable, and is addressed only to the address(es) indicated above. If you are reading this message by mistake, please be advised that its disclosure, use, or distribution, in whole or in part, is prohibited, and we ask you to delete the original message together with its attachments without reading or saving it./

/Thanks./
