Thanks
I know I need to train a model on data with the <space> markers, but if
I can add a list of domain abbreviations, I hope I will be able to solve
some of the tokenization problems I have.
I will also expand the training set a bit with some sentences that I
expect to be problematic.
But the only file I found to train the system is this one, which is small:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
It only contains 121 sentences. I don't know if this is enough, or
whether other annotated training sets are available.
Joan
On 10/04/12 15:20, Jim - FooBar(); wrote:
On 10/04/12 14:18, Jörn Kottmann wrote:
On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
But you still cannot "train" anything (maxent/perceptron) on the
dictionary, can you?
One needs training data for that, yes?
The dictionary is used to produce additional features to our standard
feature set.
Therefore you need training data to train our statistical tokenizer,
even though the feature
generation can use a dictionary to produce features.
Jörn
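
The point above, that the dictionary only contributes extra features and
does not replace training data, can be illustrated with a minimal sketch.
This is not OpenNLP's actual feature generator: the function name
split_features, the feature strings, and the ABBREVIATIONS set are all
hypothetical, chosen only to show the idea. For each candidate split point
inside a whitespace-delimited token, the generator emits the usual context
features and adds one more when the whole token is a known abbreviation.

```python
# Hypothetical abbreviation list; in practice this would be the domain dictionary.
ABBREVIATIONS = {"dr.", "e.g.", "approx."}


def split_features(token, index, abbreviations=ABBREVIATIONS):
    """Return features for the candidate split between token[:index] and token[index:].

    The feature names are illustrative assumptions, not OpenNLP's own.
    """
    if not 0 < index < len(token):
        raise ValueError("index must fall strictly inside the token")

    prefix, suffix = token[:index], token[index:]
    features = [
        "p=" + prefix,       # text before the candidate split
        "s=" + suffix,       # text after the candidate split
        "p1=" + prefix[-1],  # character just before the split
        "n1=" + suffix[0],   # character just after the split
    ]
    # Dictionary-derived feature: it is only a hint for the classifier.
    # The model still has to learn from annotated examples what the hint means.
    if token.lower() in abbreviations:
        features.append("abb")
    return features
```

A classifier trained on annotated sentences would then learn, for example,
that a candidate split carrying the "abb" feature should usually be
rejected. Without training examples the feature carries no weight at all.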
aha ok, that makes sense...
Jim
--
Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra