This was not intentional. As I said, I wanted to use the POSTaggerTrainer with a tagset whose values would be word lemmas. Consequently, instead of thirty tag values, I had thirty thousand distinct tag values... I did it out of curiosity; the approach works fine for predicting gender, number, person...
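To see how quickly the tagset grows when lemmas are used as tags, here is a minimal sketch that counts the distinct tag values in a line of word_tag training data. It assumes each token is split at its last underscore; the class name TagsetSize and the method distinctTags are hypothetical and not part of OpenNLP.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: counts the distinct tags
// found in one line of whitespace-separated word_tag tokens.
final class TagsetSize {

    // Assumption: the tag is whatever follows the LAST underscore of a token,
    // so "tout_entière_tout#:#entière" yields the tag "tout#:#entière".
    static Set<String> distinctTags(String line) {
        Set<String> tags = new LinkedHashSet<>();
        for (String token : line.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) {
                continue; // token carries no tag (or the line was empty); skip it
            }
            tags.add(token.substring(sep + 1));
        }
        return tags;
    }

    public static void main(String[] args) {
        String line = "Il_il est_être vrai_vrai ,_, si_si l'on_on";
        Set<String> tags = distinctTags(line);
        System.out.println(tags.size() + " distinct tags: " + tags);
    }
}
```

Run over a whole corpus with the results merged into one set, the same loop would give the total number of outcomes the trainer has to handle.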
Here is an excerpt of my corpus:

Il_il est_être vrai_vrai ,_, si_si l'on_on en_en croit_croire le_le rapport_rapport Delors_Delors que_que c'_ce est_être un_un organisme_organisme du_de#:#le même_même genre_genre que_que l'on_on veut_vouloir créer_créer au_à#:#le bénéfice_bénéfice de_de l'_le Europe_Europe tout_entière_tout#:#entière ._.

I opened the following issue:
https://issues.apache.org/jira/browse/OPENNLP-578

/Nicolas

On Mon, May 13, 2013 at 6:13 PM, Jörn Kottmann <[email protected]> wrote:
> On 05/13/2013 03:44 PM, Nicolas Hernandez wrote:
>>
>> I ve tried to use the postagger command to learn models of various
>> morphological features. Even if I know it is not adapted to, I also
>> try to build a model for lemma tagging....
>
>
> Looks like we do not support strings for features larger than 64KB, as
> pointed out this seems to be a bug in our serializer code. Anyway, why do
> you use such large strings for features? Is this intentional?
>
> Would you mind to open a jira issue for this?
>
> Thanks,
> Jörn

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
