Hi Jörn

On Thu, Jun 21, 2012 at 9:50 AM, Jörn Kottmann <[email protected]> wrote:
> Hello,
>
> the lexical unit in the POS Tagger is a token. For the
> Spanish POS models multi-token chunks were converted
> into one token separated by a "_".
>
> To what would you set the lexical unit separator in your case?

I do the same, but I am a bit uneasy about it because:

1. I do not like having to pre- and post-process my data (here, to
   add/remove an underscore in the multi-word terms).
2. A model trained with the API, which does not require you to
   preprocess your data, will differ from a model trained with the
   CLI on the same data.
3. In the end, when you get a model, you do not know which
   segmentation it assumes or how the multi-word terms are
   represented.

Since it is often convenient to use the CLI, it would be nice to be
able to set the token separator, at least in order to build the same
models as with the API.

>
> The pos tag separator can already be configured in the class
> which reads the input, but this parameter is not set by the cli
> tool.
>
> +1 to make both configurable from the command line.

Nice. At least the idea has been proposed. If I have time...

>
> Jörn
>
>
> On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
>>
>> Hi Everyone
>>
>> I need to train the POS tagger on multi-word terms. In other words,
>> some of my lexical units are made of several tokens separated by
>> whitespace characters (like "traffic light", "feu rouge", "in order
>> to", ...).
>>
>> I think the training API can handle that but the command line
>> tools cannot. The former takes the words of a sentence as an array of
>> strings. The latter assumes that the whitespace character is the
>> lexical unit separator.
>> A convention like concatenating all the words which are part of a
>> multi-word term is not a solution, since in that case models built by
>> the command line and by the API will be different.
>>
>> It would be great if we could set by parameter what the lexical
>> unit separator is, as well as the POS tag separator.
>>
>> What do you think?
>>
>> /Nicolas
>>
>> [1]
>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api)
>
>

-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
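
P.S. For illustration only, the underscore workaround discussed in this
thread could be sketched as a small pre/post-processing script like the
one below. It is not part of OpenNLP: the function names
to_training_line and from_tagged_line and the example tags are made up,
and it assumes that the command line trainer reads one sentence per
line as whitespace-separated unit_tag items with "_" as the tag
separator.

```python
# Hypothetical helpers, not part of OpenNLP. Assumption: the CLI trainer
# reads whitespace-separated "unit_tag" items, one sentence per line.
UNIT_JOINER = "_"    # glues the tokens of one multi-word lexical unit
TAG_SEPARATOR = "_"  # separates a lexical unit from its POS tag


def to_training_line(units, tags):
    """Render one sentence as 'unit_tag unit_tag ...' (pre-processing)."""
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    return " ".join(
        unit.replace(" ", UNIT_JOINER) + TAG_SEPARATOR + tag
        for unit, tag in zip(units, tags)
    )


def from_tagged_line(line):
    """Split a 'unit_tag ...' line back into (units, tags) (post-processing)."""
    units, tags = [], []
    for item in line.split():
        # The tag is whatever follows the last separator in the item.
        unit, _, tag = item.rpartition(TAG_SEPARATOR)
        units.append(unit.replace(UNIT_JOINER, " "))
        tags.append(tag)
    return units, tags


if __name__ == "__main__":
    units = ["in order to", "stop", "at", "the", "traffic light"]
    tags = ["IN", "VB", "IN", "DT", "NN"]  # illustrative tags only
    line = to_training_line(units, tags)
    print(line)
    print(from_tagged_line(line))
```

Note that the round trip is lossy: a unit that already contains an
underscore comes back with a space instead, which is one concrete form
of point 3 above (the model alone does not tell you how multi-word
terms were represented).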
