On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
> To make this reliable the tokenization on new unseen
> text must be done correctly. For the Spanish data we had
> a special chunker to put the multi word units into one token.
>
> Do you use something like that?

Kind of.

A POS tagger should always be used with the same tokenizer as the one
used to train it.
The more multi-word units you manage to recognize automatically, the
more you should consider them in your training: it makes the further
syntax analysis easier.
(The two processes can interact, i.e. recognizing multi-word units can
require a POS tagging of the simple words, but I won't go into that
here.)
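
With the 1.5 API this consistency just means feeding tag() with the
output of the training-time tokenizer. A minimal sketch (the model
file names are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Load the tokenizer that was used when the POS model was trained.
InputStream tokIn = new FileInputStream("en-token.bin");
TokenizerModel tokModel = new TokenizerModel(tokIn);
TokenizerME tokenizer = new TokenizerME(tokModel);

// Load the POS model trained on that tokenization.
InputStream posIn = new FileInputStream("en-pos-maxent.bin");
POSModel posModel = new POSModel(posIn);
POSTaggerME tagger = new POSTaggerME(posModel);

// Tag tokens produced by the very same tokenizer.
String[] tokens = tokenizer.tokenize("Nico wants to get to bed earlier.");
String[] tags = tagger.tag(tokens);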

In the FTB, the notion of multi-word expression (compound) is a bit
fuzzy, and it is not simple to automatically reproduce the
tokenization it assumes.

>
> What do you think about outputting a special pos tag to indicate
> that it is a multi-word tag?

I do not see the point at the POS analysis level. We must try to
re-use what already exists to preserve compatibility.
Here I should mention that [Green:2011:MEI:2145432.2145516] (see
below) introduced multi-word tags (one for each pos label mw noun, mw
verb, mw preposition...) at the chunk and syntax levels. This is
probably a nice idea.


The cli should offer a way to specify what the multi-word expressions
in the data are.
This can be done with a parameter that sets the token separator
character.
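
Concretely, the training call could look something like this (the
-tokenSeparator flag is hypothetical, it is exactly the parameter
being proposed here; the other arguments are the usual
POSTaggerTrainer ones):

$ opennlp POSTaggerTrainer -lang fr -encoding UTF-8 \
    -data corpus.train -model fr-pos.bin -tokenSeparator " "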

Models built from the cli or the API should be the same.
One way to achieve that is to use a parameter to set the multi-word
separator character, and to turn this separator character into
whitespace before training the model.
For example with " " as the token separator character, "_" as the
multi-word separator character and "/" as the pos tag separator, the
following sentence
Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
sleep/VB longer/RB
should be turned into
String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
"earlier", "in order to", "sleep", "longer"};
(note "in order to")
What do you think?

Best

/Nicolas


@inproceedings{Green:2011:MEI:2145432.2145516,
  author    = {Green, Spence and de Marneffe, Marie-Catherine and Bauer, John and Manning, Christopher D.},
  title     = {Multiword expression identification with tree substitution grammars: a parsing tour de force with French},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
  series    = {EMNLP '11},
  year      = {2011},
  isbn      = {978-1-937284-11-4},
  location  = {Edinburgh, United Kingdom},
  pages     = {725--735},
  numpages  = {11},
  url       = {http://dl.acm.org/citation.cfm?id=2145432.2145516},
  acmid     = {2145516},
  publisher = {Association for Computational Linguistics},
  address   = {Stroudsburg, PA, USA},
}


>
> Jörn
>
>
> On 06/21/2012 04:47 PM, Nicolas Hernandez wrote:
>>
>> Hi Jörn
>>
>> On Thu, Jun 21, 2012 at 9:50 AM, Jörn Kottmann<[email protected]>  wrote:
>>>
>>> Hello,
>>>
>>> the lexical unit in the POS Tagger is a token. For the
>>> Spanish POS models multi-token chunks were converted
>>> into one token separated by a "_".
>>>
>>> To what would you set the lexical unit separator in your case?
>>
>> I do the same, but... I'm a bit uneasy about doing that because
>> 1. I do not like to pre- and post-process my data (here, to add/remove an
>> underscore to the multi-word terms)
>> 2. A model trained with the API, which allows you not to preprocess
>> your data, will be different from the model trained with the cli on
>> the same data
>> 3. Finally, when you get a model you do not know which segmentation it
>> assumes and how the multi-word terms are represented
>>
>> Since it is often convenient to use the cli, it would be nice to be
>> able to set the token separator, at least to build the same models as
>> with the API.
>>
>>> The pos tag separator can already be configured in the class
>>> which reads the input, but this parameter cannot be set by the cli
>>> tool.
>>>
>>> +1 to make both configurable from the command line.
>>
>> Nice.
>>
>> At least the idea has been proposed. If I have time...
>>
>>> Jörn
>>>
>>>
>>> On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
>>>>
>>>> Hi Everyone
>>>>
>>>> I need to train the POS tagger on multi-word terms. In other words,
>>>> some of my lexical units are made of several tokens separated by
>>>> whitespace characters (like "traffic light", "feu rouge", "in order
>>>> to", ...).
>>>>
>>>> I think the training API can handle that, but the command line
>>>> tools cannot. The former takes the words of a sentence as an array of
>>>> strings. The latter assumes that the whitespace character is the
>>>> lexical unit separator.
>>>> A convention like concatenating all the words which are part of a
>>>> multi-word term is not a solution, since in that case models built by
>>>> the command line and by the API will be different.
>>>>
>>>> It would be great if we could set by parameter what the lexical
>>>> unit separator is, as well as the POS tag separator.
>>>>
>>>> What do you think ?
>>>>
>>>> /Nicolas
>>>>
>>>> [1]
>>>>
>>>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api)
>>>
>>>
>>
>>
>



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
