Re: Training the pos tagger on "multi whitespace-separated tokens" terms

Nicolas Hernandez Wed, 27 Jun 2012 06:35:41 -0700

On Wed, Jun 27, 2012 at 10:40 AM, Jörn Kottmann <[email protected]> wrote:
> On 06/21/2012 05:31 PM, Nicolas Hernandez wrote:
>>
>> On Thu, Jun 21, 2012 at 10:00 AM, Jörn Kottmann <[email protected]>
>> wrote:
>>>
>>> BTW, if you are training on the FrenchTreebank.
>>> We have dedicated format support for it in the trunk, would be
>>> easy to do the POS training with it.
>>
>> =)
>> I didn't know. That is nice for OpenNLP. I am sorry I did not answer
>> your invitation to integrate my own code. My approach was not
>> dedicated to parsing the FrenchTreebank, so I could not integrate it
>> easily.
>>
>> I've tried the converter, but I am not sure how to use it.
>> [2] gives one sentence per line with no POS tag associated with the tokens.
>>
>> Anyway, it is very tricky to choose what to consider as tokens (compounds
>> or only simple words) and which POS tag to give to the tokens.
>> I am not sure I understand the choices that were made in [1].
>> As soon as I manage to make the converter work, it will be
>> simpler to see them.
>
>
> The current implementation uses the same tag for all multi-word-unit
> tokens.
> For example: in_order_to/IN will become in/IN order/IN to/IN.
>
> Would you use the POS tags as they are in the data? Maybe it would be useful
> to add support for POS tag mappings. This would make it easy to experiment
> with different tag sets.
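
To make the multi-word-unit handling quoted above concrete, here is a minimal
sketch in Python. It is not OpenNLP code: the function names, the "_" joiner
and the "/" word-tag separator are assumptions taken from the
in_order_to/IN example only.

```python
def split_multiword(tagged_token, joiner="_"):
    """Split one 'word/TAG' token on the joiner and copy the tag to every part."""
    word, sep, tag = tagged_token.rpartition("/")
    if not sep:
        # Assumption: every token carries a tag; fail loudly otherwise.
        raise ValueError(f"token has no tag: {tagged_token!r}")
    return [(part, tag) for part in word.split(joiner) if part]


def split_sentence(line):
    """Apply split_multiword to every whitespace-separated token of a line."""
    pairs = []
    for token in line.split():
        pairs.extend(split_multiword(token))
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


if __name__ == "__main__":
    print(split_sentence("in_order_to/IN win/VB"))
```

Copying the same tag to every part keeps the data simple, but it also means
the tagger sees "order/IN", which is the kind of choice a user may want to
control.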


It is never easy to map tag sets, since the relation between them is
rarely bijective and, in addition, both processes must assume the same
tokenization.
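
A small Python sketch of why a non-bijective mapping is a problem. The
mapping table below is purely hypothetical (the tag names are illustrative,
not an official tag set): several fine-grained tags collapse onto one coarse
tag, so the mapping cannot be inverted without extra information.

```python
from collections import defaultdict

# Hypothetical fine-to-coarse mapping, for illustration only.
FINE_TO_COARSE = {
    "NC": "N",
    "NPP": "N",
    "ADJ": "A",
    "VINF": "V",
    "VPP": "V",
}


def map_tags(tagged, mapping):
    """Rewrite the tag of each (word, tag) pair; assumes identical tokenization."""
    return [(word, mapping[tag]) for word, tag in tagged]


def inverse_candidates(mapping):
    """For each target tag, collect every source tag that maps onto it."""
    inverse = defaultdict(set)
    for source, target in mapping.items():
        inverse[target].add(source)
    return dict(inverse)


def is_invertible(mapping):
    """True only if no two source tags share the same target tag."""
    return len(set(mapping.values())) == len(mapping)


if __name__ == "__main__":
    print(is_invertible(FINE_TO_COARSE))
    print(inverse_candidates(FINE_TO_COARSE))
```

Going from fine to coarse is a plain lookup. Going back is ambiguous, and
neither direction works if the two tag sets do not agree on token boundaries.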

I ran several experiments.

Since I like to use the MaltParser [1], I now need to adapt the FTB to
a tag set called ftb+, as described in [2].
(Only multi-word expressions that can be recognized by regular
expressions are considered, some POS tags result from concatenating
the cat and subcat attributes, and so on.)
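
As a rough illustration of the concatenation step only, here is a Python
sketch. The XML snippet is a made-up sample, and the element name "w" and
the "cat"/"subcat" attribute layout are assumptions for the example. The
actual ftb+ rules in [2] are more detailed and are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Made-up sample, for illustration only.
SAMPLE = (
    '<SENT>'
    '<w cat="N" subcat="C">chat</w>'
    '<w cat="V" subcat="">dort</w>'
    '</SENT>'
)


def combined_tag(word_element):
    """Concatenate cat and subcat; a missing or empty subcat adds nothing."""
    cat = word_element.get("cat", "")
    subcat = word_element.get("subcat", "")
    return cat + subcat


def tag_sentence(xml_text):
    """Return (token, tag) pairs for every <w> element in the sentence."""
    root = ET.fromstring(xml_text)
    return [((w.text or "").strip(), combined_tag(w)) for w in root.iter("w")]


if __name__ == "__main__":
    print(tag_sentence(SAMPLE))
```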

I plan to do it by processing the MarkupAnnotations provided by the
Tika MarkupAnnotator [3].

[1] http://www.maltparser.org/mco/french_parser/fremalt.html
[2] P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a
morphosyntactic lexicon for state-of-the-art POS tagging with less
human effort. In Proceedings of The Pacific Asia Conference on
Language, Information and Computation (PACLIC 23), Hong Kong, China.
[3] http://uima.apache.org/sandbox.html#tika.annotator


>
> Looks like we want to give the user some options on how these cases will be
> handled.
>
> We need to add direct support for training the POS Tagger on this data.
> Then you can do:
> bin/opennlp POSTaggerConverter frenchtreebank ...
>
> There is a problem with the Parse object and the way the command line tools
> are built. The CLI tools assume that they can serialize a sample object into
> training data via toString, but that does not work for the Parse object yet.
> To fix that we need to make a breaking API change and refactor some code in
> the coref component.
> Anyway it would be nice to get the parser trained on it as well.
>
> Jörn
