Re: Training the pos tagger on "multi whitespace-separated tokens" terms

Jörn Kottmann Thu, 21 Jun 2012 08:52:19 -0700

I will have a look at this next week. It should be possible
to train the POS Tagger directly from the file.


Anyway, we still need to offer some options for tag set
and multi-word handling.
It should be easy for a user to provide a tag-set mapping in case
they want to change something during training.
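
To make the idea of a tag-set mapping concrete, here is a minimal sketch of what a user-supplied mapping could look like when applied to the tags of one sentence before training. It is not existing OpenNLP code: the class name TagSetMapping, the method mapTags, and the tag names in the example are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TagSetMapping {

    // Hypothetical helper: rewrites each tag through a user-provided mapping.
    // Tags without an entry in the mapping are passed through unchanged.
    static String[] mapTags(String[] tags, Map<String, String> mapping) {
        String[] mapped = new String[tags.length];
        for (int i = 0; i < tags.length; i++) {
            mapped[i] = mapping.getOrDefault(tags[i], tags[i]);
        }
        return mapped;
    }

    public static void main(String[] args) {
        // Illustrative mapping only: collapse two noun tags into one coarse tag.
        Map<String, String> mapping = new HashMap<>();
        mapping.put("NC", "N");
        mapping.put("NPP", "N");

        String[] tags = {"DET", "NC", "V", "NPP"};
        System.out.println(Arrays.toString(mapTags(tags, mapping)));
        // prints [DET, N, V, N]
    }
}
```

Passing unknown tags through unchanged means a user only has to list the tags they actually want to change.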

Jörn

On 06/21/2012 05:31 PM, Nicolas Hernandez wrote:

On Thu, Jun 21, 2012 at 10:00 AM, Jörn Kottmann<[email protected]>  wrote:

BTW, if you are training on the FrenchTreebank.
We have dedicated format support for it in the trunk, would be
easy to do the POS training with it.

=)
I didn't know. That is nice for OpenNLP. I am sorry I did not answer
your invitation to integrate my own code. My approach was
not dedicated to parsing the FrenchTreebank, so I could not
integrate it easily.

I have tried the converter, but I am not sure how to use it.
[2] gives one sentence per line with no POS tags associated with the tokens.

Anyway, it is very tricky to choose what to consider as tokens (compounds
or only simple words) and which POS tag to give to the tokens.
I am not sure I understand the choices that have been made in [1].
As soon as I manage to make the converter work, it will be
easier to see them.

Best regards

[1] 
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/frenchtreebank/ConstitDocumentHandler.java?view=markup
[2] ./bin/opennlp ParserConverter frenchtreebank  -lang fr -data my/ftb/dir/


Jörn

On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:

Hi Everyone

I need to train the POS tagger on multi-word terms. In other words,
some of my lexical units are made of several tokens separated by
whitespace characters (like "traffic light", "feu rouge", "in order
to", ...).

I think the training API can handle that, but the command line
tools cannot. The former takes the words of a sentence as an array of
strings. The latter assumes that the whitespace character is the
lexical unit separator.
A convention like concatenating all the words that are part of a
multi-word term is not a solution, since in that case models built by
the command line and by the API will be different.

It would be great if we could set by parameter the lexical
unit separator as well as the POS tag separator.
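
As a rough illustration of the request, the sketch below parses one training line with both separators passed in as parameters, so that a unit such as "feu rouge" stays a single token. It is only a sketch under assumptions: the class TaggedLineParser, its parse method, the tab as unit separator, and the example tags are all hypothetical and not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TaggedLineParser {

    /** Parallel token and tag arrays for one sentence. */
    static final class TaggedSentence {
        final String[] tokens;
        final String[] tags;

        TaggedSentence(String[] tokens, String[] tags) {
            this.tokens = tokens;
            this.tags = tags;
        }
    }

    // Hypothetical parser: splits the line on unitSeparator, then splits each
    // unit on the LAST occurrence of tagSeparator, so the token itself may
    // contain whitespace or even the tag separator.
    static TaggedSentence parse(String line, String unitSeparator, String tagSeparator) {
        List<String> tokens = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        for (String unit : line.split(Pattern.quote(unitSeparator))) {
            if (unit.isEmpty()) {
                continue; // tolerate repeated separators
            }
            int idx = unit.lastIndexOf(tagSeparator);
            // Reject units with no separator, an empty token, or an empty tag.
            if (idx <= 0 || idx + tagSeparator.length() >= unit.length()) {
                throw new IllegalArgumentException("Missing token or tag in unit: " + unit);
            }
            tokens.add(unit.substring(0, idx));
            tags.add(unit.substring(idx + tagSeparator.length()));
        }
        return new TaggedSentence(tokens.toArray(new String[0]), tags.toArray(new String[0]));
    }

    public static void main(String[] args) {
        // Assumed input convention: tab between lexical units, "_" before the tag.
        String line = "le_DET\tfeu rouge_NC\test_V\tvert_ADJ";
        TaggedSentence s = parse(line, "\t", "_");
        System.out.println(Arrays.toString(s.tokens)); // prints [le, feu rouge, est, vert]
        System.out.println(Arrays.toString(s.tags));   // prints [DET, NC, V, ADJ]
    }
}
```

The resulting arrays have the same shape as what the API expects (the words of a sentence as an array of strings), so the command line and the API would see identical lexical units.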

What do you think ?

/Nicolas

[1]
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api
