On 06/27/2012 03:07 PM, Nicolas Hernandez wrote:
On Wed, Jun 27, 2012 at 10:03 AM, Jörn Kottmann <[email protected]> wrote:


That would move the responsibility to detect multi-word-units
to the POS Tagger.

A simple transformation step could convert the output to the
non-multi-word-tags format.

This has the advantage that a user can detect a multi-word unit.
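For example, such a transformation step could look like this (just a
sketch; repeating the tag on each word of the unit is only one
possible scheme, not a fixed convention):

public class MweFlattenExample {

  public static void main(String[] args) {
    // Tagger output containing an MWE token
    String[] tokens = {"left", "in order to", "sleep"};
    String[] tags = {"VBD", "IN", "VB"};

    // Flatten to per-word tags by repeating the tag of the unit
    StringBuilder flat = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
      for (String word : tokens[i].split(" ")) {
        flat.append(word).append('/').append(tags[i]).append(' ');
      }
    }
    System.out.println(flat.toString().trim());
    // left/VBD in/IN order/IN to/IN sleep/VB
  }
}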
The "special pos tags" would be for the whole multi-word expressions
(MWE) or for each word of the MWE ?

Anyway, in my opinion the POS tagger trainer should not be aware of
that, and we must not force users to change their tagset to use the
OpenNLP tools.
If the data is annotated with POS tags which carry MWE information,
then those will be handled like any tags over simple words.

Yes, that should just work with our current way of handling the
data. With the current implementation a user even has the option to
customize the POS Tagger to handle it differently; including a custom
MWE detector model in the POS model package would also be possible.

That should be fine as it is.

The CLI should offer a way to specify what counts as a multi-word
expression in the data.
This can be done with a parameter that sets the token separator
character.

Models built from the CLI or the API should be the same.
One way to do that is to use a parameter to set the multi-word
separator character and to turn this separator character into
whitespace before training the model.
For example, with " " as the token separator character, "_" as the
multi-word separator character, and "/" as the POS tag separator, the
following sentence:
Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
sleep/VB longer/RB
should be turned into
String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
"earlier", "in order to", "sleep", "longer"};
(note "in order to")
What do you think about this?


I think that could be problematic if your training or test data
contains the multi-word separator character. In that case you might
treat something as a multi-word unit which should not be one.
What do you think about using SGML style tags as we do in the NER training
format?
For example: <MW>in order to</MW>/IN.
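For illustration only, such spans could be extracted with something
like this (the format itself is just a proposal here):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MweTagExample {

  public static void main(String[] args) {
    // Match <MW>...</MW>/TAG spans in the proposed format
    Pattern mw = Pattern.compile("<MW>(.+?)</MW>/(\\S+)");
    Matcher m = mw.matcher("He/PRP left/VBD <MW>in order to</MW>/IN sleep/VB");
    while (m.find()) {
      System.out.println(m.group(1) + " -> " + m.group(2)); // in order to -> IN
    }
  }
}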
I do not like mixing annotation systems: either everything in XML
(<MW pos="IN">in order to</MW>) or nothing.
The multi-word separator can be a string rather than a single
character (e.g. "_##_", which is quite rare). The point is that the
user should be informed about the problem you mention, and since it
is up to him to set the string by parameter, he can make an informed
choice.
If you mix both annotation systems, then the ambiguity problem also
remains for the start tag and end tag you use.

Would you prefer dealing with multi-word units at the tokenizer
level or at the POS Tagger level?
Or do we need support for both?

I think we agree that whatever analyzers we use (POS, Chunk, Parser,
NER, ...), all should be built on data that is word-tokenized in the
same way.

If a token can be an MWE, then we need to fix all our formats to
support the MWE separator char sequence. Currently we use the
whitespace tokenizer in most places to process our training data.
We should change that and use one which is sensitive to MWEs
separated by a char sequence, as sketched below.
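For illustration, a minimal sketch of such a tokenizer, wrapping the
existing WhitespaceTokenizer (the separator here is an assumption and
would in practice come from a parameter):

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class MweAwareTokenizer {

  // assumed MWE separator, would be configurable
  private static final String MWE_SEPARATOR = "_";

  public static String[] tokenize(String sentence) {
    // Tokenize on whitespace first, then restore spaces inside MWEs,
    // so "in_order_to" becomes the single token "in order to".
    String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
    for (int i = 0; i < tokens.length; i++) {
      tokens[i] = tokens[i].replace(MWE_SEPARATOR, " ");
    }
    return tokens;
  }
}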

We should specify a default MWE separator to be used when
serializing data with MWEs, and offer an option to override it.

Personally I do not use the OpenNLP word tokenizer. Actually I did
use it, but I also use dictionaries and regular expressions, which
lead me to a richer concept of "word" (what is called a "lexical
unit"). I take these as input to my processing chain.

And I also use UIMA, which processes annotations. If an annotation
stands for a "lexical unit", it can be an MWE or a simple word; this
is transparent for the user.

So I would like the OpenNLP POS tagger/trainer CLI to offer me a way
to build models I can use with UIMA without pre- or post-processing.

In my opinion, an OpenNLP labeller/trainer should offer users the
possibility of adapting its input/output to their data, and not the
opposite.

Yes, I see this in the same way.

Would you mind opening a Jira issue to request MWE support in our
training formats?

Jörn
