Re: Lemmatizer BUG

Rodrigo Agerri Mon, 05 Dec 2016 06:54:22 -0800

Hello,

The javadoc says that the implementation of the statistical lemmatizer is
based on:


http://grzegorz.chrupala.me/papers/phd-single.pdf

Check Chapter 6.

This paper summarizes greatly that chapter

http://grzegorz.chrupala.me/papers/chrupala-etal-2008a/paper.pdf

To cut a long story short, the statistical lemmatizer does not learn the
lemmas themselves, but the automatically induced classes obtained from
calculating how many permutations are required to go from the word form to
the lemma. This is because it is much easier to generalize (e.g., many
word-lemma pairs are captured by the same permutation class) to learn over
those permutation classes than on the lemmas themselves.

HTH,

Rodrigo


On Mon, Dec 5, 2016 at 3:40 PM, Damiano Porta <damianopo...@gmail.com>
wrote:

> Hello Rodrigo!
> Thank you so much! It works perfectly... but, what is the reason behind the
> use of the permuations? Why can we not have the lemma directly?
>
> Thanks for the clarification
> Damiano
>
>
> 2016-12-05 12:12 GMT+01:00 Rodrigo Agerri <rage...@apache.org>:
>
> > Hello,
> >
> > The String[] lemmatize(String[] toks, String[] tags) method will give you
> > predicted "lemma class" which consists of the number of permutations
> > required to go from the word form to the lemma.
> >
> > If the output is O that means that no permutation is required, namely,
> the
> > lemma and the word form are considered to be the same string. The last
> item
> > in the array is for iniziata, and the class means "replace the letter t
> in
> > position 1 with r; replace letter a with letter e in position 0",
> resulting
> > in "iniziare". The word form and lemma strings are reversed for
> comparison.
> > I am assuming that you added the asterisks...
> >
> > Once you have that lemma class prediction array, you need to apply the
> > String[] decodeLemmas(String[] toks, String[] preds) in the same
> > LemmatizerME class, which as the javacode states, it requires the arrays
> of
> > tokens and predicted lemma classes, to perform the decoding (apply the
> > permutations) and output the actual lemma (iniziare in your example).
> >
> > Cheers,
> >
> > Rodrigo
> >
> > On Mon, Dec 5, 2016 at 11:19 AM, Damiano Porta <damianopo...@gmail.com>
> > wrote:
> >
> > > Hello,
> > > I am doing some tests with the lemmatizerME.
> > > It is returning a wrong word, a word that never occurs in the training
> > > data. Basically it is NOT an italian word :)
> > >
> > > The output is:
> > >
> > > [O, O, O, O, *R1trR0ae*]
> > >
> > > The code:
> > >
> > >         try (InputStream in = new
> > > FileInputStream("/home/damiano/lemmas.bin")) {
> > >             LemmatizerModel lemmatizerModel = new LemmatizerModel(in);
> > >
> > >             LemmatizerME lem = new LemmatizerME(lemmatizerModel);
> > >
> > >             String[] tokens = new String[] {
> > >                 "ultimo", "capitolo", "della", "saga", "iniziata"
> > >             };
> > >
> > >             String[] pos = new String[] {
> > >                 "As", "Ss", "EA", "Ss", "Vp"
> > >             };
> > >
> > >             System.out.println(Arrays.toString(lem.lemmatize(tokens,
> > > pos)));
> > >         }
> > >
> > > How can i analyze what happened?
> > >
> > > Thanks
> > > Damiano
> > >
> >
>

Re: Lemmatizer BUG

Reply via email to