Limitations of StempelStemmer

Maciej Gawinecki Mon, 09 Sep 2019 04:02:37 -0700

Hi,

I have just checked out the latest version of Lucene from Git master branch.


I have tried to stem a few words using StempelStemmer for Polish.
However, it looks like it cannot handle some words properly, e.g.

joyce -> ąć
wielce -> ąć
piwko -> ąć
royce -> ąć
pip -> ąć
xyz -> xyz

1. I am surprised it cannot handle Polish words like wielce, piwko and
royce. Is this a limitation of the stemming algorithm, of the data the
algorithm was trained on, or something else? Knowing which would help
improve the situation. How can I improve that behaviour?
2. I am surprised that for non-Polish words it returns "ąć". I would
expect that for words it has not been trained on it would return their
original forms, as happens, for instance, when stemming words like
"xyz".

With kind regards,
Maciej Gawinecki

Here's a minimal example to reproduce the issue:

package org.apache.lucene.analysis;

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

  public static void main(String[] args) throws Exception {
    InputStream stemmerTable = ClassLoader.getSystemClassLoader()
        .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
    StempelStemmer stemmer = new StempelStemmer(stemmerTable);
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    for (String word : words) {
      // Stem each word in turn and print the input next to its stem.
      System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));
    }

  }

}
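
Regarding point 2, one possible stopgap is to wrap the stemmer and fall back to the original word whenever the returned stem looks implausible. The sketch below is only an illustration under stated assumptions: GuardedStemmer and stemOrOriginal are hypothetical names, not part of Lucene, and the "stem must start with the same character as the input" rule is a crude heuristic that may also reject legitimate stems for irregular forms. The stemmer is passed in as a plain function so the example runs without Lucene on the classpath; with the real class one could pass something like w -> String.valueOf(stemmer.stem(w)).

```java
import java.util.function.UnaryOperator;

public class GuardedStemmer {

  // Hypothetical helper, not part of Lucene: keep the stem only if it starts
  // with the same character as the input; otherwise return the word unchanged.
  public static String stemOrOriginal(String word, UnaryOperator<String> stemmer) {
    String stem = stemmer.apply(word);
    if (stem == null || stem.isEmpty() || word.isEmpty()) {
      return word;
    }
    // A stem that shares no leading character with its input is treated as suspect.
    return stem.charAt(0) == word.charAt(0) ? stem : word;
  }

  public static void main(String[] args) {
    // Stand-in stemmer that mimics the behaviour reported above: every word
    // except "xyz" maps to the same two-letter stem (written with escapes).
    UnaryOperator<String> fake = w -> w.equals("xyz") ? w : "\u0105\u0107";
    for (String word : new String[] {"piwko", "pip", "xyz"}) {
      System.out.println(word + " -> " + stemOrOriginal(word, fake));
    }
  }
}
```

With the stand-in stemmer, each of the three words is printed unchanged, because the suspicious stem is discarded. This only hides the symptom; it does not answer whether the cause lies in the algorithm or in the training table.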
