Hi, I have just checked out the latest version of Lucene from the Git master branch.
I have tried to stem a few words using StempelStemmer for Polish. However, it looks like it cannot handle some words properly, e.g.

    joyce  -> ąć
    wielce -> ąć
    piwko  -> ąć
    royce  -> ąć
    pip    -> ąć
    xyz    -> xyz

1. I am surprised it cannot handle Polish words like wielce, piwko and royce. Is this a limitation of the stemming algorithm, of the training of the algorithm, or something else? If it is the training, that would at least suggest a way to improve the situation. How can I improve that behaviour?

2. I am surprised that for non-Polish words it returns "ąć". I would expect that for words it has not been trained on it would return their original forms, as happens, for instance, when stemming words like "xyz".

With kind regards,
Maciej Gawinecki

Here is a minimal example to reproduce the issue:

    package org.apache.lucene.analysis;

    import java.io.InputStream;
    import org.apache.lucene.analysis.stempel.StempelStemmer;

    public class Try {

      public static void main(String[] args) throws Exception {
        // Load the Polish stemmer table from the classpath.
        InputStream stemmerTable = ClassLoader.getSystemClassLoader()
            .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
        StempelStemmer stemmer = new StempelStemmer(stemmerTable);
        String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
        for (String word : words) {
          // Stem each word in turn and print "word -> stem".
          System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));
        }
      }
    }
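
As a possible stop-gap for point 2, here is a minimal sketch of a guard I could put around the stemmer on my side. It is only an illustration of the behaviour I would expect, not a fix for Stempel itself. The class GuardedStemmer, the method stemOrOriginal and the "shared prefix" heuristic are my own assumptions and are not part of Lucene; the stemmer is passed in as a plain function, so the stand-in used in main only imitates the outputs reported above.

```java
import java.util.function.Function;

// Hypothetical helper, not part of Lucene: keeps the original word when the
// stem looks unrelated to it (shares too short a prefix with the input).
final class GuardedStemmer {

  static String stemOrOriginal(String word,
                               Function<String, String> stemmer,
                               int minSharedPrefix) {
    String stem = stemmer.apply(word);
    if (stem == null || stem.isEmpty()) {
      return word; // nothing usable came back, keep the input
    }
    // Count how many leading characters the word and the stem have in common.
    int limit = Math.min(word.length(), stem.length());
    int shared = 0;
    while (shared < limit && word.charAt(shared) == stem.charAt(shared)) {
      shared++;
    }
    // Assumption: a plausible stem starts like the original word.
    return shared >= minSharedPrefix ? stem : word;
  }

  public static void main(String[] args) {
    // Stand-in for the real stemmer; "\u0105\u0107" is the string "ąć".
    Function<String, String> fake =
        w -> w.equals("xyz") ? "xyz" : "\u0105\u0107";
    for (String w : new String[] {"joyce", "piwko", "xyz"}) {
      System.out.println(w + " -> " + stemOrOriginal(w, fake, 1));
    }
  }
}
```

With the real stemmer, the function argument would wrap the stem call and convert its result to a String. The obvious drawback is that such a heuristic would also reject legitimate stems whose first letters differ from the input, so I would still prefer to understand the root cause.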