Hi,
I need to index and stem Croatian, Macedonian, Serbian
and Slovenian content. I started by creating a collection _hr_ for the
Croatian content and configured the HunspellStemFilterFactory with the
.dic and .aff files provided by OpenOffice. While testing my
configuration I noticed that only very simple forms, such as
hrvatski -> hrvatska,
algoritamskom -> algoritamska
get "stemmed". Are there better approaches for
Croatian content? I haven't tested the .dic and .aff files for the other
languages yet, but I would expect similar results.
I am using Solr 4.1.
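For reference, a setup of this kind in schema.xml looks roughly like the
sketch below. The field type name text_hr, the file names hr_HR.dic and
hr_HR.aff, and the tokenizer and lowercase filter are assumptions for
illustration, not the exact values from my schema.

```xml
<!-- Sketch only: text_hr, hr_HR.dic and hr_HR.aff are placeholder names -->
<fieldType name="text_hr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizer and lowercasing are assumed; adjust to the real analysis chain -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Hunspell stemming driven by the OpenOffice dictionary and affix files -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="hr_HR.dic"
            affix="hr_HR.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```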
Any pointers to better stemmers, open source or commercial, are much
appreciated.
Many thanks,
Alex