[jira] [Commented] (LUCENE-4311) HunspellStemFilter returns another values than Hunspell in console / command line with same dictionaries.

Lukas Vlcek (JIRA) Tue, 16 Jul 2013 10:46:00 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709978#comment-13709978
 ]


Lukas Vlcek commented on LUCENE-4311:
-------------------------------------

Hi Chris,

I have been experimenting with this Czech dictionary, and it seems to me that 
it yields the best results with RECURSION_CAP = 0. Seriously! Double folding 
brings no advantage for this particular dictionary. In fact, the dictionary is 
in such good shape that all word forms can be generated directly from the 
words in the .dic file, so applying a single affix rule to an input word is 
enough to check whether it matches any of the root forms; no folding is needed 
at all.

With RECURSION_CAP set to 1 or 2, it can generate a lot of incorrect words. 
The shorter the input word, the higher the chance of getting incorrect (i.e. 
completely misleading) results, up to the point where the output is not useful 
for Lucene indexing at all.
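
To illustrate the effect, here is a toy sketch in Java. It is not the Lucene 
HunspellStemmer code: the class name, the root list, and the suffix list are 
all made up for illustration (they are not taken from cs_CZ.dic or cs_CZ.aff), 
and real Hunspell also checks affix flags and conditions, which this sketch 
ignores. It only shows how allowing more levels of affix stripping can add a 
misleading stem for a short word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class RecursionCapSketch {
    // Hypothetical toy data, not taken from any real dictionary.
    static final Set<String> ROOTS = new HashSet<>(Arrays.asList("care", "car"));
    static final String[] SUFFIXES = {"s", "e"};

    // Collects every root reachable by stripping suffixes from the word.
    // recursionCap = 0 means exactly one suffix may be stripped.
    static Set<String> stem(String word, int recursionCap) {
        Set<String> stems = new LinkedHashSet<>();
        strip(word, 0, recursionCap, stems);
        return stems;
    }

    private static void strip(String word, int depth, int cap, Set<String> stems) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (ROOTS.contains(candidate)) {
                    stems.add(candidate);
                }
                // Strip a further suffix only while the cap allows it.
                if (depth < cap) {
                    strip(candidate, depth + 1, cap, stems);
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(stem("cares", 0)); // [care]
        System.out.println(stem("cares", 1)); // [care, car]
    }
}
```

With the cap at 0 only the expected root comes back; with the cap at 1 a 
second suffix is stripped and an unrelated root shows up as well.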

Please, can we have this fixed? I believe all that is needed now is to have a 
look at #LUCENE-4542 and make sure the recursion level is configurable. This 
would be a really great enhancement.
                
> HunspellStemFilter returns another values than Hunspell in console / command 
> line with same dictionaries.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4311
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4311
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 3.5, 4.0-ALPHA, 3.6.1
>         Environment: Apache Solr 3.5 - 4.0, Apache Tomcat 7.0
>            Reporter: Jan Rieger
>         Attachments: cs_CZ.aff, cs_CZ.dic
>
>
> When I used HunspellStemFilter to stem Czech text, it returned bad 
> results.
> For example, the word "praha" returns "praha" and "prahnout", which is not 
> correct.
> So I tried the same in my console (Hunspell command line) with exactly the 
> same dictionaries, and it returned only "praha", which is correct.
> Can somebody help me?

