[jira] [Commented] (LUCENE-5030) FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work correctly for 1-byte (like English) and multi-byte (non-Latin) letters

Michael McCandless (JIRA) Tue, 16 Jul 2013 04:57:11 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709695#comment-13709695 ]


Michael McCandless commented on LUCENE-5030:
--------------------------------------------

Sorry for the long delay here ...

Just to verify: there is no point to passing FUZZY_UNICODE_AWARE to 
AnalyzingSuggester, right?

In which case, I think the AnalyzingLookupFactory should not be changed?

But, furthermore, I think we can isolate the changes to FuzzySuggester?  E.g., 
move the FUZZY_UNICODE_AWARE flag down to FuzzySuggester, fix its ctor to strip 
that option when calling super(), move isFuzzyUnicodeAware down as well, 
and then override toLookupAutomaton to do the UTF-8 conversion + determinize?

This way it's not even possible to send the fuzzy flag to AnalyzingSuggester.
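
A rough sketch of the shape of that refactoring, with stand-in classes rather 
than the real Lucene ones: BaseSuggester, UnicodeAwareFuzzySuggester, the flag 
values and the String-returning toLookupAutomaton are all hypothetical 
placeholders (the real method works on automata, not strings). It only shows 
the subclass owning the flag and masking it out before calling super():

```java
// Hypothetical stand-in for the parent suggester; NOT the Lucene class.
class BaseSuggester {
    static final int EXACT_FIRST = 1;   // assumed flag values, for illustration only
    static final int PRESERVE_SEP = 2;

    final int options;

    BaseSuggester(int options) {
        // The parent rejects any bit it does not know about.
        if ((options & ~(EXACT_FIRST | PRESERVE_SEP)) != 0) {
            throw new IllegalArgumentException("unknown option bits: " + options);
        }
        this.options = options;
    }

    // Placeholder: a real implementation would return an automaton.
    String toLookupAutomaton(String key) {
        return "bytes(" + key + ")";
    }
}

// Hypothetical stand-in for the fuzzy subclass that owns the extra flag.
class UnicodeAwareFuzzySuggester extends BaseSuggester {
    static final int FUZZY_UNICODE_AWARE = 4;

    private final boolean unicodeAware;

    UnicodeAwareFuzzySuggester(int options) {
        // Strip the subclass-only bit so the parent never sees it.
        super(options & ~FUZZY_UNICODE_AWARE);
        this.unicodeAware = (options & FUZZY_UNICODE_AWARE) != 0;
    }

    boolean isFuzzyUnicodeAware() {
        return unicodeAware;
    }

    @Override
    String toLookupAutomaton(String key) {
        if (!unicodeAware) {
            return super.toLookupAutomaton(key);
        }
        // Placeholder for: build in code point space, convert to UTF-8, determinize.
        return "determinize(utf8(codePoints(" + key + ")))";
    }
}

class FlagStripDemo {
    public static void main(String[] args) {
        UnicodeAwareFuzzySuggester s = new UnicodeAwareFuzzySuggester(
            BaseSuggester.EXACT_FIRST | UnicodeAwareFuzzySuggester.FUZZY_UNICODE_AWARE);
        System.out.println(s.options);               // parent only stored EXACT_FIRST
        System.out.println(s.isFuzzyUnicodeAware());
        System.out.println(s.toLookupAutomaton("abc"));
    }
}
```

With this layout, passing the fuzzy bit straight to the parent constructor 
fails fast instead of being silently ignored.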
                
> FuzzySuggester has to operate FSTs of Unicode-letters, not UTF-8, to work 
> correctly for 1-byte (like English) and multi-byte (non-Latin) letters
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5030
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.3
>            Reporter: Artem Lukanin
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.4
>
>         Attachments: benchmark-INFO_SEP.txt, benchmark-old.txt, 
> benchmark-wo_convertion.txt, LUCENE-5030.patch, LUCENE-5030.patch, 
> LUCENE-5030.patch, LUCENE-5030.patch, nonlatin_fuzzySuggester1.patch, 
> nonlatin_fuzzySuggester2.patch, nonlatin_fuzzySuggester3.patch, 
> nonlatin_fuzzySuggester4.patch, nonlatin_fuzzySuggester_combo1.patch, 
> nonlatin_fuzzySuggester_combo2.patch, nonlatin_fuzzySuggester_combo.patch, 
> nonlatin_fuzzySuggester.patch, nonlatin_fuzzySuggester.patch, 
> nonlatin_fuzzySuggester.patch, run-suggest-benchmark.patch
>
>
> There is a limitation in the current FuzzySuggester implementation: it 
> computes edits in UTF-8 space instead of Unicode character (code point) 
> space. 
> This should be fixable: we'd need to fix TokenStreamToAutomaton to work in 
> Unicode character space, then fix FuzzySuggester to do the same steps that 
> FuzzyQuery does: do the LevN expansion in Unicode character space, then 
> convert that automaton to UTF-8, then intersect with the suggest FST.
> See the discussion here: 
> http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none
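
To illustrate the limitation in the issue description, here is a small 
self-contained sketch (plain Java, no Lucene classes; EditSpaceDemo and its 
helper names are made up for this example). It computes a standard 
Levenshtein distance twice, once over Unicode code points and once over UTF-8 
bytes, to show why one dropped letter costs one edit in English but two in 
Russian when edits are counted in UTF-8 space:

```java
import java.nio.charset.StandardCharsets;

class EditSpaceDemo {
    // Classic two-row Levenshtein distance over arbitrary int symbols.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length];
    }

    // Symbols = Unicode code points.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // Symbols = UTF-8 bytes (as unsigned values).
    static int[] utf8Units(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        String en1 = "cat", en2 = "ct";
        // Cyrillic "kot" and "kt", written as escapes to avoid source-encoding issues.
        String ru1 = "\u043A\u043E\u0442", ru2 = "\u043A\u0442";

        System.out.println(levenshtein(codePoints(en1), codePoints(en2))); // 1
        System.out.println(levenshtein(utf8Units(en1), utf8Units(en2)));   // 1
        System.out.println(levenshtein(codePoints(ru1), codePoints(ru2))); // 1
        System.out.println(levenshtein(utf8Units(ru1), utf8Units(ru2)));   // 2
    }
}
```

Each Cyrillic letter here takes two bytes in UTF-8, so a single-letter 
deletion already uses up two byte-level edits.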
