[
https://issues.apache.org/jira/browse/LUCENE-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Samuel García Martínez updated LUCENE-4793:
-------------------------------------------
Description:
Debugging Solr spellchecker (IndexBasedSpellchecker, delegating on lucene
Spellchecker) behaviour i think i found a bug when the input is a 6 letter word:
- george
- anthem
- argued
- fluent
Due to the getMin() and getMax() the grams indexed for these terms are 3 and 4.
So, the fields would be something like this:
- for "george"
- start3: "geo"
- start4: "geor"
- end3: "rge"
- end4: "orge"
- 3: "geo", "eor", "org", "rge"
- 4: "geor", "eorg", "orge"
- for "anthem"
- start3: "ant"
- start4: "anth"
- end3: "tem"
- end4: "them"
The problem shows up when the user swap 3rd a 4th characters, misspelling the
word like this:
- geroge
- anhtem
The queries generated for this terms are: (SHOULD boolean queries)
- for "geroge"
- start3: "ger"
- start4: "gero"
- end3: "oge"
- end4: "roge"
- 3: "ger", "ero", "rog", "oge"
- 4: "gero", "erog", "roge"
- for "anhtem"
- start3: "anh"
- start4: "anht"
- end3: "tem"
- end4: "htem"
- 3: "anh", "nht", "hte", "tem"
- 4: "anht", "nhte", "htem"
So, as you can see, this kind of misspelling never matches the suitable
suggestions although the edit distance is 0.95555556.
I think getMin(int l) and getMax(int l) should return 2 and 3, respectively,
for l==6. Debugging other values i did not found any problem with any kind of
misspelling.
> Spellchecker don't find suggestion for concrete misspelled 6 letter words
> -------------------------------------------------------------------------
>
> Key: LUCENE-4793
> URL: https://issues.apache.org/jira/browse/LUCENE-4793
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/spellchecker
> Affects Versions: 3.6, 4.0, 4.1
> Reporter: Samuel García Martínez
> Priority: Minor
>
> Debugging Solr spellchecker (IndexBasedSpellchecker, delegating on lucene
> Spellchecker) behaviour i think i found a bug when the input is a 6 letter
> word:
> - george
> - anthem
> - argued
> - fluent
> Due to the getMin() and getMax() the grams indexed for these terms are 3 and
> 4. So, the fields would be something like this:
> - for "george"
> - start3: "geo"
> - start4: "geor"
> - end3: "rge"
> - end4: "orge"
> - 3: "geo", "eor", "org", "rge"
> - 4: "geor", "eorg", "orge"
> - for "anthem"
> - start3: "ant"
> - start4: "anth"
> - end3: "tem"
> - end4: "them"
> The problem shows up when the user swap 3rd a 4th characters, misspelling the
> word like this:
> - geroge
> - anhtem
> The queries generated for this terms are: (SHOULD boolean queries)
> - for "geroge"
> - start3: "ger"
> - start4: "gero"
> - end3: "oge"
> - end4: "roge"
> - 3: "ger", "ero", "rog", "oge"
> - 4: "gero", "erog", "roge"
> - for "anhtem"
> - start3: "anh"
> - start4: "anht"
> - end3: "tem"
> - end4: "htem"
> - 3: "anh", "nht", "hte", "tem"
> - 4: "anht", "nhte", "htem"
> So, as you can see, this kind of misspelling never matches the suitable
> suggestions although the edit distance is 0.95555556.
> I think getMin(int l) and getMax(int l) should return 2 and 3, respectively,
> for l==6. Debugging other values i did not found any problem with any kind of
> misspelling.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]