[
https://issues.apache.org/jira/browse/LUCENE-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated LUCENE-6809:
-----------------------------------
Labels: pull-request-available (was: )
> DictionaryCompoundWordTokenFilter should respect minSubwordSize also for
> fragments
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-6809
> URL: https://issues.apache.org/jira/browse/LUCENE-6809
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 5.3, 6.0
> Reporter: Christian Winkler
> Priority: Major
> Labels: pull-request-available
> Attachments: LUCENE-6809.diff
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{DictionaryCompoundWordTokenFilter}} is very useful for building German
> search indices. However, it can also introduce ambiguities, because it may
> extract subwords that have a completely different meaning. This happens most
> often when the remaining parts of the word are very short.
> Example: {{schwein}} (German for pig) contains {{wein}} (German for wine).
> Even if {{minSubwordSize}} is set to {{4}}, {{wein}} is extracted, even though
> the remaining fragment {{sch}} is shorter than 4 characters.
> We could solve this by requiring every part of the word (at most 3 parts) to
> be in the dictionary, but this creates problems with compound words made up
> of more than three nouns.
> Therefore, we have built an alternative solution in which {{minSubwordSize}}
> is also applied to the remaining fragments. We have tested this on several
> (large) customer indices, and it works much better than before.
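
The behaviour described above can be illustrated with a small standalone sketch. It does not use Lucene and is not the attached LUCENE-6809.diff; the class SubwordSketch, the method decompose, and the flag checkFragments are hypothetical names, and the rule shown (every non-empty fragment before or after a dictionary match must be at least minSubwordSize long) is only one plausible reading of the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Lucene code and not the attached patch.
class SubwordSketch {

    // Returns every dictionary subword of `word` whose length lies between
    // minSubwordSize and maxSubwordSize. If checkFragments is true, a match is
    // dropped when the text left before or after it is non-empty but shorter
    // than minSubwordSize (assumed interpretation of the proposal).
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean checkFragments) {
        List<String> subwords = new ArrayList<>();
        int len = word.length();
        for (int start = 0; start <= len - minSubwordSize; start++) {
            for (int size = minSubwordSize;
                 size <= maxSubwordSize && start + size <= len; size++) {
                String candidate = word.substring(start, start + size);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                int leading = start;                 // characters before the match
                int trailing = len - (start + size); // characters after the match
                if (checkFragments
                        && (tooShort(leading, minSubwordSize)
                            || tooShort(trailing, minSubwordSize))) {
                    continue;
                }
                subwords.add(candidate);
            }
        }
        return subwords;
    }

    // An empty fragment is fine; a non-empty one must reach the minimum size.
    static boolean tooShort(int fragmentLength, int minSubwordSize) {
        return fragmentLength > 0 && fragmentLength < minSubwordSize;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("wein", "glas"));
        // Without the fragment check, "wein" is extracted from "schwein".
        System.out.println(decompose("schwein", dictionary, 4, 15, false));  // [wein]
        // With the check, the leading fragment "sch" (3 chars) blocks the match.
        System.out.println(decompose("schwein", dictionary, 4, 15, true));   // []
        // A genuine compound is still split, since both parts are long enough.
        System.out.println(decompose("weinglas", dictionary, 4, 15, true));  // [wein, glas]
    }
}
```

In this sketch the check is applied per match, so compounds of any number of nouns still decompose as long as each leftover piece reaches the minimum size.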
--
This message was sent by Atlassian Jira
(v8.20.10#820010)