Re: char_filter for German

joergpra...@gmail.com Thu, 12 Mar 2015 08:29:45 -0700

Yes, please upgrade Elasticsearch to use the official German normalizer.

I added it to the decompound plugin for convenience; it may be removed at any
later time.
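
A minimal sketch of what the "unstemmed" analyzer from the settings quoted
further down in this thread might look like after such an upgrade. It assumes
the installed version ships the core "german_normalization" token filter and
that the ICU plugin is still installed. It is written as a Python dict purely
for illustration; only the last filter name differs from the quoted settings.

```python
import json

# Assumption: "german_normalization" is available in ES core after the upgrade,
# and the ICU plugin provides "icu_normalizer" and "icu_folding".
unstemmed_settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "unstemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalization",  # core filter instead of the plugin's "german_normalize"
                    ],
                }
            }
        }
    }
}

# Print the JSON body that would go into the index settings.
print(json.dumps(unstemmed_settings, indent=2))
```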


Jörg

On Wed, Mar 11, 2015 at 9:54 PM, Krešimir Slugan <kresimir.slu...@gmail.com>
wrote:

> Thanks!
>
> I assume that "german_normalize" is also part of the Decompounder Analysis
> Plugin ( https://github.com/jprante/elasticsearch-analysis-decompound ),
> since that is the only analysis plugin we have installed?
>
> Btw. "german_normalization" doesn't seem to be available for our ES
> version (1.2). Would you recommend upgrading instead of using
> "german_normalize"?
>
> Best,
>
> Kresimir
>
> On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
>>
>> Use "german_normalization"
>>
>> "german_normalize" is the same filter I implemented in my plugin
>> https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
>> when it was not available in ES core.
>>
>> Jörg
>>
>> On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <kresimi...@gmail.com>
>> wrote:
>>
>>>
>>> Where is this "german_normalize" filter coming from? It solves my
>>> problem completely and magically, but it's not documented anywhere (and
>>> it doesn't seem to be part of the ICU plugin either).
>>>
>>>
>>>
>>> What is also weird is that the filter cannot be used in a global context,
>>> e.g. it's not possible to run something like this:
>>>
>>> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>>
>>> but it is possible to use it in an index context:
>>>
>>> curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>>
>>>
>>> In the first case I get:
>>>
>>> ElasticsearchIllegalArgumentException[failed to find global token filter under [german_normalize]]
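
The difference between the two requests above is only the URL path. A small
sketch that builds both forms, assuming the query-string style of the
_analyze API used in this thread (ES 1.x); the helper name analyze_url is
hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode


def analyze_url(host, tokenizer, filters, index=None):
    """Build an _analyze URL; index=None yields the global (index-less) form."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    # Keep the comma unescaped so the filter list reads as in the curl commands.
    query = urlencode(
        {"tokenizer": tokenizer, "filters": ",".join(filters)}, safe=","
    )
    return f"http://{host}{path}?{query}"


filters = ["lowercase", "german_normalize"]

# Global context: only filters registered globally can be resolved here.
global_url = analyze_url("localhost:9200", "whitespace", filters)

# Index context: filters known to the index "test_index" can be resolved.
index_url = analyze_url("localhost:9200", "whitespace", filters, index="test_index")

print(global_url)
print(index_url)
```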
>>>
>>>
>>> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>>>
>>>> Do not use regex; this will give wrong results.
>>>>
>>>> Elasticsearch comes with full support for German umlaut handling.
>>>>
>>>> If you install the ICU plugin, you can use an analysis setting like
>>>> this:
>>>>
>>>> {
>>>>     "index" : {
>>>>         "analysis" : {
>>>>             "filter" : {
>>>>                 "german_normalize_stem" : {
>>>>                   "type" : "snowball",
>>>>                   "name" : "German2"
>>>>                 }
>>>>             },
>>>>             "analyzer" : {
>>>>                 "stemmed" : {
>>>>                     "type" : "custom",
>>>>                     "tokenizer" : "standard",
>>>>                     "filter" : [
>>>>                         "lowercase",
>>>>                         "icu_normalizer",
>>>>                         "icu_folding",
>>>>                         "german_normalize_stem"
>>>>                     ]
>>>>                 },
>>>>                 "unstemmed" : {
>>>>                     "type" : "custom",
>>>>                     "tokenizer" : "standard",
>>>>                     "filter" : [
>>>>                         "lowercase",
>>>>                         "icu_normalizer",
>>>>                         "icu_folding",
>>>>                         "german_normalize"
>>>>                     ]
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> ICU handles German umlauts, and also case folding such as "ß" to "ss".
>>>>
>>>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
>>>> words.
>>>>
>>>> You can choose between stemmed and unstemmed analysis. Snowball tends
>>>> to overstem words. The "german_normalize" token filter is copied from
>>>> Snowball but works without stemming.
>>>>
>>>> The effect of the combination is that German words like Jörg, Joerg,
>>>> and Jorg are all reduced to "jorg" in the index.
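
To illustrate the effect described above, here is a rough pure-Python
approximation of this kind of normalization. It is not the actual filter
code. It assumes rules along the lines of: "ß" becomes "ss"; "ä", "ö", "ü"
lose their dots; "ae" and "oe" become "a" and "o"; "ue" becomes "u" unless it
follows a vowel or "q". The function name normalize_german is hypothetical.

```python
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}
VOWELS = set("aeiouyäöü")


def normalize_german(token):
    """Heuristically fold German umlaut spellings to one lowercase form."""
    token = token.lower()
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        nxt = token[i + 1] if i + 1 < len(token) else ""
        prev = token[i - 1] if i > 0 else ""
        if ch == "ß":
            out.append("ss")
        elif ch in UMLAUTS:
            out.append(UMLAUTS[ch])
        elif ch in ("a", "o") and nxt == "e":
            out.append(ch)
            i += 1  # drop the "e" of "ae" / "oe"
        elif ch == "u" and nxt == "e" and prev not in VOWELS and prev != "q":
            out.append("u")
            i += 1  # drop the "e" of "ue"
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print([normalize_german(w) for w in ("Jörg", "Joerg", "Jorg")])
# ['jorg', 'jorg', 'jorg']
```

Being a heuristic, it also rewrites "ae" or "oe" in words where they are not
umlaut spellings, which is one reason to test it against real data.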
>>>>
>>>> Best,
>>>>
>>>> Jörg
>>>>
>>>>
>>>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <kresimi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jürgen,
>>>>>
>>>>> Currently we don't have big volumes of data to index, so we would like
>>>>> to return more results in the hope that the relevant ones still show up
>>>>> at the top. In the future, when we have more data, we'll have to
>>>>> sacrifice some use cases in order to provide more precise results for
>>>>> the rest of the users.
>>>>>
>>>>> I think I will try a regexp-based approach to replace umlauts with their
>>>>> "e" forms to solve this double expansion problem.
>>>>>
>>>>> Best,
>>>>>
>>>>> Krešimir
>>>>>
>>>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT)
>>>>> wrote:
>>>>>>
>>>>>> Hi Krešimir,
>>>>>> the correct term is "über" (over, above) or "hören" (hear) or
>>>>>> "ändern" (change). When you cannot write umlauts, the correct
>>>>>> alternative spelling in print is "ueber", "hoeren", "aendern".
>>>>>> Everybody can write this in ASCII. However, people who do not speak
>>>>>> German but still want to search for German terms are usually not aware
>>>>>> of this and believe it's like with accents in French, where "á" is
>>>>>> lexically treated like "a". Those users are wrong in spelling "uber",
>>>>>> "horen", "andern", because "u" and "ü" are in fact different letters.
>>>>>> It's like "ll" in Spanish: "ll" is ONE letter :-)
>>>>>>
>>>>>> However, in order to provide a convenience to those users as well,
>>>>>> you could decide that - to yield at least some meaningful results - you
>>>>>> will also consider the versions without the umlaut dots equivalent. In
>>>>>> that case, you want to map any token containing an umlaut (ä, ö, ü) to
>>>>>> three alternatives: the umlaut form, the form without the umlaut
>>>>>> marker, and the alternative spelling with 'e'. This won't let you
>>>>>> distinguish between "Bar" (bar, the place to get a drink) and "Bär"
>>>>>> (bear, the one giving you a great, dangerous hug). "Forderung" (demand)
>>>>>> and "Förderung" (encouragement, facilitation, promotion, extraction
>>>>>> [geol.]) are also quite different, just to give a few examples.
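
The three-alternative mapping described above can be sketched as follows.
This is only an illustration of the idea, not how any particular
Elasticsearch filter works; the function name umlaut_variants is
hypothetical.

```python
UMLAUT_PLAIN = {"ä": "a", "ö": "o", "ü": "u"}
UMLAUT_E = {"ä": "ae", "ö": "oe", "ü": "ue"}


def umlaut_variants(token):
    """Return [umlaut form, form without dots, 'e' spelling] for a token.

    Tokens without an umlaut are returned unchanged as a single variant.
    """
    token = token.lower()
    if not any(ch in UMLAUT_PLAIN for ch in token):
        return [token]
    plain = "".join(UMLAUT_PLAIN.get(ch, ch) for ch in token)
    with_e = "".join(UMLAUT_E.get(ch, ch) for ch in token)
    return [token, plain, with_e]


print(umlaut_variants("Bär"))        # ['bär', 'bar', 'baer']
print(umlaut_variants("Förderung"))  # ['förderung', 'forderung', 'foerderung']
print(umlaut_variants("Bar"))        # ['bar']
```

Note how "Bär" and "Bar" end up sharing the variant "bar", which is exactly
the loss of distinction mentioned above.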
>>>>>>
>>>>>> For the proper recognition of those terms, you would normally use a
>>>>>> dictionary of German, including some frequent proper names as well.
>>>>>> So, if you look for "clown boll", you would not only get "Der Clown im
>>>>>> Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll,
>>>>>> Ansichten eines Clowns", because the query would be transformed into
>>>>>> "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut
>>>>>> candidate in your dictionary. If you dare to normalize your indexed
>>>>>> texts, so that "Boell" would already have been turned into "Böll", you
>>>>>> could even make do with a disjunction of only the one correct form and
>>>>>> the misspelling. Again, however, you would make use of a dictionary to
>>>>>> perform such normalization. Ideally, you would even have a POS tagger
>>>>>> in place, so you would only make such replacements where the name Böll
>>>>>> is referred to, not the city of Bad Boll.
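
A minimal sketch of the dictionary-driven query expansion described above.
The one-entry dictionary and the function names are hypothetical; a real
setup would use a proper German word list including frequent proper names.

```python
# Hypothetical dictionary: dot-less spelling -> known umlaut spelling.
UMLAUT_DICTIONARY = {"boll": "böll"}

E_SPELLING = {"ä": "ae", "ö": "oe", "ü": "ue"}


def expand_term(term):
    """Expand a term into an OR group if the dictionary knows an umlaut form."""
    term = term.lower()
    umlaut_form = UMLAUT_DICTIONARY.get(term)
    if umlaut_form is None:
        return term
    e_form = "".join(E_SPELLING.get(ch, ch) for ch in umlaut_form)
    return f"({term} OR {e_form} OR {umlaut_form})"


def expand_query(query):
    """Join all (possibly expanded) whitespace-separated terms with AND."""
    return " AND ".join(expand_term(t) for t in query.split())


print(expand_query("clown boll"))
# clown AND (boll OR boell OR böll)
```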
>>>>>>
>>>>>> It's a question of how much effort makes sense for your application.
>>>>>> If you just want to index some German text, maybe you just want to turn
>>>>>> all umlauts into the plain vowels for the purpose of indexing, but
>>>>>> still keep the reference to the original for result display. Maybe
>>>>>> that's sufficient. For larger volumes of documents, a more precise
>>>>>> approach is recommended to avoid false positives.
>>>>>>
>>>>>> Cheers,
>>>>>> --Jürgen
>>>>>>
>>>>>>
>>>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>>>
>>>>>> Because, as far as I understand, in German it's semantically the same
>>>>>> to write über or ueber (although ueber is less often used). I guess
>>>>>> personal names are the only exception.
>>>>>> Syntactically, "uber" is wrong, but users sometimes search for it
>>>>>> anyway.
>>>>>>
>>>>>>
>>>>>>
>>>>>>    --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>

