Yes, please upgrade Elasticsearch to use the official German normalizer, "german_normalization". I added "german_normalize" to the decompound plugin for convenience; it may be removed at any later time.
Jörg

On Wed, Mar 11, 2015 at 9:54 PM, Krešimir Slugan <kresimir.slu...@gmail.com> wrote:

> Thanks!
>
> I assume that "german_normalize" is also part of the Decompounder Analysis
> Plugin ( https://github.com/jprante/elasticsearch-analysis-decompound )
> since that is the only analysis plugin we have installed?
>
> Btw. "german_normalization" doesn't seem to be available for our ES
> version (1.2); would you recommend upgrading instead of using
> "german_normalize"?
>
> Best,
>
> Kresimir
>
> On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
>>
>> Use "german_normalization"
>>
>> "german_normalize" is the same filter I implemented in my plugin
>> https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
>> when it was not available in ES core.
>>
>> Jörg
>>
>> On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <kresimi...@gmail.com> wrote:
>>
>>> Where is this "german_normalize" filter coming from? It solves my
>>> problem completely and magically but it's not documented anywhere (and
>>> seems like it's not part of the ICU plugin either).
>>>
>>> What is also weird is that the filter cannot be used in the global context,
>>> e.g. it's not possible to try something like this:
>>>
>>> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>>
>>> but it is possible to use it in an index context:
>>>
>>> curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>>
>>> In the first case I get "ElasticsearchIllegalArgumentException[failed to
>>> find global token filter under [german_normalize]]"
>>>
>>> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>>>
>>>> Do not use regex, this will give wrong results.
>>>>
>>>> Elasticsearch comes with full support for German umlaut handling.
>>>>
>>>> If you install the ICU plugin, you can use something like this analysis
>>>> setting:
>>>>
>>>> {
>>>>   "index" : {
>>>>     "analysis" : {
>>>>       "filter" : {
>>>>         "german_normalize_stem" : {
>>>>           "type" : "snowball",
>>>>           "name" : "German2"
>>>>         }
>>>>       },
>>>>       "analyzer" : {
>>>>         "stemmed" : {
>>>>           "type" : "custom",
>>>>           "tokenizer" : "standard",
>>>>           "filter" : [
>>>>             "lowercase",
>>>>             "icu_normalizer",
>>>>             "icu_folding",
>>>>             "german_normalize_stem"
>>>>           ]
>>>>         },
>>>>         "unstemmed" : {
>>>>           "type" : "custom",
>>>>           "tokenizer" : "standard",
>>>>           "filter" : [
>>>>             "lowercase",
>>>>             "icu_normalizer",
>>>>             "icu_folding",
>>>>             "german_normalize"
>>>>           ]
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> ICU handles German umlauts, and also case folding like "ss" and "ß".
>>>>
>>>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
>>>> words.
>>>>
>>>> You can choose between stemmed and unstemmed analysis. Snowball tends
>>>> to overstem words. The "german_normalize" token filter is copied from
>>>> Snowball but works without stemming.
>>>>
>>>> The effect of the combination is that all German words like Jörg,
>>>> Joerg, Jorg are reduced to jorg in the index.
>>>>
>>>> Best,
>>>>
>>>> Jörg
>>>>
>>>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <kresimi...@gmail.com> wrote:
>>>>
>>>>> Hi Jürgen,
>>>>>
>>>>> Currently we don't have big volumes of data to index, so we would like
>>>>> to yield more results in the hope that the proper ones would still be
>>>>> shown at the top. In the future, when we have more data, we'll have to
>>>>> sacrifice some use cases in order to provide more precise results for
>>>>> the rest of the users.
>>>>>
>>>>> I think I will try the regexp token approach to replace umlauts with "e"
>>>>> forms to solve this double expansion problem.
>>>>>
>>>>> Best,
>>>>>
>>>>> Krešimir
>>>>>
>>>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) wrote:
>>>>>>
>>>>>> Hi Krešimir,
>>>>>> the correct term is "über" (over, above) or "hören" (hear) or
>>>>>> "ändern" (change). When you cannot write umlauts, the correct alternative
>>>>>> spelling in print is "ueber", "hoeren", "aendern". Everybody can write this
>>>>>> in ASCII. However, those who are possibly non-speakers of German who still
>>>>>> want to search for German terms are usually not aware of this and believe
>>>>>> it's like with accents in French, where "á" is lexically treated like "a".
>>>>>> Those users are wrong in spelling "uber", "horen", "andern" because "u" and
>>>>>> "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE
>>>>>> letter :-)
>>>>>>
>>>>>> However, in order to provide a convenience to those users as well,
>>>>>> you could decide that - to yield at least some meaningful results - you
>>>>>> will also consider the versions without the umlaut dots equivalent. In that
>>>>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three
>>>>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
>>>>>> This won't let you distinguish between the "Bar" (bar, the place to get a
>>>>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug).
>>>>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation,
>>>>>> promotion, extraction [geol.]) are also quite different, just to give a few
>>>>>> examples.
>>>>>>
>>>>>> For the proper recognition of those terms, you would normally use a
>>>>>> dictionary of German, including some frequent proper names as well.
>>>>>> So, if you look for "clown boll", you would not only get "Der Clown im
>>>>>> Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll,
>>>>>> Ansichten eines Clowns", because the query would be transformed into
>>>>>> "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut candidate
>>>>>> in your dictionary.
>>>>>> If you dare to normalize your indexed texts, so "Boell" would already have
>>>>>> been turned into "Böll", you could even do with a disjunction of only the
>>>>>> one correct form and the misspelling. Again, however, you would make use of
>>>>>> a dictionary to perform such normalization. Ideally, you would even have a
>>>>>> POS tagger in place, so you would only make such replacements where the
>>>>>> name Böll is referred to, not the city of Bad Boll.
>>>>>>
>>>>>> It's a question of how much effort makes sense for your application.
>>>>>> If you just want to index some German text, maybe you just want to turn all
>>>>>> umlauts into the plain vowels for the purpose of indexing, but still keep
>>>>>> the reference to the original for result display. Maybe that's sufficient.
>>>>>> For larger volumes of documents, a more precise approach is recommended to
>>>>>> avoid false positives.
>>>>>>
>>>>>> Cheers,
>>>>>> --Jürgen
>>>>>>
>>>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>>>
>>>>>> Because, as far as I understand, in German it's semantically the same
>>>>>> to write über or ueber (although ueber is less often used). I guess this is
>>>>>> not true only for personal names.
>>>>>> Syntactically, "uber" is wrong but users sometimes search for this also.
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
>>>>>> *i.A. Jürgen Wagner*
>>>>>> Head of Competence Center "Intelligence"
>>>>>> & Senior Cloud Consultant
>>>>>>
>>>>>> Devoteam GmbH, Industriestr.
3, 70565 Stuttgart, Germany
>>>>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
>>>>>> E-Mail: juergen...@devoteam.com, URL: www.devoteam.de
>>>>>> ------------------------------
>>>>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>>>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
>>>>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
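
To make the normalization discussed in this thread concrete, here is a minimal Python sketch. It is only an approximation under stated assumptions, not the code of the "german_normalize" or "german_normalization" filters: it folds ß to ss, strips umlaut dots, and collapses the "ae", "oe", "ue" spellings. The rule that "ue" is left alone after a vowel or "q" is an assumption about what "at the right places in words" means. The function names normalize_german and umlaut_variants are hypothetical, and the second function illustrates the three alternatives (umlaut, without umlaut marker, spelling with "e") described earlier in the thread.

```python
import re

# Translation table that strips the umlaut dots: ä -> a, ö -> o, ü -> u.
_STRIP_UMLAUTS = str.maketrans("äöü", "aou")

# Assumption: "ue" is only collapsed when it does NOT follow a vowel or "q",
# so words like "Quelle" or "Bauer" are left untouched.
_UE_PATTERN = re.compile(r"(?<![aeiouyäöüq])ue")


def normalize_german(token: str) -> str:
    """Rough approximation of German umlaut normalization (not the ES filter)."""
    s = token.lower()
    s = s.replace("ß", "ss")                      # sharp s -> ss
    s = s.replace("ae", "a").replace("oe", "o")   # ASCII spellings of ä, ö
    s = _UE_PATTERN.sub("u", s)                   # ASCII spelling of ü
    return s.translate(_STRIP_UMLAUTS)            # drop remaining umlaut dots


def umlaut_variants(token: str) -> set:
    """Return the three spellings of a token: umlaut, no dots, 'e' form."""
    s = token.lower()
    plain = s.translate(_STRIP_UMLAUTS)
    expanded = s.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")
    return {s, plain, expanded}


if __name__ == "__main__":
    for word in ("Jörg", "Joerg", "Jorg"):
        print(word, "->", normalize_german(word))
    print(sorted(umlaut_variants("Böll")))
```

With this sketch, Jörg, Joerg and Jorg all end up as the same token, which is the effect described for the index. It also shows the cost mentioned in the thread: "Bär" and "Bar" become indistinguishable after normalization.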