Hello Jörg, could you share the configuration for an analyzer that uses german_normalize without stemming? I only need the umlaut expansion. Also, what do you mean by "at the right places in words" for Snowball?
Thanks!
Andrej

On Sunday, 30 November 2014 17:20:16 UTC+1, Jörg Prante wrote:
>
> Do not use regex; it will give wrong results.
>
> Elasticsearch comes with full support for German umlaut handling.
>
> If you install the ICU plugin, you can use an analysis setting like this:
>
> {
>   "index" : {
>     "analysis" : {
>       "filter" : {
>         "german_normalize_stem" : {
>           "type" : "snowball",
>           "name" : "German2"
>         }
>       },
>       "analyzer" : {
>         "stemmed" : {
>           "type" : "custom",
>           "tokenizer" : "standard",
>           "filter" : [
>             "lowercase",
>             "icu_normalizer",
>             "icu_folding",
>             "german_normalize_stem"
>           ]
>         },
>         "unstemmed" : {
>           "type" : "custom",
>           "tokenizer" : "standard",
>           "filter" : [
>             "lowercase",
>             "icu_normalizer",
>             "icu_folding",
>             "german_normalize"
>           ]
>         }
>       }
>     }
>   }
> }
>
> ICU handles German umlauts, and also case folding like "ss" and "ß".
>
> Snowball handles umlaut expansions (ae, oe, ue) at the right places in words.
>
> You can choose between stemmed and unstemmed analysis. Snowball tends to overstem words. The "german_normalize" token filter is copied from Snowball but works without stemming.
>
> The effect of the combination is that all German words like Jörg, Joerg, Jorg are reduced to jorg in the index.
>
> Best,
>
> Jörg
>
>
> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <kresimi...@gmail.com> wrote:
>
>> Hi Jürgen,
>>
>> We currently don't have large volumes of data to index, so we would like to return more results in the hope that the relevant ones still show up at the top. In the future, when we have more data, we'll have to sacrifice some use cases in order to provide more precise results for the rest of the users.
>>
>> I think I will try the regexp token approach to replace umlauts with their "e" forms to solve this double expansion problem.
>>
>> Best,
>>
>> Krešimir
>>
>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) wrote:
>>>
>>> Hi Krešimir,
>>> the correct term is "über" (over, above) or "hören" (hear) or "ändern" (change). When you cannot write umlauts, the correct alternative spelling in print is "ueber", "hoeren", "aendern". Everybody can write this in ASCII. However, non-speakers of German who still want to search for German terms are usually not aware of this and believe it works like accents in French, where "á" is lexically treated like "a". Those users are wrong in spelling "uber", "horen", "andern" because "u" and "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE letter :-)
>>>
>>> However, in order to provide a convenience to those users as well, you could decide that, to yield at least some meaningful results, you will also consider the versions without the umlaut dots equivalent. In that case, you want to map any token containing an umlaut (ä, ö, ü) to three alternatives: the umlaut form, the form without the umlaut marker, and the alternative spelling with 'e'. This won't let you distinguish between "Bar" (bar, the place to get a drink) and "Bär" (bear, the one giving you a great, dangerous hug). "Forderung" (demand) and "Förderung" (encouragement, facilitation, promotion, extraction [geol.]) are also quite different, just to give a few examples.
>>>
>>> For the proper recognition of those terms, you would normally use a dictionary of German, including some frequent proper names as well. So, if you look for "clown boll", you would not only get "Der Clown im Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines Clowns", because the query would be transformed into "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
>>> If you dare to normalize your indexed texts, so that "Boell" would already have been turned into "Böll", you could even make do with a disjunction of only the one correct form and the misspelling. Again, however, you would make use of a dictionary to perform such normalization. Ideally, you would even have a POS tagger in place, so you would only make such replacements where the name Böll is referred to, not the city of Bad Boll.
>>>
>>> It's a question of how much effort makes sense for your application. If you just want to index some German text, maybe you just want to turn all umlauts into the plain vowels for the purpose of indexing, but still keep the reference to the original for result display. Maybe that's sufficient. For larger volumes of documents, a more precise approach is recommended to avoid false positives.
>>>
>>> Cheers,
>>> --Jürgen
>>>
>>>
>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>
>>> Because, as far as I understand, in German it's semantically the same to write über or ueber (although ueber is less often used). I guess the only exception is personal names.
>>> Syntactically, "uber" is wrong, but users sometimes search for this as well.
>>>
>>>
>>> --
>>>
>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
>>> *i.A. Jürgen Wagner*
>>> Head of Competence Center "Intelligence"
>>> & Senior Cloud Consultant
>>>
>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
>>> E-Mail: juergen...@devoteam.com, URL: www.devoteam.de
>>> ------------------------------
>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>>>
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
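
On the question at the top of the thread about what "at the right places in words" means: as far as I can tell, the umlaut step folds ä, ö, ü to a, o, u, replaces ß with ss, and always drops the e of ae and oe, but drops the e of ue only when the u does not follow a vowel or q (so "quelle" and "neue" are left alone). Below is a minimal Python sketch of that heuristic, modelled on how the Lucene documentation describes its German normalization filter. The function name german_normalize_sketch and the three state names are my own inventions for illustration; this is not the code Elasticsearch actually runs, and real filters may differ in edge cases.

```python
def german_normalize_sketch(token: str) -> str:
    """Fold umlauts and their ae/oe/ue spellings to plain vowels.

    Illustrative only: a small state machine approximating the documented
    German2-style heuristic, not the actual Lucene/Elasticsearch code.
    """
    ORDINARY = 0  # previous char was a consonant (or start of token)
    BLOCKED = 1   # previous char forbids treating a following "u" as an umlaut
    UMLAUT = 2    # previous char may be an umlaut written as vowel + "e"

    folded = {"ä": "a", "ö": "o", "ü": "u"}
    out = []
    state = ORDINARY

    for ch in token.lower():
        if ch in "ao":
            # "a" and "o" may always start an "ae"/"oe" spelling
            out.append(ch)
            state = UMLAUT
        elif ch == "u":
            # "u" may start "ue" only when it does not follow a vowel or "q"
            out.append(ch)
            state = UMLAUT if state == ORDINARY else BLOCKED
        elif ch == "e":
            # drop the "e" only when it completes ae/oe/ue
            if state != UMLAUT:
                out.append(ch)
            state = BLOCKED
        elif ch in "iqy":
            out.append(ch)
            state = BLOCKED
        elif ch in folded:
            out.append(folded[ch])
            state = BLOCKED
        elif ch == "ß":
            out.append("ss")
            state = ORDINARY
        else:
            out.append(ch)
            state = ORDINARY

    return "".join(out)


if __name__ == "__main__":
    for word in ["Jörg", "Joerg", "Jorg", "über", "ueber", "quelle", "neue", "Bär", "Bar"]:
        print(word, "->", german_normalize_sketch(word))
```

With this sketch, "Jörg", "Joerg" and "Jorg" all come out as "jorg", which matches the effect Jörg describes. "Bär" and "Bar" both come out as "bar", which is the loss of precision Jürgen warns about.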