Duh. Four cases. For extra credit, what language is "wunder" in?

wunder
On 1/28/09 5:12 PM, "Walter Underwood" <wunderw...@netflix.com> wrote:

> I've done this. There are five cases for the tokens in the search
> index:
>
> 1. Tokens that are unique after stemming (this is good).
> 2. Tokens that are common after stemming (usually trademarks,
>    like LaserJet).
> 3. Tokens with collisions after stemming:
>    German "mit", "MIT" the university
>    German "Boot" (boat), English "boot" (a heavy shoe)
> 4. Tokens with collisions in the surface form:
>    Dutch "mobile" (plural of furniture), English "mobile"
>    German "die" (stemmed to "das"), English "die"
>
> You cannot fix every spurious match, but you can do OK with
> stemmed fields for each language and a raw (unstemmed surface
> token) field.
>
> I won't recommend weights, but you could have fields for
> text_en, text_de, and text_raw, for example.
>
> You really cannot automatically determine the language of a
> query, mostly because of proper nouns, especially trademarks.
> Identify the language of these queries:
>
> * Google
> * LaserJet
> * Obama
> * Las Vegas
> * Paris
>
> HTTP supports an Accept-Language header, but I have no idea
> how often that is sent. We honored that in Ultraseek, mostly
> because it was standard.
>
> Finally, if you are working with localization, please take the
> time to understand the difference between ISO language codes
> and ISO country codes.
>
> wunder
>
> On 1/28/09 4:47 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>> I'm not entirely sure about the fine points, but consider the
>> filters that are available that fold all the diacritics into their
>> low-ASCII equivalents. Perhaps using that filter at *both* index
>> and search time on the English index would do the trick.
>>
>> In your example, both would be "munchen". Straight English
>> would be unaffected by the filter, but any German words with
>> diacritics that crept in would be folded into their low-ASCII
>> "equivalents".
>> This would also work at index time, just in case
>> you indexed English text that had some German words.
>>
>> NOTE: My experience is more on the Lucene side than the Solr
>> side, but I'm sure the filters are available.
>>
>> Best
>> Erick
>>
>> On Wed, Jan 28, 2009 at 5:21 PM, Julian Davchev <j...@drun.net> wrote:
>>
>>> Hi,
>>> I currently have two indexes in Solr: one for the English version
>>> of the site and one for the German version. They use the English
>>> and german2 Snowball factories, respectively.
>>> Right now, depending on which language the website is currently
>>> in, I query the corresponding index.
>>> There is a requirement, though, that content must be found
>>> regardless of which language it is in.
>>> So, for example, a search for "muenchen" (which the german2
>>> Snowball factory correctly treats as "münchen") should also match
>>> in the English index. Right now it does not, as I suppose the
>>> English factory doesn't care about umlauts.
>>>
>>> Any pointers are more than welcome. I am considering synonyms,
>>> but that would be rather heavy to create and maintain.
>>> Cheers,
>>> JD
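The diacritic-folding idea Erick describes can be sketched in a few lines. This is a minimal toy illustration, not Lucene's actual filter code: it strips accents via Unicode decomposition, which is roughly what Lucene's ISOLatin1AccentFilter (later ASCIIFoldingFilter) does when applied at both index and query time.

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Fold diacritics to their base ASCII letters by decomposing
    characters (NFKD) and dropping the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Applied at both index and query time, the accented and plain
# surface forms converge on the same token:
print(ascii_fold("münchen"))  # -> munchen
print(ascii_fold("munchen"))  # -> munchen
```

One gap worth noting: the transliterated German spelling "muenchen" folds to itself ("muenchen", not "munchen"), so matching that form still depends on the ue -> ü handling that the german2 Snowball variant provides on the German side.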