On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:

> On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other search engines. When I search for "mispealled" Google gives me:
[...]
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind, than search engines.

Yes, you have, I acknowledge that now. I see now that at the time I made
my response to you, you had already replied recognising your error.
Unfortunately I had not seen that. So in that case, I withdraw my
comments and apologize.

> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
>
> Pure unicode == independent from any locale.

Unfortunately it is not that simple. The same code point can have
different meanings in different languages, and should be treated
differently when sorting. Sorting by raw Unicode code point gives the
right order for very few European languages, and not even for English.
A few examples:

* Swedish ä is a distinct letter of the alphabet, appearing after z:
  "a b c z ä" is sorted according to Swedish rules. But in German ä is
  considered to be the letter 'a' plus an umlaut, and is collated after
  'a': "a ä b c z" is sorted according to German rules.

* In German ö is considered to be a variant of o, equivalent to 'oe',
  while in Finnish ö is a distinct letter which cannot be expanded to
  'oe', and which appears at the end of the alphabet.

* Similarly, in modern English æ is a ligature of ae, while in Danish
  and Norwegian it is a distinct letter of the alphabet appearing after
  z: in English dictionaries, "Æsir" will be found with other "A" words,
  often expanded to "Aesir", while in Norwegian it will be found after
  "Z" words.

* Most European languages convert uppercase I to lowercase i, but
  Turkish has distinct letters for dotted and dotless I. According to
  Turkish rules, lowercase(I) is ı and uppercase(i) is İ.

While it is true that the Unicode character set is independent of
locale, for natural-language text processing it isn't enough to just
use Unicode: sorting and case conversion also need to know which
language's rules apply.
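To make the examples above concrete, here is a rough sketch in Python 3.
Everything in it is made up for illustration: the toy alphabet string
and the helper names (swedish_key, german_key, turkish_lower,
turkish_upper) are my own assumptions, not a real collation library,
and they only cover the handful of letters discussed here. Real code
would more likely rely on the platform's locale support or on an
implementation of the Unicode Collation Algorithm.

```python
# Rough illustration only: the table and helper names below are invented
# for this sketch and are not a real collation implementation.

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # assumed, simplified
UMLAUT_TO_BASE = str.maketrans("äöü", "aou")


def swedish_key(word):
    """Rank letters by position in the Swedish alphabet (ä sorts after z)."""
    # Raises ValueError for any character outside the toy alphabet.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word):
    """Compare umlauted vowels as their base vowel; the word breaks ties."""
    lowered = word.lower()
    return (lowered.translate(UMLAUT_TO_BASE), lowered)


def turkish_lower(text):
    """Lowercase using the Turkish dotted/dotless I rules."""
    # Handle the two capital I's first, then let str.lower() do the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text):
    """Uppercase using the Turkish dotted/dotless I rules."""
    return text.replace("i", "İ").replace("ı", "I").upper()


letters = ["ä", "z", "c", "b", "a"]

# The same five letters, ordered under two different sets of rules.
swedish_order = sorted(letters, key=swedish_key)
german_order = sorted(letters, key=german_key)

# Plain code point order puts every ASCII capital before every lowercase
# letter, which is not dictionary order even for English.
english_by_code_point = sorted(["apple", "Zebra"])
```

With these definitions swedish_order is a b c z ä and german_order is
a ä b c z, matching the two orderings above, while
english_by_code_point puts "Zebra" before "apple". Likewise
turkish_lower("I") gives ı and turkish_upper("i") gives İ, whereas the
locale-independent "I".lower() gives a plain i.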
--
Steven