> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' != '\N{LATIN SMALL LETTER O WITH DIAERESIS}'
>
> I guess the filesystem shouldn't treat these as the same (even though
> they are), but what if some webservice does? I suspect you should
> normalize both strings before comparing them in any blacklist, and
> what happens with surrogates when you normalize?
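
To make the quoted comparison concrete, here is a minimal sketch using the standard `unicodedata` module. The helper name `same_text` and the choice of NFC are assumptions made for illustration only; they do not come from the thread. The last part feeds a lone surrogate through `normalize()`, which in CPython appears to pass it through unchanged.

```python
import unicodedata

# The two spellings from the quoted message.
decomposed = '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}'
precomposed = '\N{LATIN SMALL LETTER O WITH DIAERESIS}'


def same_text(a, b, form='NFC'):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A plain comparison sees two different code point sequences.
raw_equal = decomposed == precomposed

# After normalization both sides are the same single code point.
normalized_equal = same_text(decomposed, precomposed)

# A lone surrogate (as produced by the surrogateescape error handler)
# has no decomposition, so normalization leaves it in place.
lone_surrogate = 'abc\udc80'
surrogate_preserved = unicodedata.normalize('NFC', lone_surrogate) == lone_surrogate

print(raw_equal, normalized_equal, surrogate_preserved)
```

Whether NFC, NFD, or one of the compatibility forms is the right choice depends on what the comparing side (filesystem, web service, blacklist) treats as equivalent.
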
I think the whole blacklist example is artificial. The string in the
blacklist is actually a Chinese "hello" greeting, so it surely isn't
the string being blacklisted.

For proper blacklisting, you would likely use substring searches,
case-insensitivity, transliterations, and perhaps even regular
expressions and word stemming. If you consider all these things,
proper or alternative encodings of the same text are just another
issue to consider.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev