> '\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}' != '\N{LATIN SMALL
> LETTER O WITH DIAERESIS}'
> 
> I guess the filesystem shouldn't treat these as the same (even though
> they are), but what if some webservice does? I suspect you should
> normalize both strings before comparing them in any blacklist, and
> what happens with surrogates when you normalize?

I think the whole blacklist example is artificial. The string in the
blacklist is actually a Chinese "hello" greeting, so it surely isn't
the string being blacklisted. For proper blacklisting, you would likely
use substring searches, case-insensitivity, transliterations, and
perhaps even regular expressions and word stemming. Once you take all
of these into account, canonical versus alternative encodings of the
same text are just one more issue to handle.
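
To illustrate only the normalization and case-insensitivity part, here is a
minimal sketch using the standard unicodedata module. The helper names
canonical() and is_blacklisted() are hypothetical, made up for this example,
and a plain substring match stands in for whatever matching a real blacklist
would need. It does not attempt transliteration or stemming.

```python
import unicodedata


def canonical(text):
    # Hypothetical helper: decompose, case-fold, then decompose again,
    # because case folding can itself produce non-normalized text.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def is_blacklisted(candidate, blacklist):
    # Hypothetical helper: substring match on the canonical forms.
    key = canonical(candidate)
    return any(canonical(entry) in key for entry in blacklist)


decomposed_o = "\N{LATIN SMALL LETTER O}\N{COMBINING DIAERESIS}"
composed_o = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"

# The raw strings differ, but their canonical forms compare equal.
raw_equal = decomposed_o == composed_o
canonical_equal = canonical(decomposed_o) == canonical(composed_o)
```

With this, is_blacklisted("HELLO W\N{LATIN CAPITAL LETTER O WITH DIAERESIS}RLD",
["wo\N{COMBINING DIAERESIS}rld"]) matches even though the case and the
composed/decomposed form both differ from the blacklist entry.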

Regards,
Martin


_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev