Thanks again. I will chunk my responses as your message has too much in it for me to process all at once...
On Jan 2, 9:34 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Thanks a lot Martin and Marc for the really great explanations! I was
> > wondering if it would be reasonable to imagine a utility that will
> > determine whether, for a given encoding, two byte strings would be
> > equivalent.
>
> But that is much easier to answer:
>
>     s1.decode(enc) == s2.decode(enc)
>
> Assuming Unicode's unification, for a single encoding, this should
> produce correct results in all cases I'm aware of.
>
> If you also have different encodings, you should add
>
>     def normal_decode(s, enc):
>         return unicodedata.normalize("NFKD", s.decode(enc))
>
>     normal_decode(s1, enc) == normal_decode(s2, enc)
>
> This would flatten out compatibility characters, and ambiguities
> left in Unicode itself.

Hmmn, true, it would be that easy. I am now not sure why I needed that
check, or how to use this version of it... I am always starting from one
string and decoding it; that may be lossy when the result is re-encoded
and compared to the original. However, it is clear that the test above
should always pass in this case, so doing it seems superfluous.

Thanks for the unicodedata.normalize() tip.

mario
--
http://mail.python.org/mailman/listinfo/python-list
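[For readers following along, here is a runnable sketch of the normalize-then-compare idea Martin describes. It uses Python 3 syntax, so the inputs are bytes objects; the precomposed vs. decomposed "é" pair is an illustrative example of mine, not from the thread.]

```python
import unicodedata

def normal_decode(s, enc):
    # Decode the byte string with the given encoding, then apply NFKD
    # normalization so compatibility characters and combining-sequence
    # variants compare equal.
    return unicodedata.normalize("NFKD", s.decode(enc))

# Two byte strings that differ at the byte level but are equivalent
# after decoding + normalization: precomposed vs. decomposed "é".
s1 = "caf\u00e9".encode("utf-8")    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
s2 = "cafe\u0301".encode("utf-8")   # "e" + U+0301 COMBINING ACUTE ACCENT

print(s1 == s2)                                                   # False
print(normal_decode(s1, "utf-8") == normal_decode(s2, "utf-8"))   # True
```

Note that NFKD also folds compatibility characters (e.g. ligatures, fullwidth forms) into their plain equivalents, which is broader than the canonical-only NFD form; whether that is what you want depends on how loose an "equivalence" you need.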