Martin wrote:

> > The point is that you can tell UTF-8 reliably.

RFC 3629 says "fairly reliably" rather than "reliably", but they mean the
same thing...

> > If the data decodes
> > as UTF-8, it *is* UTF-8, because no other encoding in the world
> > produces the same byte sequences (except for ASCII, which is
> > an UTF-8 subset).

or as the RFC puts it, "the probability that a string of characters in any
other encoding appears as valid UTF-8 is low, diminishing with increasing
string length".

:::

Ross Ridge wrote:

> It should be obvious that any 8-bit single-byte character set can
> produce byte sequences that are valid in UTF-8.

it should be fairly obvious that you don't know much about UTF-8...

</F>
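
For readers who want to try the check discussed in this thread, here is a minimal sketch, assuming Python 3 and its built-in strict UTF-8 decoder. The helper name `looks_like_utf8` and the sample byte strings are made up for illustration and do not come from the posts above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is a subset of UTF-8, so it always passes.
ascii_text = b"plain ASCII"

# Latin-1 bytes for "cafe au lait" with an e-acute (0xE9). In UTF-8, 0xE9
# starts a three-byte sequence, but the next byte (0x20) is not a
# continuation byte, so strict decoding fails.
latin1_text = b"caf\xe9 au lait"

# A Latin-1 byte pair (0xC3 0xA9, A-tilde then copyright sign) that also
# happens to be the valid UTF-8 encoding of e-acute (U+00E9).
ambiguous = b"\xc3\xa9"

print(looks_like_utf8(ascii_text))
print(looks_like_utf8(latin1_text))
print(looks_like_utf8(ambiguous))
```

The last sample shows that a short single-byte string can pass the check by coincidence. This is why the RFC wording quoted above speaks of a low probability that shrinks as the string gets longer, not of a guarantee.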