[EMAIL PROTECTED] wrote:
>
> I was playing with python encodings and noticed this:
>
> [EMAIL PROTECTED]:~$ python2.4
> Python 2.4 (#2, Dec 3 2004, 17:59:05)
> [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('\x9d', 'iso8859_1')
> u'\x9d'
> >>>
>
> U+009D is NOT a valid unicode character (it is not even a iso8859_1
> valid character)
It *IS* a valid Unicode and ISO-8859-1 character, so the behaviour of the
Python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document containing them is
windows-1252 - it is valid ISO-8859-1, but for a heuristic guess it's
probably safer to assume windows-1252.

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - hmm, I could try this myself,
because I do such a test in one of my projects, too ;)

> The same happens if I use 'latin-1' instead of 'iso8859_1'.
>
> This caught me by surprise, since I was doing some heuristics guessing
> string encodings, and 'iso8859_1' gave no errors even if the input
> encoding was different.
>
> Is this a known behaviour, or I discovered a terrible unknown bug in
> python encoding implementation that should be immediately reported and
> fixed? :-)
>
> happy new year,

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
-- 
http://mail.python.org/mailman/listinfo/python-list
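P.S. The 'iso8859_1_nocc' idea above could be sketched without going through
codecs.register, as a plain decode-then-check function plus a windows-1252
fallback. This is just a sketch in modern Python 3 syntax (function names are
mine, not an existing API):

```python
def decode_strict_latin1(data):
    """Decode bytes as ISO-8859-1, but reject C1 control characters
    (U+0080..U+009F), which almost never appear in real Latin-1 text."""
    text = data.decode('iso8859-1')
    for i, ch in enumerate(text):
        if '\x80' <= ch <= '\x9f':
            raise UnicodeDecodeError('iso8859-1', data, i, i + 1,
                                     'C1 control character')
    return text


def guess_decode(data):
    """Heuristic: try strict Latin-1 first; if a C1 control byte shows
    up, assume the data is really windows-1252 and decode it as that."""
    try:
        return decode_strict_latin1(data), 'iso8859-1'
    except UnicodeDecodeError:
        return data.decode('windows-1252'), 'windows-1252'
```

So b'caf\xe9' would come back as Latin-1, while bytes like 0x93/0x94
(curly quotes in windows-1252, C1 controls in Latin-1) would trigger the
fallback. Note that a few bytes (e.g. 0x9D) are unmapped in windows-1252,
so those still raise from the fallback decode.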