Chris Curvey wrote:
> Hey all,
>
> I'm trying to write something that will "fail fast" if one of my users
> gives me non-latin-1 characters.  So I tried this:
>
> >>> testString = "\x80"
> >>> foo = unicode(testString, "latin-1")
> >>> foo
> u'\x80'
>
> I would have thought that that should have raised an error, because
> \x80 is not a valid character in latin-1 (according to what I can
> find).  Is this the expected behavior, or am I missing something?

Depends on what you call 'latin-1'. The standard ISO 8859-1 defined
only displayable characters. If you used that definition, even the
basic ASCII carriage return, line feed and tab would raise an error.
However, according to wikipedia:

"""In 1992, the IANA registered the character map ISO_8859-1:1987,
more commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
code values 00-1F, 7F, and 80-9F. It thus provides for 256 characters
via every possible 8-bit value."""

'latin-1' and 'iso-8859-1' are the same encoding.

If you articulate your definition of "valid latin-1", we should be
able to help you with some Python code to check it for you.

>
> I'm on Windows, but I have explicitly set the character set to be
> latin-1 in sitecustomize.py

Why?? Don't do that. That's a self-inflicted double whammy.
(1) You should *not* assume that all the legacy str data your machine
will ever process is in only one encoding.
(2) On a Windows machine, your legacy data is extremely likely to be
encoded in a Microsoft-developed encoding (like cp1252), not latin-1.

>
> >>> import sys
> >>> sys.getdefaultencoding()
> 'latin-1'

HTH,
John
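
To make the "valid latin-1" question concrete, here is a minimal sketch of one possible definition: accept only the graphic characters of ISO 8859-1 (0x20-0x7E and 0xA0-0xFF), plus an optional whitelist of control bytes. The names is_printable_latin1, decode_strict_latin1 and allowed_controls are made up for illustration, and the choice of tab, CR and LF as the default whitelist is an assumption, not something the original poster asked for. The sketch is written for Python 3, where the input is a bytes object; the session quoted above is Python 2, where it would be a str.

```python
def _is_graphic(byte):
    # ISO 8859-1 assigns graphic characters only to 0x20-0x7E and 0xA0-0xFF.
    return 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF


def is_printable_latin1(data):
    """Return True if every byte of data is an ISO 8859-1 graphic character.

    The C0 controls (0x00-0x1F), DEL (0x7F) and the C1 controls
    (0x80-0x9F) are all rejected, including tab, CR and LF.
    """
    return all(_is_graphic(byte) for byte in data)


def decode_strict_latin1(data, allowed_controls=b"\t\r\n"):
    """Decode data as latin-1, failing fast on unexpected control bytes.

    Bytes listed in allowed_controls are let through; any other byte
    outside the graphic ranges raises ValueError.
    """
    for offset, byte in enumerate(data):
        if byte in allowed_controls or _is_graphic(byte):
            continue
        raise ValueError(
            "byte 0x%02X at offset %d is not a printable ISO 8859-1 character"
            % (byte, offset)
        )
    # The 'latin-1' codec itself never fails: it maps all 256 byte values.
    return data.decode("latin-1")


# The same byte means different things in the two encodings:
# latin-1 maps 0x80 to the C1 control U+0080, cp1252 maps it to U+20AC.
as_latin1 = b"\x80".decode("latin-1")
as_cp1252 = b"\x80".decode("cp1252")
```

With this definition decode_strict_latin1(b"\x80") raises ValueError, which is the "fail fast" behaviour asked for. Note that a user on Windows who types a euro sign will most likely send you that very byte, so it may be worth deciding whether you want to reject it or decode the input as cp1252 instead.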