On 09.01.2014 22:45, Antoine Pitrou wrote:
> On Thu, 9 Jan 2014 13:36:05 -0800
> Chris Barker <chris.bar...@noaa.gov> wrote:
>>
>> Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
>> that guaranteed to work with any binary data, and round-trip accurately?
>
> Yes, it is.

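
The round-trip guarantee is easy to check by hand. A minimal sketch, assuming Python 3 (the variable names are purely illustrative): latin-1 maps each of the 256 possible byte values to the code point with the same number, so decoding and then re-encoding should give back the original bytes.

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# latin-1 maps byte N to code point U+00NN, so decoding never fails ...
text = data.decode('latin-1')

# ... and encoding with the same codec restores the original bytes.
roundtrip = text.encode('latin-1')

print(len(text) == len(data))   # one code point per input byte
print(roundtrip == data)        # lossless round-trip
```

Note that this only shows the bytes survive; it says nothing about whether the decoded text is meaningful.
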
Just a word of caution:

Using 'latin-1' to mean "unknown encoding" can easily result in
Mojibake (unreadable text) entering your application, with dangerous
effects on your other text data.

E.g. "Marc-André" read using 'latin-1' if the string itself is
encoded as UTF-8 will give you "Marc-AndrÃ©" in your application.
(Yes, I see that a lot in applications and websites I use ;-))

Also note that indexing based on code points will likely break
that way as well, i.e. if you pass an index to an application
based on what you see in your editor or shell, those indexes
can be wrong when used on the encoded data. UTF-8 is an example
of a popular variable length encoding for Unicode, so you'll hit
this problem whenever dealing with non-ASCII UTF-8 data.

>> and will surrogateescape work for arbitrary binary data?
>
> Yes, it will.

The surrogateescape trick only works if you encode your data
using the same encoding that you used for decoding it.
Otherwise, you'll get a mix of the input encoding and the output
encoding as output.

Note that the error handler trick has an advantage over the
latin-1 trick: if you try to encode a Unicode string with escape
surrogates without using the error handler, it will fail, so you
at least know that there are "funny" code points in your output
string that need some extra care.

BTW: Perhaps it would be a good idea to backport the surrogateescape
error handler to Python 2.7 to simplify writing code which works
in both Python 2 and 3.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 10 2014)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
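
The caveats raised in the message above can be tried out in a few lines. This is only a sketch, assuming Python 3.1 or later (where the surrogateescape error handler is available); the sample string and the variable names are illustrative and not taken from any real application.

```python
original = "Marc-André"
raw = original.encode('utf-8')          # 'é' takes two bytes in UTF-8

# latin-1 trick: decoding always succeeds, but non-ASCII text turns into Mojibake.
as_latin1 = raw.decode('latin-1')
print(as_latin1 == "Marc-AndrÃ©")
# The Mojibake string is one code point longer, so indexes past the 'é' shift.
print(len(original), len(as_latin1))

# surrogateescape trick: undecodable bytes become lone surrogates U+DC80..U+DCFF.
as_ascii = raw.decode('ascii', 'surrogateescape')

# Same codec plus the error handler on the way out: exact round-trip.
print(as_ascii.encode('ascii', 'surrogateescape') == raw)

# Without the error handler the encode fails, which flags the "funny" code points.
try:
    as_ascii.encode('ascii')
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict encode failed:", exc.reason)

# A different output codec yields a mix: the escaped bytes stay UTF-8,
# while the newly appended 'é' is written as latin-1.
mixed = (as_ascii + "é").encode('latin-1', 'surrogateescape')
print(mixed)
```

The last step shows why the decode and encode codecs have to match: the result contains both the UTF-8 form and the latin-1 form of the same character.
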