On Wed, 27 May 2015 09:15 pm, anatoly techtonik wrote:

> Hi.
>
> This was labelled offtopic in python-ideas, so I edited and forwarded
> it here. Please CC as I am not subscribed.
>
>
> In short, I need a bulletproof way to convert from anything to
> unicode. This requires some kind of escaping to go forward and back.

Why do you need to go back? Just keep the node, and use that.

> Some helper function like u2b() (unicode to binary) and b2u() (that
> also removes escaping). So far I can't find any code that does just
> that.

def bytes2unicode(bytes):
    # Converts bytes to Unicode, allowing garbage (mojibake).
    return bytes.decode('latin1')

def unicode2bytes(unicode):
    # Convert unicode containing garbage (mojibake) to bytes.
    return unicode.encode('latin1')


It correctly does the round trip from any sequence of bytes to unicode and
back to bytes, losslessly:

py> import random
py> node = bytes([random.randrange(0, 256) for _ in range(100000)])
py> uni = bytes2unicode(node)
py> b = unicode2bytes(uni)
py> b == node
True


But take careful note that you can't start with Unicode and still expect to
round-trip losslessly. Many perfectly readable Unicode strings do *not*
convert to bytes:

py> unicode2bytes(u'ДЙ')  # two Cyrillic letters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in unicode2bytes
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1:
ordinal not in range(256)


The other catch is that if you start with correctly encoded bytes, they will
round-trip, but the intermediate Unicode string will display as garbage:

py> s = u'ДЙ'
py> node = s.encode('utf-8')
py> print(node)  # Correctly encoded UTF-8
b'\xd0\x94\xd0\x99'
py> node == unicode2bytes(bytes2unicode(node))  # round trips okay
True
py> print(repr(bytes2unicode(node)))  # but prints as crap
'Ð\x94Ð\x99'


> Background story. I need to print SCons graph. SCons is a build tool,
> so it has a graph of nodes - what depends on what. I have no idea
> what a node object could be. I know only that it can have human
> readable representation. Sometimes node is a filename in some
> encoding that is not utf-8, and without knowing the encoding,
> converting it to unicode is not possible without losing the information
> about that filename.

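A minimal sketch of one way to handle such a filename, assuming Python 3,
where the surrogateescape error handler (PEP 383) is available. The names
b2u() and u2b() are borrowed from the question and are hypothetical, not part
of any library; show() is likewise made up for illustration.

```python
def b2u(raw, encoding="utf-8"):
    # Decode bytes of unknown encoding. Bytes that are not valid in
    # `encoding` are smuggled through as lone surrogates (U+DC80..U+DCFF).
    return raw.decode(encoding, errors="surrogateescape")


def u2b(text, encoding="utf-8"):
    # Reverse of b2u(): lone surrogates become the original bytes again.
    return text.encode(encoding, errors="surrogateescape")


def show(text):
    # Hypothetical display helper: make the string safe to print by
    # writing any smuggled bytes as \udcXX escapes. This step is lossy
    # in the sense that it is meant for display, not for going back.
    return text.encode("utf-8", errors="backslashreplace").decode("utf-8")


koi8_name = "My Russian ДЙ name".encode("koi8-r")  # not UTF-8
text = b2u(koi8_name)
assert u2b(text) == koi8_name  # lossless round trip
print(show(text))
```

show() is only for printing; keep the string returned by b2u() if you need
the original bytes back.
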
py> filename = "My Russian ДЙ name"  # Unicode
py> b = filename.encode('koi8-r')  # Oops, not UTF-8!
py> b.decode("utf-8")  # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 11:
invalid continuation byte
py> b.decode("utf-8", errors="replace")  # lossy, but works
'My Russian �� name'
py> s = b.decode("utf-8", errors="surrogateescape")  # magic!
py> s
'My Russian \udce4\udcea name'


It round-trips as well:

py> s.encode("utf-8", errors="surrogateescape") == b
True


Converting this back to Python 2.7 is left as an exercise for the reader.


-- 
Steven