Hi,

How can I pack a unicode string using the struct module? If I simply use packed = struct.pack(fmt, hello) in the code below (where hello is a unicode string), I get this error: "error: argument for 's' must be a string". I keep reading that I have to encode it to a UTF-8 bytestring first, but that does not work: for some of the languages the output is mojibake and tofu. It would be annoying to have to know the encoding in which each individual language should be represented. I was hoping "unicode-internal" was the way to do it, but that does not reproduce the original string when I unpack it. :-(

# Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
import sys
import struct

greetings = [
    ['Arabic', [1575, 1604, 1587, 1604, 1575, 1605, 32, 1593, 1604, 1610, 1603, 1605], 'cp1256'],  # 'cp864' 'iso8859_6'
    ['Assamese', [2472, 2478, 2488, 2509, 2453, 2494, 2544], 'utf-8'],
    ['Bengali', [2438, 2488, 2488, 2494, 2482, 2494, 2478, 2497, 32, 2438, 2482, 2494, 2439, 2453, 2497, 2478], 'utf-8'],
    ['Georgian', [4306, 4304, 4315, 4304, 4320, 4335, 4317, 4305, 4304], 'utf-8'],
    ['Kazakh', [1057, 1241, 1083, 1077, 1084, 1077, 1090, 1089, 1110, 1079, 32, 1073, 1077], 'utf-8'],
    ['Russian', [1047, 1076, 1088, 1072, 1074, 1089, 1090, 1074, 1091, 1081, 1090, 1077], 'utf-8'],
    ['Spanish', [161, 72, 111, 108, 97, 33], 'cp1252'],
    ['Swiss German', [71, 114, 252, 101, 122, 105], 'cp1252'],
    ['Thai', [3626, 3623, 3633, 3626, 3604, 3637], 'cp874'],
    ['Walloon', [66, 111, 110, 100, 106, 111, 251], 'cp1252'],
]

for greet in greetings:
    language, chars, encoding = greet
    hello = "".join([unichr(i) for i in chars])
    #print language, hello, encoding  # prints everything as it should look
    endianness = "<" if sys.byteorder == "little" else ">"
    fmt = endianness + str(len(hello)) + "s"
    #https://code.activestate.com/lists/python-list/301601/
    #http://bytes.com/topic/python/answers/546519-unicode-strings-struct-files
    #packed = struct.pack(fmt, hello.encode('utf_32_le'))
    #packed = struct.pack(fmt, hello.encode(encoding))
    #packed = struct.pack(fmt, hello.encode('utf_8'))
    packed = struct.pack(fmt, hello.encode("unicode-internal"))
    print struct.unpack(fmt, packed)[0].decode("unicode-internal")
    # UnicodeDecodeError: 'unicode_internal' codec can't decode byte 0x00 in position 12: truncated input

Thank you in advance!

Regards,
Albert-Jan
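
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the count in an 's' format is a number of bytes, and struct truncates a string that is longer than that count. The code in the post sizes the field with len(hello), which is the number of characters, so any encoding that uses more than one byte per character would be cut short. The helper names pack_text and unpack_text below are hypothetical, UTF-8 is only an assumed choice of encoding, and the code is written to run on both Python 2.6 and Python 3.

```python
import struct


def pack_text(text, encoding="utf-8"):
    """Hypothetical helper: encode first, then size the 's' field in bytes."""
    data = text.encode(encoding)
    fmt = "<%ds" % len(data)  # count = number of encoded bytes, not characters
    return fmt, struct.pack(fmt, data)


def unpack_text(fmt, packed, encoding="utf-8"):
    """Hypothetical helper: reverse of pack_text."""
    (data,) = struct.unpack(fmt, packed)
    return data.decode(encoding)


# Thai greeting from the post: code points 3626, 3623, 3633, 3626, 3604, 3637
hello = u"\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"

fmt, packed = pack_text(hello)
assert unpack_text(fmt, packed) == hello  # round trip is lossless

# Sizing the field by len(hello) keeps only the first 6 of the 18 UTF-8 bytes.
short = struct.pack("<%ds" % len(hello), hello.encode("utf-8"))
assert len(packed) == 18 and len(short) == 6
```

If this is indeed the cause, one encoding such as UTF-8 should be enough for every language in the list. When the packed data is written to a file, the byte length would also have to be stored somewhere so that the same format string can be rebuilt for unpacking.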