>> How can I pack a unicode string using the struct module? If I simply use
>> packed = struct.pack(fmt, hello) in the code below (and 'hello' is a
>> unicode string), I get this: "error: argument for 's' must be a string". I
>> keep reading that I have to encode it to a utf-8 bytestring, but this does
>> not work (it yields mojibake and tofu output for some of the languages).
>
>You keep reading it because it is the right approach. You will not get
>mojibake if you decode the "packed" data before using it.
>
>Your code basically becomes
>
>for greet in greetings:
>    language, chars, encoding = greet
>    hello = "".join([unichr(i) for i in chars])
>    packed = hello.encode("utf-8")
>    unpacked = packed.decode("utf-8")
>    print unpacked
>
>I don't know why you mess with byte order, perhaps you can tell a bit about
>your actual use-case.

Hi Peter,

Thanks for helping me. I am writing binary files and I wanted to create
test data for this.

-- This has been a good test case, in that it exposed (a) a defect in my
program and (b) likewise, a gap in my knowledge. I realize how cp1252-ish
I am; for instance, I wrongly tend to assume that
len(someUnicodeString) == nbytes_of_that_unicode_string.

-- Re: messing with byte order: I read in M. Summerfield's "Programming in
Python 3" that it's advisable to always specify the byte order, for
portability of the data. But, now that you mention it, the way I did it, I
might as well omit it. Or, given that the binary format I am writing
contains information about the byte order, I might hard-code the byte
order (e.g. always write LE). That would follow Mark Summerfield's advice,
if I understand it correctly.
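A minimal sketch of what hard-coding the byte order could look like, assuming a length-prefixed layout: a 4-byte little-endian byte count followed by the utf-8 bytes. The layout and the names pack_text and unpack_text are made up for illustration; they are not the actual file format being written. It also shows that the character count and the byte count differ.

```python
import struct


def pack_text(text, encoding="utf-8"):
    """Encode text and prefix it with its byte length (hypothetical layout)."""
    data = text.encode(encoding)
    # "<" hard-codes little-endian with no padding, "I" is a 4-byte unsigned
    # length, and "%ds" holds exactly len(data) raw bytes.
    return struct.pack("<I%ds" % len(data), len(data), data)


def unpack_text(packed, encoding="utf-8"):
    """Reverse pack_text: read the length, then decode that many bytes."""
    (nbytes,) = struct.unpack_from("<I", packed, 0)
    (data,) = struct.unpack_from("<%ds" % nbytes, packed, 4)
    return data.decode(encoding)


hello = u"Gr\u00fc\u00df Gott"      # 9 characters
packed = pack_text(hello)

# The length stored in the file must be the byte count, not len(hello).
assert len(hello) == 9
assert len(hello.encode("utf-8")) == 11
assert unpack_text(packed) == hello
```

Because the format string starts with "<", the packed bytes should come out the same regardless of the machine's native byte order.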

-- (Aside from your advice to use utf-8) Given that sys.maxunicode == 65535
on my system (i.e., that is the highest code point my build of Python can
represent), I'd expect that I could write not only
u'blaah'.encode("unicode-internal"), but also u'blaah'.encode("ucs-2").
Instead I get:

    Traceback (most recent call last):
      File "<pyshell#4>", line 1, in <module>
        u'blaah'.encode("ucs-2")
    LookupError: unknown encoding: ucs-2

Why is the single label "unicode-internal" used to indicate both ucs-2 and
ucs-4? And why does the same Python version on my Linux computer support
code points up to 1114111? Can we conclude that Linux users are better
equipped to write a letter in Burmese or Aleut? ;-)

Thanks again!

Regards,
Albert-Jan