>> How can I pack a unicode string using the struct module? If I simply use
>> packed = struct.pack(fmt, hello) in the code below (and 'hello' is a
>> unicode string), I get this: "error: argument for 's' must be a string". I
>> keep reading that I have to encode it to a utf-8 bytestring, but this does
>> not work (it yields mojibake and tofu output for some of the languages).
>
>You keep reading it because it is the right approach. You will not get
>mojibake if you decode the "packed" data before using it.
>
>Your code basically becomes
>
>for greet in greetings:
> language, chars, encoding = greet
> hello = "".join([unichr(i) for i in chars])
> packed = hello.encode("utf-8")
> unpacked = packed.decode("utf-8")
> print unpacked
>
>I don't know why you mess with byte order; perhaps you could tell us a bit
>about your actual use case.
Hi Peter,
Thanks for helping me. I am writing binary files and I wanted to create test
data for this.
--this has been a good test case, in that it exposed a defect (a) in my
program and (b) likewise in my knowledge. I realize how cp1252-ish I am; for
instance, I wrongly tend to assume that len(someUnicodeString) ==
nbytes_of_that_unicode_string.
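
A minimal sketch of that mismatch, assuming three arbitrary sample strings
(the helper name byte_lengths is only illustrative):

```python
# Character count versus encoded byte count for a few sample strings.
samples = [u"hello", u"h\xe9llo", u"\u4f60\u597d"]  # ASCII, accented, CJK

def byte_lengths(text):
    """Return (len(text), bytes as UTF-8, bytes as UTF-16-LE)."""
    return (len(text),
            len(text.encode("utf-8")),
            len(text.encode("utf-16-le")))

for s in samples:
    print(byte_lengths(s))
```

Only for the pure-ASCII string does len() match the UTF-8 byte count.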
--re: messing with byte order: I read in M. Summerfield's "Programming in
Python 3" that it's advisable to always specify the byte order, for portability
of the data. But, now that you mention it, the way I did it, I might as well
omit it. Or, given that the binary format I am writing contains information
about the byte order, I might hard-code the byte order (e.g. always write LE).
That would follow Mark Summerfield's advice, if I understand it correctly.
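
A rough sketch of what hard-coding little-endian could look like with struct.
The record layout (a 4-byte length prefix followed by the UTF-8 bytes) and the
names pack_text/unpack_text are assumptions for illustration, not the actual
binary format I am writing:

```python
import struct

def pack_text(text):
    """Encode text as UTF-8, prefixed with its byte length as a uint32."""
    data = text.encode("utf-8")
    # "<" pins little-endian byte order and standard sizes on every platform.
    return struct.pack("<I%ds" % len(data), len(data), data)

def unpack_text(packed):
    """Inverse of pack_text: read the length prefix, then decode the payload."""
    (size,) = struct.unpack_from("<I", packed, 0)
    (data,) = struct.unpack_from("<%ds" % size, packed, 4)
    return data.decode("utf-8")

greeting = u"h\xe9llo"
assert unpack_text(pack_text(greeting)) == greeting
```

The length prefix counts bytes, not characters, so the reader knows how much
to decode regardless of the language of the text.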
--(Aside from your advice to use utf-8) Given that sys.maxunicode == 65535 on
my system (i.e., that is the highest code point my build of Python can
represent), I'd expect that I could write not only
u'blaah'.encode("unicode-internal"), but also u'blaah'.encode("ucs-2"):
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    u'blaah'.encode("ucs-2")
LookupError: unknown encoding: ucs-2
Why is the label "unicode-internal" used to indicate both ucs-2 and ucs-4? And
why does the same Python version on my Linux computer use 1114111 code points?
Can we conclude that Linux users are better equipped to write a letter in
Burmese or Aleut? ;-)
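
For reference, a small sketch for checking which codec names the registry
knows, and how many bytes "utf-16-le" needs per character. For characters up
to U+FFFF, UTF-16 produces the same two bytes that UCS-2 would. The helper
names has_codec and utf16le_size are made up for this example:

```python
import codecs
import sys

def has_codec(name):
    """Return True if Python's codec registry knows this encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def utf16le_size(text):
    """Bytes needed to store text as UTF-16-LE (no byte order mark)."""
    return len(text.encode("utf-16-le"))

for name in ("ucs-2", "utf-16-le"):
    print(name, has_codec(name))

# U+1000 (a Burmese letter) fits in one 2-byte unit.
# U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair (4 bytes).
print(utf16le_size(u"\u1000"), utf16le_size(u"\U0001F600"))
print(sys.maxunicode)  # 65535 on a narrow build, 1114111 on a wide build
```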
Thanks again!
Regards,
Albert-Jan
_______________________________________________
Tutor maillist - [email protected]