<snip>

> * some encodings are more compact than others (e.g. Latin-1 uses
>   one byte per character, while UTF-32 uses four bytes per
>   character).

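To put numbers on the compactness point, here is a minimal sketch (not from the thread) that counts how many bytes a string occupies in a few codecs. The helper name `encoded_sizes` is made up for illustration, and the `-le` codec variants are used so that the byte-order mark is not counted.

```python
def encoded_sizes(text, encodings):
    """Return {encoding name: number of bytes `text` occupies in it}."""
    return dict((enc, len(text.encode(enc))) for enc in encodings)

# Thai greeting quoted later in this message (6 code points).
thai = u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35'
english = u'Greetings and salutations'

for label, text in (('Thai', thai), ('English', english)):
    sizes = encoded_sizes(text, ('utf-8', 'utf-16-le', 'utf-32-le'))
    for enc in sorted(sizes):
        print('%-8s %-10s %3d bytes' % (label, enc, sizes[enc]))
```

For pure ASCII text UTF-32 costs four times as much as UTF-8; for the Thai string UTF-8 needs three bytes per character against four for UTF-32.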
I read that the performance of UTF-32 is better ("UTF-32 advantage: you
don't need to decode stored data to the 32-bit Unicode code point for e.g.
character by character handling. The code point is already available right
there in your array/vector/string.").
http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
But given that UTF-32 is a memory hog, should one conclude that it's
usually not a good idea to use it (esp. in Python)?

>> but this does not work (it yields mojibake and tofu output for
>> some of the languages).
>
> It would be useful to see an example of this.
>
> But if you do your encoding/decoding correctly, using the right
> codecs, you should never get mojibake. You only get that when
> you have a mismatch between the encoding you think you have and
> the encoding you actually have.
>
>> It's annoying if one needs to know the encoding in which each
>> individual language should be represented. I was hoping
>> "unicode-internal" was the way to do it, but this does not
>> reproduce the original string when I unpack it.. :-(
>
> Yes, encodings are annoying. The sooner that all encodings other
> than UTF-8 and UTF-32 disappear the better :)

So true ;-)

> The beauty of using UTF-8 instead of one of the many legacy
> encodings is that UTF-8 can represent any character, so you don't
> need to care about the individual language, and it is compact (at
> least for Western European languages).

Later you write "You need a variable-length struct, of course.". Is this
because ASCII is a subset of UTF-8? The thing is, the binary format I am
writing (spss .sav) uses *fixed* column widths. This means that, even
when I only use the ASCII subset of UTF-8, I still need to assume the
worst-case scenario, namely 3 bytes per symbol, right?

> Why are you using struct for this? If you want to convert Unicode
> strings into a sequence of bytes, that's exactly what the encode
> method does. There's no need for struct.

I am using struct to read/write binary data.

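A rough sketch of what fitting UTF-8 text into a fixed-width field could look like. This is only an illustration, not the actual .sav layout: the names `pack_fixed` and `unpack_fixed` are hypothetical, and padding with spaces is an assumption. On the worst case: UTF-8 needs at most 3 bytes for characters in the Basic Multilingual Plane, but 4 bytes for characters outside it.

```python
import struct

def pack_fixed(text, width, pad=b' '):
    """Encode `text` as UTF-8 and fit it into exactly `width` bytes.

    Assumption for this sketch: short values are padded with spaces.
    """
    data = text.encode('utf-8')
    if len(data) > width:
        # Cut to the field width, then drop a multi-byte sequence that
        # was split by the cut so the field stays valid UTF-8.
        data = data[:width].decode('utf-8', 'ignore').encode('utf-8')
    # struct's 's' format would pad with NUL bytes on its own, so pad
    # explicitly before packing.
    return struct.pack('%ds' % width, data.ljust(width, pad))

def unpack_fixed(field, pad=b' '):
    """Inverse of pack_fixed. Trailing pad bytes are stripped, so
    trailing spaces in the original text are not preserved."""
    (data,) = struct.unpack('%ds' % len(field), field)
    return data.rstrip(pad).decode('utf-8')

field = pack_fixed(u'Gr\xfcezi', 8)      # 7 bytes of UTF-8 plus 1 pad byte
assert len(field) == 8
assert unpack_fixed(field) == u'Gr\xfcezi'
```

The point of the truncation step is that a fixed width is counted in bytes, not characters, so a naive slice can cut a character in half and produce mojibake on the way back.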
I created the 'greetings' code to test my program (and my knowledge). As
I said to Peter Otten, both were/are imperfect ;-). Struct needs a
bytestring, not a unicode string, hence I needed to convert my unicode
strings first. I used these languages because I suspected I often get
away with errors because 'my' encoding (cp1252) is fairly easy.

> greetings = [
>     ('Arabic',
>      u'\u0627\u0644\u0633\u0644\u0627\u0645\u0020\u0639\u0644\u064a\u0643\u0645',
>      'cp1256'),
>     ('Assamese',
>      u'\u09a8\u09ae\u09b8\u09cd\u0995\u09be\u09f0',
>      'utf-8'),
>     ('Bengali',
>      u'\u0986\u09b8\u09b8\u09be\u09b2\u09be\u09ae\u09c1 \u0986\u09b2\u09be\u0987\u0995\u09c1\u09ae',
>      'utf-8'),
>     ('English', u'Greetings and salutations',
>      'ascii'),
>     ('Georgian',
>      u'\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0',
>      'utf-8'),
>     ('Kazakh',
>      u'\u0421\u04d9\u043b\u0435\u043c\u0435\u0442\u0441\u0456\u0437 \u0431\u0435',
>      'utf-8'),
>     ('Russian',
>      u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435',
>      'utf-8'),
>     ('Spanish', u'\xa1Hola!', 'cp1252'),
>     ('Swiss German', u'Gr\xfcezi', 'cp1252'),
>     ('Thai',
>      u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35',
>      'cp874'),
>     ('Walloon', u'Bondjo\xfb', 'cp1252'),
>     ]
> for language, greet, encoding in greetings:
>     print u"Hello in %s: %s" % (language, greet)
>     for enc in ('utf-8', 'utf-16', 'utf-32', encoding):
>         bytestring = greet.encode(enc)
>         print "encoded as %s gives %r" % (enc, bytestring)
>         if bytestring.decode(enc) != greet:
>             print "*** round-trip encoding/decoding failed ***"
>
>
> Any of the byte strings can then be written directly to a file:
>
> f.write(bytestring)
>
> or embedded into a struct. You need a variable-length struct, of course.

See above. I believe I've got it working for character data already; now
I still need to check whether I can also store e.g. Chinese metadata in
my spss file.

> My advice: stick to Python unicode strings internally, and always write
> them to files as UTF-8.

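For plain text files, that advice can be sketched roughly as follows. The helper names `write_utf8` and `read_utf8` are made up for this example; `io.open` is used because it accepts an `encoding` argument on both Python 2.6+ and Python 3. It covers text files only, not a binary format like .sav.

```python
import io
import os
import tempfile

def write_utf8(path, text):
    """Write a unicode string to `path`, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def read_utf8(path):
    """Read `path` and decode its UTF-8 contents to a unicode string."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Round-trip the Russian greeting from the list above via a temp file.
greet = u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    write_utf8(path, greet)
    assert read_utf8(path) == greet
finally:
    os.remove(path)
```

Keeping the encode/decode step at the file boundary means the rest of the program only ever sees unicode strings.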
Thanks Steven, I appreciate it!

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor