moin Mathieu, moin all,

On 2009-01-15 20:45:13, Mathieu Bouchard <ma...@artengine.ca> appears to have written:
> On Thu, 15 Jan 2009, Bryan Jurish wrote:
>
>> byte-strings are IMHO the more basic representation (a
>> char* is still a char*, even in this post-unicode world).
>
> What happened is that people switched to UTF-8 instead of some
> fixed-size encoding because many apps that assume that a character is a
> byte will work anyway.

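As a minimal sketch of why byte-oriented code tends to keep working on UTF-8 (this is an illustration, not code from the thread): every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter can never show up inside an encoded character. The helper name `split_fields` and the sample string are assumptions made up for the example.

```python
def split_fields(raw, delimiter=b","):
    """Split a UTF-8 byte string on an ASCII delimiter without decoding it."""
    return raw.split(delimiter)

# "café,naïve", written with escapes so the code points are unambiguous.
text = "caf\u00e9,na\u00efve"
raw = text.encode("utf-8")

# The splitter only ever looks at bytes; it knows nothing about UTF-8.
fields = split_fields(raw)
decoded = [field.decode("utf-8") for field in fields]

# Bytes belonging to multi-byte sequences are all >= 0x80,
# so none of them can be mistaken for the ASCII comma (0x2C).
multibyte = [b for b in raw if b >= 0x80]

print(decoded)
print(len(multibyte))  # two 2-byte characters -> 4 high-bit bytes
```

Each field still decodes cleanly because the split never lands in the middle of a character.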
UTF-8 also does a pretty good job of compactly representing latin
character sets for natural language data, where non-ASCII characters
tend to be relatively infrequent anyways. UTF-16 and UTF-32 are pretty
wasteful in these cases. (Of course, I'm biting my own tail with this
point, since the [pdstring] representation is even more wasteful than
UTF-32 ;-)

> Just don't ask those apps to say how many
> characters there are in a string though. You have to pretend that all
> the "special" characters are pairs of characters instead (when they are
> not triplets).

Indeed. Ugly but true.

> I gather that it'll take a long time before Pd gets unicode support...

I suspect you're right.

>> ... except if you're building rsp. reading a persistent index for a
>> large file, in which case tell() & seek() are likely to be a wee bit
>> faster than parsing and counting variable-length-encoded characters ...
>
> right.

... or calling malloc(), or doing pretty much any other low-level fiddly
stuff ...

marmosets,
	Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jur...@ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology
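
To make the points discussed above concrete, here is a rough sketch (an illustration only, not code from the thread or from [pdstring]). It compares the encoded size of a mostly-ASCII string in UTF-8, UTF-16 and UTF-32, shows that the byte count differs from the character count, and builds a byte-offset line index with tell() and seek(). The function names `build_line_index` and `read_line`, the sample data, and the use of an in-memory `io.BytesIO` in place of a large file are all assumptions for the example.

```python
import io

# "naïve café": 10 characters, 2 of them non-ASCII.
text = "na\u00efve caf\u00e9"
n_chars = len(text)                      # 10 code points
n_utf8 = len(text.encode("utf-8"))       # 12 bytes
n_utf16 = len(text.encode("utf-16-le"))  # 20 bytes (no BOM)
n_utf32 = len(text.encode("utf-32-le"))  # 40 bytes (no BOM)
print(n_chars, n_utf8, n_utf16, n_utf32)


def build_line_index(stream):
    """Record the byte offset at which each line of a binary stream starts."""
    offsets = []
    while True:
        pos = stream.tell()
        if not stream.readline():
            break
        offsets.append(pos)
    return offsets


def read_line(stream, offsets, n):
    """Jump straight to line n by byte offset; no character counting needed."""
    stream.seek(offsets[n])
    return stream.readline().decode("utf-8").rstrip("\n")


# "zéro" occupies 5 bytes in UTF-8, so the second line starts at byte 6,
# whereas its character offset would be 5.
data = "z\u00e9ro\nun\ndeux\n".encode("utf-8")
stream = io.BytesIO(data)
offsets = build_line_index(stream)
print(offsets)                         # [0, 6, 9]
print(read_line(stream, offsets, 2))   # deux
```

The index stores byte positions rather than character positions, which is what lets seek() jump directly to a record without decoding everything before it.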