moin Mathieu, moin all,

On 2009-01-15 20:45:13, Mathieu Bouchard <ma...@artengine.ca> appears to
have written:
> On Thu, 15 Jan 2009, Bryan Jurish wrote:
> 
>> byte-strings are IMHO the more basic representation (a
>> char* is still a char*, even in this post-unicode world).
> 
> What happened is that people switched to UTF-8 instead of some
> fixed-size encoding because many apps that assume that a character is a
> byte will work anyway. 

UTF-8 also does a pretty good job of compactly representing Latin
character sets for natural language data, where non-ASCII characters
tend to be relatively infrequent anyway.  UTF-16 and UTF-32 are pretty
wasteful in these cases.  (Of course, I'm biting my own tail with this
point, since the [pdstring] representation is even more wasteful than
UTF-32 ;-)
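
To put rough numbers on the size difference, here is a minimal Python
sketch.  The sample sentence is made up for illustration, and the "-le"
codec variants are used only so that no byte-order mark is counted.

```python
# Compare the encoded size of a mostly-ASCII sample in three encodings.
# The sample text is a hypothetical example, not taken from any real data.
sample = "Die Bären essen süße Äpfel in Potsdam."

# The "-le" variants skip the byte-order mark, so only payload is counted.
sizes = {enc: len(sample.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for enc, nbytes in sizes.items():
    print(f"{enc:10s} {nbytes:4d} bytes for {len(sample)} characters")
```

For text like this, UTF-8 should stay close to one byte per character,
while UTF-16 and UTF-32 spend a fixed two and four bytes per character
respectively.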

> Just don't ask those apps to say how many
> characters there are in a string though. You have to pretend that all
> the "special" characters are pairs of characters instead (when they are
> not triplets).

Indeed.  Ugly but true.
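
A small Python sketch of the mismatch, assuming UTF-8 input; the sample
string and the helper name count_code_points are invented for
illustration.  A byte-oriented program only gets the right character
count if it skips UTF-8 continuation bytes (those of the form 10xxxxxx).

```python
# Byte length versus character count for a UTF-8 string.
# The sample string is a hypothetical example.
text = "na\u00efve caf\u00e9 \u20ac5"   # contains i-diaeresis, e-acute, euro sign
data = text.encode("utf-8")

def count_code_points(raw: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

# How many bytes each character occupies once encoded.
widths = [len(ch.encode("utf-8")) for ch in text]

print(len(data))                 # what a byte-counting app reports
print(count_code_points(data))   # the actual number of code points
print(widths)
```

The two accented letters take two bytes each and the euro sign takes
three, which is where the "pairs" and "triplets" come from.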

> I gather that it'll take a long time before Pd gets unicode support...

I suspect you're right.

>> ... except if you're building or reading a persistent index for a
>> large file, in which case tell() & seek() are likely to be a wee bit
>> faster than parsing and counting variable-length-encoded characters ...
> 
> right.

... or calling malloc(), or doing pretty much any other low-level fiddly
stuff ...
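
For the index case, a minimal Python sketch of what I mean, using an
in-memory byte stream as a stand-in for a large file.  The function
names build_line_index and read_line, and the sample data, are
hypothetical.  The index stores byte offsets from tell(), so a lookup is
a single seek() and never has to decode the preceding text.

```python
import io

def build_line_index(stream):
    """Return the byte offset at which each line of a binary stream starts."""
    offsets = []
    while True:
        pos = stream.tell()          # byte position before reading the line
        if not stream.readline():    # empty result means end of stream
            break
        offsets.append(pos)
    return offsets

def read_line(stream, offsets, i):
    """Jump straight to line i by byte offset and decode only that line."""
    stream.seek(offsets[i])
    return stream.readline().decode("utf-8").rstrip("\n")

# Stand-in for a large file; the second line contains a two-byte character.
data = "first line\nsecond line with \u00fc\nthird line\n".encode("utf-8")
stream = io.BytesIO(data)

offsets = build_line_index(stream)
print(offsets)
print(read_line(stream, offsets, 2))
```

The same two functions should work unchanged on a real file opened with
open(path, "rb"), and the offsets list can be written to disk to make
the index persistent.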

marmosets,
        Bryan

-- 
Bryan Jurish                           "There is *always* one more bug."
jur...@ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology


_______________________________________________
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management -> 
http://lists.puredata.info/listinfo/pd-list
