Char encoding

Nick Ing-Simmons Sat, 05 Aug 2000 04:22:55 -0700
Benjamin Stuhl <[EMAIL PROTECTED]> writes:
>> No, that's the beauty of utf8: the C datatype is still char* and as long
>> as you stick to 7-bit ASCII you won't know the difference. wchar_t
>> comes from a totally different school of thought, where all your strings
>> are instantly incompatible and take twice or four times the memory.
>> 
>> Larry knew what he was doing when he decided on utf8.
>
>It has also led to the perl5 internals being, to put it
>bluntly, a horrible mess. 

Agreed - but that is due to grafting it in late - and possibly 
trying to be too clever intuiting whether existing perl5-code is 
working on bytes or chars.

But the goal was to avoid a 100Mbyte ASCII "string" becoming a 400Mbyte
UTF32 "string" with 300Mbytes of 0x000000.
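
The size argument can be checked with a small sketch. It is written in
Python purely for illustration and is not taken from the perl5 sources;
the sample string and variable names are made up. It assumes a pure
7-bit ASCII input and compares the UTF-8 and UTF-32 encoded sizes.

```python
# Illustrative only: compare encoded sizes of a pure-ASCII string.
# The sample text is hypothetical: 12 ASCII characters repeated 1000 times.
text = "hello, world" * 1000

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
# Explicit byte order ("-le") so that no byte order mark is prepended.
utf32_bytes = text.encode("utf-32-le")

# For 7-bit ASCII input the UTF-8 form is byte-for-byte identical to ASCII.
assert utf8_bytes == ascii_bytes

# UTF-32 spends four bytes per character; for ASCII, three of them are zero.
zero_padding = utf32_bytes.count(0)

print(len(utf8_bytes), len(utf32_bytes), zero_padding)
```

Scaled up, the same 1:4 ratio turns 100Mbytes of ASCII into 400Mbytes of
UTF-32, 300Mbytes of which are zero bytes.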

>And forget about the regex
>engine.

We cannot do that ;-) 
Perhaps the regex engine should always force UTF8 form?


>Perhaps if it was designed in from the beginning things
>would be better, 

That is _our_ job - to make it better.

>but this is something that needs serious 
>discussion.

Consider it started ...

-- 
Nick Ing-Simmons