Benjamin Stuhl <[EMAIL PROTECTED]> writes:
>> No, that's the beauty of utf8: the C datatype is still char* and as long
>> as you stick to 7-bit ASCII you won't know the difference. wchar_t
>> comes from a totally different school of thought, where all your strings
>> are instantly incompatible and take twice or four times the memory.
>>
>> Larry knew what he was doing when he decided on utf8.
>
>It has also led to the perl5 internals being, to put it
>bluntly, a horrible mess.
Agreed - but that is due to grafting it in late - and possibly to
trying to be too clever about intuiting whether existing perl5 code is
working on bytes or chars.
But the goal was to avoid a 100Mbyte ASCII "string" becoming a 400Mbyte
UTF32 "string" in which 300Mbytes are nothing but zero padding
(0x000000 in front of every character).
>And forget about the regex engine.
We cannot do that ;-)
Perhaps the regex engine should always force UTF8 form?
>Perhaps if it was designed in from the beginning things would be better,
That is _our_ job - to make it better.
>but this is something that needs serious discussion.
Consider it started ...
--
Nick Ing-Simmons