Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>> Jarkko - do you want to do that for 5.7.1 - i.e. before :utf8 layer
>> sees the light in a "release" ?
>
>I can't say that I have followed the UTFEBCDIC discussion with all
>the attention it would deserve -- but putting on my Joe User hat,
>I would find it at the very least curious if ":utf8" didn't
>mean UTF-8 as it is defined (in an RFC). Then again, I haven't
>read the UTF-EBCDIC Unicode TR (or do they call it an UAX these
>days) for ages. If I had a quick summary of what now happens
>in EBCDIC with :utf8, how is that different from say, UNIX,
>I would have to wave my hands less.
:utf8 as a layer causes perl to sv_utf8_upgrade() strings before writing,
and marks incoming strings with SvUTF8_on. Thus it makes the handle
read/write perl's internal form and allows "wide" characters.
Now, on ASCII machines the internal form is UTF-8, but on EBCDIC machines
(even ones running something-like-UNIX) it is UTF-EBCDIC (tr16).
(I don't think tr16 has reached UAX status yet.)
In both cases the UTF-* form has the property that common characters
corresponding to at least U+0000 to U+007F are "invariant" compared
to how they are represented on the "native" side. Thus 'A' is 0x41
in ASCII and UTF-8, and 'A' is 0xC1 in EBCDIC and UTF-EBCDIC.
So if you look at a UTF-X file in a native text editor you can still read
perl scripts, docs in English, etc.
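
To make the 'A' example concrete, here is a small Python sketch. It is
only an illustration: cp037 is my assumption for "EBCDIC" (no particular
code page is named above), and Python has no UTF-EBCDIC codec, so only
the native side of the EBCDIC case is shown.

```python
# 'A' on the ASCII side versus an EBCDIC code page.
# Assumption: cp037 stands in for "EBCDIC"; other EBCDIC code pages
# differ in places, though not for the letter 'A'.
ascii_a = "A".encode("ascii")
utf8_a = "A".encode("utf-8")
ebcdic_a = "A".encode("cp037")

print(ascii_a.hex(), utf8_a.hex(), ebcdic_a.hex())

# Invariance on the ASCII side: every code point U+0000..U+007F is the
# same single byte in UTF-8 as it is in ASCII.
ascii_invariant = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)
print(ascii_invariant)
```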
But as UTF-EBCDIC is readable in EBCDIC it is obviously not the same
as UTF-8. What is more, it encodes to different lengths (5 payload bits
per continuation byte rather than UTF-8's 6).
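
A rough sketch of the length difference, assuming the I8 intermediate
form described in tr16 (trailing bytes of the shape 101xxxxx). The
helper name encode_i8 is hypothetical, not anything in perl or Encode,
and the sketch skips the final byte-for-byte permutation that makes the
result EBCDIC-friendly. So the byte values are not real UTF-EBCDIC; only
the lengths carry over, since that permutation maps one byte to one byte.

```python
def encode_i8(cp: int) -> bytes:
    """Encode one code point in tr16's I8 form (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:
        # U+0000..U+009F are single bytes (ASCII plus the C1 controls).
        return bytes([cp])
    # (total bytes, lead-byte prefix, first code point that no longer fits)
    for nbytes, prefix, limit in ((2, 0xC0, 0x400), (3, 0xE0, 0x4000),
                                  (4, 0xF0, 0x40000), (5, 0xF8, 0x110000)):
        if cp < limit:
            break
    tail = []
    for _ in range(nbytes - 1):
        # Trailing byte: 101xxxxx, i.e. 5 payload bits (UTF-8 carries 6).
        tail.append(0xA0 | (cp & 0x1F))
        cp >>= 5
    return bytes([prefix | cp] + tail[::-1])


for ch in ("A", "\u00e9", "\u0416", "\u4e2d"):
    utf8_len = len(ch.encode("utf-8"))
    i8_len = len(encode_i8(ord(ch)))
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, I8 {i8_len} bytes")
```

With fewer bits per trailing byte, the same character can need one more
byte than in UTF-8, as U+0416 and U+4E2D do here.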
Now the intent of ":utf8" layer was to make it simple and efficient
to do the common thing one would do with Unicode text. So making it do
the above is "right" (apart from the name).
With a little copy/paste from utf8.c => Encode.xs it would be easy
enough to make the heavyweight ":encoding()" layer produce either
format on either platform.
--
Nick Ing-Simmons