Simon Cozens <[EMAIL PROTECTED]> writes:
>
>So, before we start even thinking about what we need, it's time to look at the
>vexed question of string representation. How do we do Unicode without getting
>into the horrendous non-Latin1 cockups we're seeing on p5p right now?
Well - my theorist's answer is that everything is Unicode - like Java.
As I pointed out on p5p even EBCDIC machines can use that model - but
the downside is that ord('A') == 65 which will break backward compatibility
with EBCDIC scripts.
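
To make the ord('A') point concrete, here is a minimal sketch in C. It assumes
the usual EBCDIC layout for upper-case letters (A-I at 0xC1-0xC9, J-R at
0xD1-0xD9, S-Z at 0xE2-0xE9). The function name is made up for illustration
and is not part of perl.

```c
/* Hypothetical helper: map an EBCDIC upper-case letter to its Unicode
   code point. Returns -1 for any byte outside the three letter runs. */
static int ebcdic_upper_to_unicode(unsigned char c)
{
    if (c >= 0xC1 && c <= 0xC9)      /* A..I */
        return 0x41 + (c - 0xC1);
    if (c >= 0xD1 && c <= 0xD9)      /* J..R */
        return 0x4A + (c - 0xD1);
    if (c >= 0xE2 && c <= 0xE9)      /* S..Z */
        return 0x53 + (c - 0xE2);
    return -1;                       /* not an upper-case letter */
}
```

An existing EBCDIC script sees 'A' as 193 (0xC1); under the everything-is-Unicode
model the same character reports 65, and the letters stop being three separate
runs with gaps between them.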
If perl5.7+ EBCDIC continues down its alternate road
and we need to be able to translate perl5 -> perl6 I strongly suspect
that perl6 cannot use the "java-oid" model either as the programmer's
intent will not be obvious enough to auto-translate.
I still haven't grasped what the current EBCDIC "model as seen by perl
programmer" _is_.
>Larry
>suggested aeons ago that everything is an array of numbers, and Perl shouldn't
>care what those numbers represent. But at some point, it has to, and that
>means things have to be tagged with their character repertoires and encodings.
Tagging a string with a repertoire and encoding is horrible - you are aware
of the trickiness of even getting the SvUTF8 bit "right". To have
a general representation carried around we need a pointer rather than just a bit
and we cannot say
    if (SvUTF8(sv))
we have to say
    if (SvENCODING(sv)->some_predicate)
e.g.
    if (SvENCODING(sv_a) != SvENCODING(sv_b))
    {
        if (SvENCODING(sv_a)->is_superset_of(SvENCODING(sv_b)))
        {
            /* a's encoding can hold everything in b */
            sv_upgrade_to(sv_b, SvENCODING(sv_a));
        }
        else if (SvENCODING(sv_b)->is_superset_of(SvENCODING(sv_a)))
        {
            /* b's encoding can hold everything in a */
            sv_upgrade_to(sv_a, SvENCODING(sv_b));
        }
        else
        {
            /* neither contains the other: find a third that holds both */
            Encoding *x = find_superset_encoding(SvENCODING(sv_a), SvENCODING(sv_b));
            sv_upgrade_to(sv_a, x);
            sv_upgrade_to(sv_b, x);
        }
    }
Personally I would not use such a beast.
The only sane compromise I can imagine is close to what we have at the
moment with maybe a few extra special cases in the "flags" bits:
    ASCII only (0..7f)
    Native-single-byte (iso8859-x, IBM1047)
    wchar_t
    UTF-8
    UNICODE
There needs to be a hierarchy of _repertoires_ such that:
ASCII is subset of Native is subset of wchar_t is subset of UNICODE.
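
A rough sketch of why a linear chain like that is cheap to work with: if the
repertoires are ranked in subset order, "is subset of" is a comparison and the
common repertoire of two strings is just the larger rank. The enum and function
names below are invented for illustration and are not existing perl flags.
UTF-8 is left out because the chain above does not place it.

```c
#include <assert.h>

/* Hypothetical repertoire ranks, ordered so that each value is a subset
   of every larger one, following the chain stated above. */
typedef enum {
    REP_ASCII = 0,   /* 0..7f */
    REP_NATIVE,      /* the interpreter-wide single-byte encoding */
    REP_WCHAR,       /* wchar_t */
    REP_UNICODE,     /* full Unicode */
    REP_COUNT        /* number of ranks, not a repertoire itself */
} Repertoire;

/* Nonzero if every character of `sub` is also available in `super`. */
static int rep_is_subset(Repertoire sub, Repertoire super)
{
    assert(sub < REP_COUNT && super < REP_COUNT);
    return sub <= super;
}

/* Smallest repertoire able to hold strings of both kinds. */
static Repertoire rep_common(Repertoire a, Repertoire b)
{
    assert(a < REP_COUNT && b < REP_COUNT);
    return a > b ? a : b;
}
```

Compare this with the pointer-per-string model: there is no search for a third
encoding, because with a total order one of the two operands is always the
superset.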
The "Native-single-byte" case would have one encoding object, global to the
interpreter - not just iso8859-1 - basically the one that LC_CTYPE
gives the "right answers" for - though how the "£!$^¬!*% one is supposed
to find that out is beyond me - so we would presumably invert that
and use the Unicode CTYPE-oid stuff to do isALPHA() etc.
--
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.