Re: String representation

Nick Ing-Simmons Mon, 18 Dec 2000 07:09:41 -0800
Simon Cozens <[EMAIL PROTECTED]> writes:
>
>So, before we start even thinking about what we need, it's time to look at the
>vexed question of string representation. How do we do Unicode without getting
>into the horrendous non-Latin1 cockups we're seeing on p5p right now? 

Well - my theorist's answer is that everything is Unicode - like Java.
As I pointed out on p5p even EBCDIC machines can use that model - but 
the downside is that ord('A') == 65, which breaks backward compatibility 
with EBCDIC scripts. 

If perl5.7+ EBCDIC continues down its alternate road
and we need to be able to translate perl5 -> perl6, I strongly suspect 
that perl6 cannot use the "java-oid" model either, as the programmer's
intent will not be obvious enough to auto-translate.
I still haven't grasped what the current EBCDIC "model as seen by perl
programmer" _is_.

>Larry
>suggested aeons ago that everything is an array of numbers, and Perl shouldn't
>care what those numbers represent. But at some point, it has to, and that
>means things have to be tagged with their character repertoires and encodings.

Tagging a string with a repertoire and encoding is horrible - you are aware 
of the trickiness of even getting the SvUTF8 bit "right". To have 
a general representation carried around we need a pointer rather than just a bit
and we cannot say 
   if (SvUTF8(sv))

we have to say 

   if (SvENCODING(sv)->some_predicate)

e.g. 

   if (SvENCODING(sv_a) != SvENCODING(sv_b))
    {
     if (SvENCODING(sv_a)->is_superset_of(SvENCODING(sv_b)))
      {
       /* a's encoding already covers b: bring b up to it */
       sv_upgrade_to(sv_b,SvENCODING(sv_a));
      }
     else if (SvENCODING(sv_b)->is_superset_of(SvENCODING(sv_a)))
      {
       /* b's encoding already covers a: bring a up to it */
       sv_upgrade_to(sv_a,SvENCODING(sv_b));
      }
     else
      {
       /* neither covers the other: find a third encoding that covers both */
       Encoding *x = find_superset_encoding(SvENCODING(sv_a),SvENCODING(sv_b));
       sv_upgrade_to(sv_a,x);
       sv_upgrade_to(sv_b,x);
      }
    } 
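
For concreteness, here is a minimal sketch of what the pointer-per-string
scheme above might look like in plain C, where the method-style call becomes
an explicit function pointer. Everything in it is an assumption made for
illustration: Encoding, SV_sketch, level_superset and unify are hypothetical
names, not perl API, and the "upgrade" only retags the value where a real one
would also have to transcode the bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding descriptor; none of these names exist in perl. */
typedef struct Encoding Encoding;
struct Encoding {
    const char *name;
    unsigned    level;  /* stand-in for real repertoire data */
    int       (*is_superset_of)(const Encoding *self, const Encoding *other);
};

/* Hypothetical string value that carries a pointer, not a flag bit. */
typedef struct {
    const Encoding *enc;
} SV_sketch;

/* Toy predicate: a higher level covers every lower one. */
static int level_superset(const Encoding *self, const Encoding *other)
{
    return self->level >= other->level;
}

static const Encoding enc_ascii  = { "ascii",  0, level_superset };
static const Encoding enc_latin1 = { "latin1", 1, level_superset };

/* Retag both values to one encoding and return it.
   Returns NULL when neither side covers the other; that is where the
   third branch above would go looking for a common superset. */
static const Encoding *unify(SV_sketch *a, SV_sketch *b)
{
    assert(a && b && a->enc && b->enc);
    if (a->enc == b->enc)
        return a->enc;
    if (a->enc->is_superset_of(a->enc, b->enc)) {
        b->enc = a->enc;
        return a->enc;
    }
    if (b->enc->is_superset_of(b->enc, a->enc)) {
        a->enc = b->enc;
        return b->enc;
    }
    return NULL;
}
```

Even in this toy form, every operation that combines two strings has to go
through an indirect call or two before it can touch the data.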

Personally I would not use such a beast.

The only sane compromise I can imagine is close to what we have at the 
moment with maybe a few extra special cases in the "flags" bits:
   ASCII only           (0..7f)
   Native-single-byte   (iso8859-x, IBM1047)
   wchar_t 
   UTF-8
   UNICODE

There needs to be a hierarchy of _repertoires_ such that:

ASCII is subset of Native is subset of wchar_t is subset of UNICODE.
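
A rough sketch of why a strict chain like this is cheap to work with, assuming
the repertoires are numbered in the order given (the enum and function names
are hypothetical, not anything in perl). UTF-8 is left out because the chain
above does not place it; it is an encoding of the Unicode repertoire rather
than a repertoire of its own.

```c
#include <assert.h>

/* Hypothetical repertoire tags, ordered so that a larger value is a
   superset of every smaller one: ASCII < Native < wchar_t < Unicode. */
typedef enum {
    REP_ASCII   = 0,
    REP_NATIVE  = 1,
    REP_WCHAR   = 2,
    REP_UNICODE = 3
} repertoire;

/* Because the repertoires form a chain, "is a superset of" is a
   plain integer comparison. */
static int rep_is_superset_of(repertoire a, repertoire b)
{
    return a >= b;
}

/* The repertoire both operands must be upgraded to before they are
   combined: simply the larger of the two tags. */
static repertoire rep_common(repertoire a, repertoire b)
{
    repertoire r = (a >= b) ? a : b;
    assert(rep_is_superset_of(r, a) && rep_is_superset_of(r, b));
    return r;
}
```

With this ordering, rep_common(REP_ASCII, REP_WCHAR) is REP_WCHAR, and no
search for a third "superset" encoding is ever needed.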


The "Native-single-byte" case would have one encoding object, global to
the interpreter - not just iso8859-1, but basically the one that LC_CTYPE
gives the "right answers" for - though how the "£!$^¬!*% one is supposed 
to find that out is beyond me - so we would presumably invert that 
and use the Unicode CTYPE-oid stuff to do isALPHA() etc.

-- 
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.