Re: Unicode handling

Dan Sugalski Tue, 27 Mar 2001 09:20:54 -0800
At 08:37 PM 3/26/2001 +0000, [EMAIL PROTECTED] wrote:
>Damien Neil <[EMAIL PROTECTED]> writes:
> >> >So $c = chr(ord($c)) could change $c?  That seems odd.
> >>
> >> It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC)
> >> but not its "fundamental" 'LATIN CAPITAL LETTER A'-ness.
> >> Then of course someone will want it to be the number 0x41 and not do
> >> that 'cos they are using chr/ord to mess with JPEG image data...
> >> So there needs to be a 'binary' encoding which they can use.
> >
> >That doesn't seem to be what Dan was saying, however.
>
>And Dan is the one "in charge" on this list - so my perl5.7-ish view
>may be wrong.

"In charge" is such a strong phrase. (And not what I thought the job 
originally was, but that's a separate issue...)

> >It would make
> >perfect sense to me for chr(ord($c)) to return $c in a different
> >encoding.  (Assuming, of course, that $c is a single character.)
> >
> >Assume ord is dependent on the current default encoding.
> >
> >  use utf8; # set default encoding.
> >  my $e : ebcdic = 'a';
> >  my $u = chr(ord($e));
> >
> >If ord is dependent on the current default encoding, I would expect
> >the above to leave the UTF-8 string "a" in $u.  This makes sense to
> >me.
>
>Good.

I'm afraid this isn't what I'd normally think of--ord to me returns the 
integer value of the first code point in the string. That does mean that A 
is different for ASCII and EBCDIC, but that's just One Of Those Things.
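
A minimal sketch of that reading of ord, written in Python because the Perl syntax in this thread is still hypothetical. The helper name raw_ord is made up for illustration, and Python's cp037 codec stands in for "EBCDIC" (one of several EBCDIC code pages):

```python
# Model of "ord returns the integer value of the first code point in the
# string", taken in the string's own encoding with no conversion.
# raw_ord is a hypothetical helper; cp037 is an assumed EBCDIC code page.
def raw_ord(text: str, encoding: str) -> int:
    """Return the first stored byte of `text` when held in `encoding`."""
    return text.encode(encoding)[0]

ascii_A = raw_ord("A", "ascii")    # 0x41
ebcdic_A = raw_ord("A", "cp037")   # 0xC1
ebcdic_a = raw_ord("a", "cp037")   # 0x81

print(hex(ascii_A), hex(ebcdic_A), hex(ebcdic_a))
```

The same letter yields a different integer depending only on how the string is tagged, which is the "One Of Those Things" cost of this approach.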

The alternative is for us to do data conversions sometimes (when we're
pulling data out of an EBCDIC or Shift-JIS string in a Unicode block) but
not others (when we're pulling binary data out in a Unicode or EBCDIC
block). That seems a little off to me, but I could well be wrong. It also
means we may well mangle data that's incorrectly tagged--if, for example, an
input filter tagged binary data with a non-binary type, which isn't that
unlikely.
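
To make the mangling risk concrete, here is a small Python sketch. It assumes an input filter has wrongly tagged four bytes of binary data as EBCDIC (Python's cp037 codec) and that a later implicit conversion re-encodes them as Latin-1; both choices are illustrative, not anything proposed in this thread:

```python
# First four bytes of a typical JPEG (JFIF) file: binary data, not text.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0])

# Assumption: the data was mis-tagged as EBCDIC (cp037), and an implicit
# conversion to Latin-1 happens when a character is pulled out.
converted = raw.decode("cp037").encode("latin-1")

# With a 'binary' tag there is no conversion and the bytes survive.
untouched = bytes(raw)

print(raw.hex(), converted.hex(), untouched.hex())
```

The converted bytes have the same length but different values, so the image header is silently corrupted; without conversion the data is unaffected by the wrong tag.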

> >If ord is dependent on the encoding of the string it gets, as Dan
> >was saying, then ord($e) is 0x81,
>
>It could still be 0x81 (from ebcdic) with the encoding carried
>along with the _number_ if we thought that worth the trouble.
>(It isn't too bad for assignment but is far from clear
>    what
>      2 (ebcdic) * 0xA1(iso_8859_7)
>might mean - perhaps we drop the tag if anything other than + or - happens.)

Or what we do with it if it's stringified. The only thing I can see keeping 
the tag around for would be later chr() and pack() calls, and that doesn't 
seem like it'd happen often enough to justify the overhead. Could be wrong, 
though.
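
For what it's worth, the tagged-number idea can be sketched in a few lines of Python. Everything here (the TaggedInt class, keeping the left operand's tag on + and -, dropping it on * and on stringification) is an assumption made for illustration, not a design anyone has settled on:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TaggedInt:
    """Hypothetical integer that remembers which encoding it came from."""
    value: int
    encoding: Optional[str] = None

    def __int__(self) -> int:
        return self.value

    def __add__(self, other) -> "TaggedInt":
        # + keeps the left operand's tag (an arbitrary choice).
        return TaggedInt(self.value + int(other), self.encoding)

    def __sub__(self, other) -> "TaggedInt":
        # - keeps the left operand's tag as well.
        return TaggedInt(self.value - int(other), self.encoding)

    def __mul__(self, other) -> "TaggedInt":
        # Anything other than + or - drops the tag.
        return TaggedInt(self.value * int(other), None)

    def __str__(self) -> str:
        # Stringifying yields the bare number; the tag is lost.
        return str(self.value)

a = TaggedInt(0x81, "cp037")
print((a + 1).encoding, (a * 2).encoding, str(a))
```

Every arithmetic operator has to decide what happens to the tag, which gives a feel for the overhead in question.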

> >and $u is "\x81".  This seems
> >strange.
> >
> >Hmm.  It suddenly occurs to me that I may have been misinterpreting:
> >ord is dependent on both the encoding of its argument (to determine
> >the logical character contained in that argument) and the current
> >default encoding (to determine the value in the current character set
> >representing that character).

That wasn't my intention. I was thinking that chr was bound to the current 
default encoding, and ord was bound to the string type of the scalar being 
ord-ed.
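
A rough Python model of that split, applied to the earlier example of an EBCDIC 'a' with a UTF-8 default. TaggedString, ord_, chr_ and the SINGLE_BYTE set are all hypothetical names, and cp037 is again an assumed stand-in for EBCDIC:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedString:
    """Hypothetical scalar: raw bytes plus the encoding they are tagged with."""
    data: bytes
    encoding: str

# Assumption: for these encodings one character is exactly one byte.
SINGLE_BYTE = {"ascii", "latin-1", "cp037"}

def ord_(s: TaggedString) -> int:
    """ord bound to the string's own type: no conversion is done."""
    if s.encoding in SINGLE_BYTE:
        return s.data[0]
    return ord(s.data.decode(s.encoding)[0])

def chr_(n: int, default_encoding: str) -> TaggedString:
    """chr bound to the default encoding: n is a code point in that encoding."""
    if default_encoding in SINGLE_BYTE:
        return TaggedString(bytes([n]), default_encoding)
    return TaggedString(chr(n).encode(default_encoding), default_encoding)

e = TaggedString("a".encode("cp037"), "cp037")   # EBCDIC 'a'
u = chr_(ord_(e), "utf-8")                       # default encoding: UTF-8

print(hex(ord_(e)), u.data, u.encoding)
```

Under this model ord gives 0x81 and the result is the UTF-8 string "\x81", not "a", so chr(ord($c)) does not round-trip across encodings.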

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk