Re: Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?

2010-10-29 Thread Dan Muey
On Oct 29, 2010, at 2:30 AM, Aristotle Pagaltzis wrote:

> * Dan Muey  [2010-10-28 21:55]:
>> For example, note the differences in output between a unicode
>> string and a byte string regarding character 257, as a unicode
>> string it is 257, as a byte string it is 196.
> 
> That is not what’s going on.
> 
> $ perl -E'say ord "1234"'
> 49
> 
> When you pass a multi-character string to `ord`, you get the code
> point of the first character.

Thank you for clarifying what I was highlighting. 

> You are missing the rest of the bytes from the UTF-8 encoding.
> 
> You are losing data.

Thanks, I do understand that and appreciate you expounding it for me further. 
Allow me to explain why this question came up:

I am using Scalar::Quote on byte strings and it uses ord() to determine if it 
will use byte string grapheme notation (e.g. \xE3\x8A\xB7) or unicode string 
notation (e.g. \x{32B7}).

multivac:~ dmuey$ perl -MScalar::Quote=Q -E 'say Q("Perl is the ㊷™");'
"Perl is the \xe3\x8a\xb7\xe2\x84\xa2"
multivac:~ dmuey$ 

multivac:~ dmuey$ perl -E 'say "Perl is the \xe3\x8a\xb7\xe2\x84\xa2";'
Perl is the ㊷™
multivac:~ dmuey$

It appears to do what I need, assuming two things:
 a) the string is a byte string
 (unlike, e.g., perl -MScalar::Quote=Q -E 'say Q("Perl is the \x{32b7}\x{2122}");')
 b) we are not under "use utf8"
 (unlike, e.g., perl -MScalar::Quote=Q -E 'use utf8; say Q("Perl is the ㊷™");')

I just wanted to verify that its use of ord() in its logic wouldn't
unexpectedly result in me getting back \x{32B7} under some weird circumstance
I overlooked.
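
To make the concern concrete, here is a minimal sketch of an ord()-driven escaper. The sub `escape_by_ord` is hypothetical and is not Scalar::Quote's actual code; it only assumes the general approach described above (choose the notation from the value ord() returns). Since every element of a byte string is in the range 0x00..0xFF, the \x{...} branch should only be reachable when the input is a character string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, NOT Scalar::Quote's implementation:
# escape everything outside printable ASCII, choosing the notation from ord().
sub escape_by_ord {
    my ($str) = @_;
    return join '', map {
        my $cp = ord($_);
        $cp > 0xFF                   ? sprintf('\x{%x}', $cp)   # wide character
      : ($cp < 0x20 || $cp > 0x7E)   ? sprintf('\x%02x', $cp)   # single byte
      :                                $_;                      # printable ASCII
    } split //, $str;
}

my $chars = "\x{32b7}\x{2122}";        # character string: two code points above 0xFF
my $bytes = encode('UTF-8', $chars);   # byte string: six elements, each 0x00..0xFF

print escape_by_ord($chars), "\n";     # prints \x{32b7}\x{2122}
print escape_by_ord($bytes), "\n";     # prints \xe3\x8a\xb7\xe2\x84\xa2
```

Under these assumptions, the \x{32b7} form only shows up if the string was decoded somewhere along the way (for example by "use utf8" or an explicit decode).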

Thanks again, everyone. I really appreciate it!

--
Dan Muey

Re: Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?

2010-10-29 Thread Aristotle Pagaltzis
* Dan Muey  [2010-10-28 21:55]:
> For example, note the differences in output between a unicode
> string and a byte string regarding character 257, as a unicode
> string it is 257, as a byte string it is 196.

That is not what’s going on.

$ perl -E'say ord "1234"'
49

When you pass a multi-character string to `ord`, you get the code
point of the first character.

$ perl -E'say chr 49'
1

In your case you get 196. That is 0xC4, or the character Ä. It is
not the character ā (U+0101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8
sequence that encodes the character 257. You are passing a string
containing a representation of those bytes as two characters to
`ord`, and `ord` is giving you the code point of the first
byte-as-character.
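
A small sketch of the difference, using the core Encode module; the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{101}";                # one character, code point 257
my $bytes = encode('UTF-8', $char);   # its UTF-8 encoding: two bytes

printf "characters: %d, ord: %d\n", length($char),  ord($char);    # 1, 257
printf "bytes:      %d, ord: %d\n", length($bytes), ord($bytes);   # 2, 196

# ord() only looked at the first byte; this shows the whole sequence.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
print "all bytes:  $hex\n";                                        # C4 81
```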

You are missing the rest of the bytes from the UTF-8 encoding.

You are losing data.

If you try this on more code points you will find that there are
*lots* of different characters that are reported as 196 – because
they get encoded as multi-byte sequences that all start with the
byte value 0xC4.
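
As a rough check of that claim, the following sketch scans an arbitrarily chosen range (U+0100 to U+017F) and collects the code points whose UTF-8 encoding begins with 0xC4:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Keep every code point in the scanned range whose first UTF-8 byte is 0xC4.
my @lead_c4 = grep { ord(encode('UTF-8', chr($_))) == 0xC4 } 0x100 .. 0x17F;

printf "%d code points start with 0xC4: U+%04X .. U+%04X\n",
    scalar(@lead_c4), $lead_c4[0], $lead_c4[-1];
# prints: 64 code points start with 0xC4: U+0100 .. U+013F
```

Calling `ord` on the encoded bytes reports 196 for all 64 of them.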

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined 
wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis //