BÁRTHÁZI András wrote:

Hi,

>> This code:
>>
>> my $a='A';
>> $a ~~ s:perl5:g/A/{chr(65535)}/;
>> say $a.bytes;
>>
>> Outputs "0". Why?
>
>
> \uFFFF is not a legal Unicode codepoint. chr(65535) should raise an exception of some type. So the above code does seem to show a possible bug. But as chr(65535) is an undefined character, who knows what the code is actually doing.



In my opinion (which may be wrong), \uFFFF can be stored as a UTF-8 character; it should be 0xEF~0xBF~0xBF. If I do it outside the regexp (I mean "say chr(65535).bytes"), it works well.
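
To make the byte sequence above easy to check, here is a minimal sketch in Python (not Pugs) of the three-byte UTF-8 layout. The helper name utf8_three_bytes is hypothetical and only covers the 0x0800-0xFFFF range; it says nothing about whether the value is a legal character.

```python
def utf8_three_bytes(cp):
    """Encode a value in 0x0800..0xFFFF using the 3-byte UTF-8 layout.

    Hypothetical helper for illustration: it applies the bit pattern
    1110xxxx 10xxxxxx 10xxxxxx and does no validity checking beyond range.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("value does not use the 3-byte UTF-8 form")
    return bytes([
        0xE0 | (cp >> 12),          # leading byte: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # continuation byte: middle 6 bits
        0x80 | (cp & 0x3F),         # continuation byte: low 6 bits
    ])


manual = utf8_three_bytes(0xFFFF)
builtin = chr(0xFFFF).encode("utf-8")  # Python's own encoder, for comparison

print(" ".join(f"{b:02X}" for b in manual))  # EF BF BF
print(len(builtin))                          # 3
```

Both routes give the same three bytes, so the storage format itself is not the obstacle here.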


Another "bug" I've found is not related to regexps, but it still concerns Unicode characters:

  say chr(0x10FFFF).bytes;

The answer:

  pugs: encodeUTF8: ord returned a value above 0x10FFFF

And if I start to increment $b, I will get:

  pugs: Prelude.chr: bad argument

I don't understand it, as I thought that Unicode characters are in the range 0x00000000-0x7FFFFFFF. Does Haskell not support the whole set?

There is a Unicode encoding called UCS-2 that only covers 0x0000-0xFFFF, but that still does not answer the question.
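
As a point of comparison, the sketch below probes the same boundary in Python rather than in Pugs or Haskell. It assumes only the standard built-in chr and the standard codecs; the constant name MAX_CODE_POINT is introduced here for illustration. The Unicode code space ends at 0x10FFFF, so a chr that refuses larger values is not necessarily missing anything.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # illustrative name for the top of the Unicode code space

top = chr(MAX_CODE_POINT)
utf8 = top.encode("utf-8")       # four bytes in UTF-8
utf16 = top.encode("utf-16-be")  # a surrogate pair, too wide for one 16-bit unit

# One past the top of the code space is rejected by Python's chr.
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range = False
except ValueError:
    out_of_range = True

print(hex(sys.maxunicode))                  # 0x10ffff
print(" ".join(f"{b:02X}" for b in utf8))   # F4 8F BF BF
print(" ".join(f"{b:02X}" for b in utf16))  # DB FF DF FF
print(out_of_range)                         # True
```

The four UTF-16 bytes are also why a plain 16-bit UCS-2 value cannot hold anything above 0xFFFF.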

[...]

Meanwhile, I've found this:
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2175.htm

It may be the answer to my question.

Yes, the value 0xFFFF can be stored as either a 3-byte UTF-8 string or a 2-byte UCS-2 value, but the Unicode standard specifically says that the values 0xFFFF, 0xFFFE and 0xFEFF are NOT valid codepoints and should never appear in a Unicode string. 0xFFFF is reserved for out-of-band signaling (such as the -1 returned by getc()), and 0xFFFE and 0xFEFF are specifically reserved for out-of-band marking of a UCS-2 file as being either big-endian or little-endian, but are specifically not considered part of the data. chr() is currently defined to mean "convert an int value to a Unicode codepoint". That's why I said that chr(65535) should raise an exception: it's an argument error similar to sqrt(-1).
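
A rough sketch of that behaviour, written in Python rather than Pugs: strict_chr and is_noncharacter are hypothetical names, and the rejected set used here (U+FDD0..U+FDEF plus every value whose low 16 bits are FFFE or FFFF) is an assumption about which arguments such a chr() would refuse. Python's real chr() accepts these values.

```python
import math


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for any value ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strict_chr(cp):
    """Hypothetical chr() that treats reserved values as argument errors."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp:#x} is outside the Unicode code space")
    if is_noncharacter(cp):
        raise ValueError(f"U+{cp:04X} is reserved and is not a character")
    return chr(cp)


def raises_value_error(func, arg):
    """Report whether func(arg) raises ValueError."""
    try:
        func(arg)
    except ValueError:
        return True
    return False


print(strict_chr(65))                         # A
print(raises_value_error(strict_chr, 65535))  # True
print(raises_value_error(math.sqrt, -1))      # True, the analogous domain error
```

Whether the error should come from chr() itself or from a later validation step is a design choice; the sketch only shows the first option.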


--
[EMAIL PROTECTED]
[EMAIL PROTECTED]
