Re: [pugs] regexp "bug"?

mark . a . biggar Fri, 15 Apr 2005 10:13:08 -0700

Isn't that what the difference between byte-level and codepoint-level access to 
strings is all about?  If you want to work with values that are illegal 
codepoints, then you should be working at the byte level, not the 
codepoint level, at least by default.
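
To make the distinction concrete, here is a rough sketch in Python rather than Perl. It is an illustration, not a proposal for how chr() should be implemented. The names strict_chr, utf8_bytes and REJECTED are made up for this example, and the set of rejected values is an assumption limited to 0xFFFE and 0xFFFF. Python's own chr() accepts both values, which is why the strict check is written by hand.

```python
# Hypothetical sketch: a strict codepoint-level constructor plus a
# byte-level escape hatch. None of these names come from Perl or Pugs.

# Assumption: reject only the two values at the end of the BMP.
REJECTED = frozenset({0xFFFE, 0xFFFF})


def strict_chr(value):
    """Codepoint-level access: refuse values treated as illegal."""
    if value in REJECTED:
        raise ValueError("U+%04X is not a valid codepoint" % value)
    return chr(value)


def utf8_bytes(value):
    """Byte-level access: no validity check, just the encoded octets."""
    return chr(value).encode("utf-8")


if __name__ == "__main__":
    print(strict_chr(65))        # an ordinary codepoint passes through
    print(utf8_bytes(0xFFFF))    # the 3-byte UTF-8 form is still reachable
    try:
        strict_chr(0xFFFF)
    except ValueError as err:
        print("rejected:", err)
```

The point of the split is that the default constructor fails loudly, while anyone who really needs the raw value can still get at it by asking for bytes explicitly.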


--
Mark Biggar
[EMAIL PROTECTED]


> On Fri, Apr 15, 2005 at 12:56:14AM -0700, Mark A. Biggar wrote:
> : Yes, the value 0xFFFF can be stored as either a 3-byte UTF-8 string or a 
> : 2-byte UCS-2 value, but the Unicode standard specifically says that the 
> : values 0xFFFF, 0xFFFE and 0xFEFF are NOT valid codepoints and should 
> : never appear in a Unicode string.  0xFFFF is reserved for out-of-band 
> : signaling (such as the -1 returned by getc()) and 0xFFFE and 0xFEFF are 
> : specifically reserved for out-of-band marking of a UCS-2 file as being 
> : either big-endian or little-endian, but are specifically not considered 
> : part of the data.  chr() is currently defined to mean convert an int 
> : value to a Unicode codepoint. That's why I said that chr(65535) should 
> : return an exception, it's an argument error similar to sqrt(-1).
> 
> It has to at least be possible to Think Bad Thoughts in Perl.
> It doesn't have to be the default, though.  But there has to be
> some way of allowing illegal characters to be talked about, or
> you can't write programs that talk about them.  It's like saying
> it's okay to be an executioner as long as you don't kill anyone...
> 
> Larry
