Re: Don't use the \C escape in regexes - Why not?

Michael Ludwig Tue, 04 May 2010 02:42:32 -0700

On 04.05.2010 at 11:09, Gisle Aas wrote:

> I regret that I let \C sneak into the URI module.  Now we have an interface 
> that depends on the internal UTF-8 flag of the strings passed in.


Does it? How so? If it's a byte string, well, it's a byte string, and \C 
doesn't change that. If, on the other hand, it's a text string, \C forces byte 
semantics upon it. Isn't that what you want to do in that function? (Okay, 
there's no spec for that function, so I don't really know what you want to do.) 
But doesn't the function return the same result regardless of the UTF-8 flag 
being set or not? As demonstrated by this test script:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode;
binmode STDOUT, ':utf8'; # terminal UTF-8
my $text   = 'Käse'; # all characters below 256
my $bytes  = encode_utf8 $text;
my $text2  = 'Jiří'; # some characters above 255
my $bytes2 = encode_utf8 $text2;
# print the first code point (hex) and the value of each string and each \C match
printf "%x %s\n", ord $_, $_ for
    $text,
    $text =~ m/(\C)/g,
    $bytes,
    $bytes =~ m/(\C)/g,
    $text2,
    $text2 =~ m/(\C)/g,
    $bytes2,
    $bytes2 =~ m/(\C)/g;
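
For comparison, here is a minimal sketch of a case the script above may not exercise: the same text string held once without and once with the internal UTF-8 flag. It deliberately avoids \C and uses bytes::length to look at the internal representation instead. The variable names are illustrative only, and this is not meant to reproduce what the URI function does.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Illustrative only: the same four characters stored two ways.
my $latin1 = "K\x{e4}se";       # no UTF-8 flag; the a-umlaut is one internal byte
my $upgraded = $latin1;
utf8::upgrade($upgraded);       # UTF-8 flag on; the a-umlaut is two internal bytes

# As text, the two strings are indistinguishable.
my $same_text = $latin1 eq $upgraded ? 1 : 0;
my $chars_a   = length $latin1;
my $chars_b   = length $upgraded;

# The internal byte counts differ; byte-level matching would see this.
my $bytes_a = bytes::length($latin1);
my $bytes_b = bytes::length($upgraded);

printf "equal as text: %d\n", $same_text;
printf "characters: %d vs %d\n", $chars_a, $chars_b;
printf "internal bytes: %d vs %d\n", $bytes_a, $bytes_b;
```

If the two byte counts differ while the strings compare equal, any code that looks at internal bytes can give different answers for what is, as text, the same input.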


> This makes it very hard to explain, makes it not do what you want when 
> different types of strings are combined and makes it hard to fix in ways that 
> don't break some code.

Could you provide an example of how this might not do what you want when 
different types of strings are combined?

> My plan for fixing this is to introduce URI::IRI with an interface that 
> encodes all non-URI characters as percent-encoded UTF-8 and live with the 
> inconsistency for URI (until Perl redefines what \C means).


-- 
Michael.Ludwig (#) XING.com
