On 04.05.2010 at 11:09, Gisle Aas wrote:
> I regret that I let \C sneak into the URI module. Now we have an interface
> that depends on the internal UTF-8 flag of the strings passed in.
Does it? How so? If it's a byte string, well, it's a byte string, and \C
doesn't change that. If, on the other hand, it's a text string, \C forces byte
semantics upon it. Isn't that what you want to do in that function? (Okay,
there's no spec for that function, so I don't really know what you want to do.)
But doesn't the function return the same result regardless of the UTF-8 flag
being set or not? As demonstrated by this test script:
use strict;
use warnings;
use utf8;                      # source in UTF-8
use Encode;

binmode STDOUT, ':utf8';       # terminal UTF-8

my $text   = 'Käse';           # all characters below 256
my $bytes  = encode_utf8 $text;
my $text2  = 'Jiří';           # some characters above 255
my $bytes2 = encode_utf8 $text2;

# For each string, print the string itself and then every byte \C matches.
printf "%x %s\n", ord $_, $_ for
    $text,
    $text =~ m/(\C)/g,
    $bytes,
    $bytes =~ m/(\C)/g,
    $text2,
    $text2 =~ m/(\C)/g,
    $bytes2,
    $bytes2 =~ m/(\C)/g;
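The same comparison can be made self-checking and without \C, roughly as in the sketch below. Note that internal_octets is a hypothetical helper name of my own (it is not part of URI or Encode), and the sketch assumes the file is saved as UTF-8 with the literals in precomposed form.

```perl
use strict;
use warnings;
use utf8;                       # source in UTF-8
use Encode qw(encode_utf8);

# Hypothetical helper (not part of URI or Encode): hex dump of the octets
# Perl stores internally for a string, i.e. what \C would step through.
sub internal_octets {
    my ($s) = @_;                               # work on a copy
    utf8::encode($s) if utf8::is_utf8($s);      # expose internal UTF-8 as plain octets
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

my $text   = 'Käse';
my $bytes  = encode_utf8($text);
my $text2  = 'Jiří';
my $bytes2 = encode_utf8($text2);

# Compare each text string with its UTF-8 encoded byte string.
for my $pair ([$text, $bytes], [$text2, $bytes2]) {
    my ($from_text, $from_bytes) = map { internal_octets($_) } @$pair;
    print "$from_text\n",
        ($from_text eq $from_bytes ? "same octets\n" : "octets differ: $from_bytes\n");
}
```

For these four strings I would expect "same octets" both times, which is the point the script above is making. It only covers text strings that already carry the UTF-8 flag, so it says nothing about other cases.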
> This makes it very hard to explain, makes it not do what you want when
> different types of strings are combined, and makes it hard to fix in ways that
> don't break some code.
Could you provide an example of how this might not do what you want when
different types of strings are combined?
> My plan for fixing this is to introduce URI::IRI with an interface that
> encodes all non-URI characters as percent-encoded UTF-8 and live with the
> inconsistency for URI (until Perl redefines what \C means).
--
Michael.Ludwig (#) XING.com