Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Aristotle Pagaltzis
* Michael Ludwig  [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of
> characters, they're also meant to go over the wire, and there
> are no characters on the wire, only bytes. There is no standard
> encoding defined for the wire, although UTF-8 has come to be
> seen as the standard encoding for URIs containing non-ASCII
> characters. Perl having two standard encodings (UTF-8 and
> ISO-8859-1) for text and relying on the internal flag to tell
> which one is meant to matter, shouldn't the URI module either
> only accept bytes or only characters? Or rather, provide two
> different constructors instead of only one trying to be
> intelligent?
>
>  URI->bytes( $bytes ); # byte string
>  URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for
> serialization.

Yes, exactly. And both methods would use the moral equivalent of
a plain `split //` – no trickery such as with `\C`. The only
difference between them is that the `chars` method would
`encode_utf8` the string first and then encode it blindly,
whereas the `bytes` method would leave it as is but then croak if
it found a codepoint > 0xFF (since the string is supposed to
represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the
UTF8 flag of the string.
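
A minimal sketch of what the two methods could look like, written here as
plain functions. The names `escape_chars`, `escape_bytes` and
`_percent_encode_octets` are hypothetical and not part of the URI
distribution, and the set of characters left unescaped (the RFC 3986
unreserved set) is an assumption made for the example.

```perl
use strict;
use warnings;
use Carp   qw(croak);
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode a string in which every
# character already stands for exactly one octet (0x00..0xFF).
sub _percent_encode_octets {
    my ($octets) = @_;
    my @out;
    for my $c ( split //, $octets ) {    # plain split, no \C
        push @out, $c =~ /\A[A-Za-z0-9\-._~]\z/
            ? $c
            : sprintf( '%%%02X', ord $c );
    }
    return join '', @out;
}

# Character string in: encode to UTF-8 first, then escape blindly.
sub escape_chars {
    my ($chars) = @_;
    return _percent_encode_octets( encode_utf8($chars) );
}

# Byte string in: leave as is, but refuse anything above 0xFF.
sub escape_bytes {
    my ($bytes) = @_;
    croak 'Code point above 0xFF in byte string'
        if $bytes =~ /[^\x00-\xFF]/;
    return _percent_encode_octets($bytes);
}

print escape_chars("K\xE4se"), "\n";    # K%C3%A4se
print escape_bytes("K\xE4se"), "\n";    # K%E4se
```

Neither function inspects the UTF8 flag, so upgrading or downgrading the
argument beforehand does not change the result.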

Regards,
-- 
Aristotle Pagaltzis // 


Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Gisle Aas
I regret that I let \C sneak into the URI module.  Now we have an interface 
that depends on the internal UTF-8 flag of the strings passed in.  This makes it 
very hard to explain, makes it not do what you want when different types of 
strings are combined and makes it hard to fix in ways that don't break some 
code.  My plan for fixing this is to introduce URI::IRI with an interface that 
encodes all non-URI characters as percent-encoded UTF-8 and live with the 
inconsistency for URI (until Perl redefines what \C means).

--Gisle


On May 3, 2010, at 20:34, Michael Ludwig wrote:

> "Don't use the \C escape in regexes" - taken from Juerd's Unicode Advice page:
> 
>  http://juerd.nl/site.plp/perluniadvice
> 
> Why not?
> 
> -- perldoc perlre:
> \C  Match a single C char (octet) even under Unicode.
>     NOTE: breaks up characters into their UTF-8 bytes,
>     so you may end up with malformed pieces of UTF-8.
>     Unsupported in lookbehind.
> 
> -- URI::Escape
> sub escape_char {
>     return join '', @URI::Escape::escapes{$_[0] =~ /(\C)/g};
> }
> 
> The regular expression is used to disassemble an incoming text string into 
> individual bytes (and then use the resulting list in a hash slice). It is a 
> legitimate use case, and the means seems to do the job. What's the problem 
> with the \C escape?
> 
> -- 
> Michael.Ludwig (#) XING.com
> 



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 13:24, Aristotle Pagaltzis wrote:

> * Michael Ludwig  [2010-05-04 13:10]:
>> Is it this (theoretically fragile) implicitness in handling
>> character strings that makes \C a bad idea?
> 
> Yes. It will do different things with semantically identical
> strings whose only difference is whether the UTF8 flag is set,
> i.e. it suffers the same problems that the `bytes` pragma has.

That's not what I meant, but I see your point:

$ cat uri-regex-C.pl 
use strict;
use utf8;
use Encode;
use URI;
use Test::More tests => 2;

my $builder = Test::More->builder;
binmode $builder->$_, ':utf8' for qw/output failure_output todo_output/;
my $txt = 'Käse';
my $iso = encode( 'ISO-8859-1', $txt );

is $iso, $txt, "strings are equal: $iso = $txt";
my $uri_txt = URI->new( $txt );
my $uri_iso = URI->new( $iso );
# URI overloads stringification
is $uri_iso, $uri_txt, "URIs are equal: $uri_iso = $uri_txt";


$ perl uri-regex-C.pl 
1..2
ok 1 - strings are equal: Käse = Käse
not ok 2 - URIs are equal: K%E4se = K%C3%A4se
#   Failed test 'URIs are equal: K%E4se = K%C3%A4se'
#   at uri-regex-C.pl line 16.
#  got: 'K%E4se'
# expected: 'K%C3%A4se'
# Looks like you failed 1 test of 2.

The strings compare equal, but the URIs derived from them don't.

But wait a second: While URIs are meant to be made of characters, they're also 
meant to go over the wire, and there are no characters on the wire, only bytes. 
There is no standard encoding defined for the wire, although UTF-8 has come to 
be seen as the standard encoding for URIs containing non-ASCII characters. Perl 
having two standard encodings (UTF-8 and ISO-8859-1) for text and relying on 
the internal flag to tell which one is meant to matter, shouldn't the URI 
module either only accept bytes or only characters? Or rather, provide two 
different constructors instead of only one trying to be intelligent?

  URI->bytes( $bytes ); # byte string
  URI->chars( $chars ); # character string

And, in addition, define the character encoding used for serialization.

So \C implicitly encodes character strings as UTF-8 (my point), and passes byte 
strings through as they are, which amounts to ISO-8859-1 (Aristotle's point).

The input for URI->new is not specified as either character or byte string, and 
the output of URI->as_string is not specified with regard to a wire encoding. 
(But how could it be if the input is not defined?) The perldoc for URI#new 1.54 
only says:

  The set of characters available for building URI references is
  restricted (see URI::Escape). Characters outside this set are
  automatically escaped by the URI constructor.

http://search.cpan.org/dist/URI/URI.pm

What does Java do? The java.net.URI constructors only accept character strings, 
and the wire encoding has been fixed to UTF-8. To quote:

  A character is encoded by replacing it with the sequence
  of escaped octets that represent that character in the UTF-8
  character set. The Euro currency symbol ('\u20AC'), for example,
  is encoded as "%E2%82%AC". (Deviation from RFC 2396, which does
  not specify any particular character set.)

http://java.sun.com/javase/6/docs/api/java/net/URI.html

So documentation and behaviour are very clear in Java.
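
For comparison, the same fixed-UTF-8 rule can be sketched in Perl with an
explicit encoding step. The function name `percent_encode_utf8` is
hypothetical, and the choice of which characters to leave unescaped (the
RFC 3986 unreserved set) is an assumption for the example; URI::Escape
offers `uri_escape_utf8` for this kind of job.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: always serialize through UTF-8, as
# java.net.URI is documented to do, then percent-encode.
sub percent_encode_utf8 {
    my ($chars) = @_;
    my $octets = encode_utf8($chars);    # explicit, flag-independent
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $octets;
}

print percent_encode_utf8("\x{20AC}"), "\n";    # %E2%82%AC
print percent_encode_utf8("K\xE4se"),  "\n";    # K%C3%A4se
```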

-- 
Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Aristotle Pagaltzis
* Michael Ludwig  [2010-05-04 13:10]:
> Is it this (theoretically fragile) implicitness in handling
> character strings that makes \C a bad idea?

Yes. It will do different things with semantically identical
strings whose only difference is whether the UTF8 flag is set,
i.e. it suffers the same problems that the `bytes` pragma has.
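
A small sketch of the `bytes` side of that comparison. It uses
`bytes::length` rather than `\C` itself, because `\C` was removed from
the regex engine in Perl 5.24; the variable names are made up for the
example.

```perl
use strict;
use warnings;
use bytes ();    # load the module without enabling byte semantics

my $native   = "K\xE4se";    # 4 characters, UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);    # same 4 characters, UTF8 flag on

# Semantically the two strings are identical ...
print "equal\n" if $native eq $upgraded;
print length($native), ' ', length($upgraded), "\n";    # 4 4

# ... but anything that looks at the internal buffer disagrees.
print bytes::length($native), ' ',
      bytes::length($upgraded), "\n";                   # 4 5
```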

Regards,
-- 
Aristotle Pagaltzis // 


Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 13:06, Michael Ludwig wrote:

> Is it this (theoretically fragile) implicitness in handling character strings 
> that makes \C a bad idea?
> 
> But probably not as bad an idea as relying on the default platform encoding 
> in Java ("default charset" in Java API doc lingo), which may be different 
> from country to country and from installation to installation.
> 
> http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

Or, more symmetrically to encoding via \C in Perl:

http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes%28%29

  public byte[] getBytes()
    Encodes this String into a sequence of bytes
    using the platform's default charset, storing
    the result into a new byte array.

Much more serious and real than implicitly encoding via \C in Perl, given the 
fact that Java installations do not all use the same platform encoding, while 
all current Perl installations use the same internal encoding. (All Java 
installations use the same internal encoding of UTF-16, I think, but this fact 
is well hidden from the interface.)

-- 
Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 11:09, Gisle Aas wrote:

> I regret that I let \C sneak into the URI module.


I think I now understand why one might consider \C a bad idea in that method, 
and maybe in general.

How Perl stores character strings internally (as UTF-8 when the UTF8 flag is 
set) is an implementation detail, and you shouldn't rely on it or make any 
assumptions about this technicality. But by using \C to derive an encoded 
version - a byte string - from a character string (and maybe even taking it 
for granted you'll get a UTF-8 byte string), you're tying your interface to an 
implementation detail. And the behaviour of your code will change as soon as 
Perl moves on to use, say, UTF-16 as the internal encoding. (Which is highly 
unlikely, but that's another story.)

Is it this (theoretically fragile) implicitness in handling character strings 
that makes \C a bad idea?

But probably not as bad an idea as relying on the default platform encoding in 
Java ("default charset" in Java API doc lingo), which may be different from 
country to country and from installation to installation.

http://java.sun.com/javase/6/docs/api/java/lang/String.html#String%28byte[]%29

-- 
Michael.Ludwig (#) XING.com



Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Michael Ludwig
On 04.05.2010 at 11:09, Gisle Aas wrote:

> I regret that I let \C sneak into the URI module.  Now we have an interface 
> that depends on the internal UTF-8 flag of the strings passed in.

Does it? How so? If it's a byte string, well, it's a byte string, and \C 
doesn't change that. If, on the other hand, it's a text string, \C forces byte 
semantics upon it. Isn't that what you want to do in that function? (Okay, 
there's no spec for that function, so I don't really know what you want to do.) 
But doesn't the function return the same result regardless of the UTF-8 flag 
being set or not? As demonstrated by this test script:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode;
binmode STDOUT, ':utf8'; # terminal UTF-8
my $text   = 'Käse'; # all characters below 256
my $bytes  = encode_utf8 $text;
my $text2  = 'Jiří'; # some characters above 255
my $bytes2 = encode_utf8 $text2;
printf "%x %s\n", ord $_, $_ for
$text,
$text =~ m/(\C)/g,
$bytes,
$bytes =~ m/(\C)/g,
$text2,
$text2 =~ m/(\C)/g,
$bytes2,
$bytes2 =~ m/(\C)/g;


> This makes it very hard to explain, makes it not do what you want when 
> different types of strings are combined and makes it hard to fix in ways that 
> don't break some code.

Could you provide an example of how this might not do what you want when 
different types of strings are combined?
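
One possible illustration, sketched under assumptions: `internal_octets`
is a hypothetical helper that emulates what `\C` used to see (the raw
internal buffer), since `\C` itself was removed in Perl 5.24. When a
byte string is concatenated with a string containing a character above
0xFF, Perl upgrades the result, and the octets behind the former byte
string change.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the internal buffer, i.e. the
# UTF-8 octets if the UTF8 flag is on, the raw octets otherwise.
sub internal_octets {
    my ($copy) = @_;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return join ' ', map { sprintf( '%02X', ord $_ ) } split //, $copy;
}

my $bytes = "K\xE4se";     # flag off: ä is stored as E4
my $chars = "\x{20AC}";    # flag on:  euro sign

print internal_octets($bytes), "\n";
# 4B E4 73 65

my $joined = $bytes . $chars;    # result is upgraded as a whole
print internal_octets($joined), "\n";
# 4B C3 A4 73 65 E2 82 AC
```

The same ä that would have been escaped as %E4 on its own is seen as
C3 A4 once it has been combined with the other string.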

> My plan for fixing this is to introduce URI::IRI with an interface that 
> encodes all non-URI characters as percent-encoded UTF-8 and live with the 
> inconsistency for URI (until Perl redefines what \C means).


-- 
Michael.Ludwig (#) XING.com