On Thu, Sep 25, 2008 at 5:57 AM, Bill Moseley <[EMAIL PROTECTED]> wrote:
> On Sun, Sep 21, 2008 at 08:36:32AM +0200, Gisle Aas wrote:
>> The issue with dropped chars has been fixed so I don't worry about
>> that.  Just upgrade the URI module.
>>
>> The remaining issue is if $url->query_form should accept Unicode data
>> and automatically UTF-8 encode it as it does now.  When I accepted
>> that patch I thought it would be harmless, as it provides a convenience
>> for some while not changing anything for users
>> that properly encode their data before passing it to this API. What's
>> problematic is that this strengthens the idea that the UTF-8 flag has
>> semantic meaning at the Perl level.  Strings with chars in the range
>> 128-255 behave differently depending on the internal representation.
>> I'm not happy about that.  It's certainly not my idea of a sane
>> Unicode model.
>>
>> To me that leaves two options: either make the URI API strict and only
>> accept args that are bytes (strings that can be utf8::downgraded) or
>> just live with the ugliness of an inconsistent Unicode model and try to
>> document the issues better over time.  I'm leaning towards the latter.
>
> Sorry, kind of got stuck behind work here.
>
> So, in my situation I need to post some utf8 characters.  The service
> I'm using requires an ?encoding=utf8 query parameter to say what
> encoding the text is encoded in.  The post doesn't include
> a charset:
>
>    Content-Type: application/x-www-form-urlencoded
>
> So it seems the server needs to be explicitly told.
>
>
> The problem I had was if I passed in a character string (utf8 flag on)
> then the url-encoding process dropped chars.  You say that has been
> fixed.  I fixed on my side by simply calling encode_utf8 to convert my
> character string into octets.  Then all octets were url-encoded and
> passed to the server and all works fine.

Yes and that will always continue to work.  If you encode the strings
yourself things behave consistently and you select what encoding to
use.
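To make the "encode yourself" approach concrete, here is a minimal sketch of what that looks like with URI and the core Encode module (the host name and parameter are just placeholders):

```perl
use strict;
use warnings;
use URI;
use Encode qw(encode_utf8);

# Encoding the character string to octets *before* calling query_form
# makes the result independent of Perl's internal string representation,
# and lets you pick the encoding explicitly.
my $u = URI->new("http://www.example.com");
$u->query_form(foo => encode_utf8("b\x{E5}l"));   # "bål" as UTF-8 bytes
print $u, "\n";   # http://www.example.com?foo=b%C3%A5l
```

The same pattern works with `Encode::encode('iso-8859-1', ...)` or any other encoding the receiving end expects.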

If you mix encoded (byte) strings and Unicode strings, bad things
happen.  If you only use Unicode strings (and at least one of them
has the utf8-flag set or none of them has chars above 127) you should
get UTF-8 encoded output.

> Now, here's my question.  Could I pass in any byte (octet) string and
> have it url-encoded?

Yes.  URI->query_form just encodes the bytes as-is.

>  Do the url-encoded post parameters have to be of
> a given character encoding or is that just an agreement between the
> sender and receiver?

There certainly has to be agreement between the sender and the
receiver.  I thought the normal behaviour was to encode using the same
encoding as the document the form was embedded in.

> That is, can I encode my character string into any character
> encoding and send it url-encoded?  Then as long as the server
> receiving the post knows how to decoded (using same encoding I used)
> then it would be fine?

Right.
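A small sketch of that point, using only the core Encode module and a hand-rolled percent-encoder (illustrative only; real code would use URI::Escape or query_form):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte-wise percent-encoding: every byte outside the unreserved set
# becomes %XX.  This is what happens to the octets you hand over.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord $1)/ge;
    return $bytes;
}

# The same character string produces different query values depending
# on which byte encoding sender and receiver have agreed on.
my $str = "b\x{E5}l";                                    # "bål"
print percent_encode(encode('iso-8859-1', $str)), "\n";  # b%E5l
print percent_encode(encode('UTF-8',      $str)), "\n";  # b%C3%A5l
```

Both outputs are valid url-encoded data; only the out-of-band agreement (such as an `?encoding=utf8` parameter) tells the server which one to decode with.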

> If that's the case then it would seem like query_param should die if
> it receives any strings with the utf8 flag on.  You can't encode_utf8
> or utf8::downgrade because we don't know what (octet) encoding that
> the sender and receiver agreed on.

I basically agree with that view.  It can still be convenient to have
it assume UTF-8 encoding in this case, and there is the potential that
introducing this strictness breaks code.

It could also be argued that it might be helpful to break such code
because it has the potential of already being broken. Consider this
example:

#!perl -wl

use URI;
use charnames ':full';

$u = URI->new("http://www.example.com");
$u->query_form(foo => "bål");
print $u;

$u->query_form(foo => "bål", bar => "\N{WATCH}");
print $u;

__END__

which prints:

  http://www.example.com?foo=b%E5l
  http://www.example.com?foo=b%C3%A5l&bar=%E2%8C%9A

Here the encoding of the first parameter depends on the presence of
the second parameter, which is clearly not a good thing.

--Gisle
