Re: [HACKERS] plperlu problem with utf8

Alex Hunsaker Fri, 17 Dec 2010 22:44:37 -0800

On Fri, Dec 17, 2010 at 22:32, David Christensen <da...@endpoint.com> wrote:
>
> On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:
>
>> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>>
>>>> No, URI::Escape is fine. The issue is that if you don't decode text to 
>>>> Perl's internal form, it assumes that it's Latin-1.
>>>
>>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>>
>> Not knowing what those mean, I'm not saying either one, to my knowledge. 
>> What I understand, however, is that Perl, given a scalar with bytes in it, 
>> will treat it as latin-1 unless the utf8 flag is turned on.
>
> This is a correct assertion as to Perl's behavior.  As far as PostgreSQL 
> is/should be concerned in this case, this is the correct handling for 
> URI::Escape,


Right, so no Postgres bug here. Postgres showing Ã© instead of é is
right as far as it's concerned.
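
To make the Ã© concrete, here is a minimal sketch in plain Perl (outside
PostgreSQL) of what happens when the UTF-8 octets for é are never decoded.
The helper name code_points is hypothetical and exists only for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "é" (U+00E9) is the two octets 0xC3 0xA9.
my $octets = "\xc3\xa9";

# Hypothetical helper: list the code points Perl sees in a string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Undecoded, Perl sees two characters: U+00C3 (Ã) and U+00A9 (©).
print code_points($octets), "\n";                    # U+00C3 U+00A9

# Decoded, it is the single character chr(233).
print code_points(decode('UTF-8', $octets)), "\n";   # U+00E9
```

So "\xc3\xa9" is not eq chr(233) until something decodes it.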

>> PostgreSQL should do everything it can to decode to Perl's internal format 
>> before passing arguments, and to decode from Perl's internal format on 
>> output.
>
> +1 on the original sentiment, but only for the case that we're dealing with 
> data that is passed in/out as arguments.  In the case that the 
> server_encoding is UTF-8, this is as trivial as a few macros on the 
> underlying SVs for text-like types.  If the server_encoding is SQL_ASCII (= 
> byte soup), this is a trivial case of doing nothing with the conversion 
> regardless of data type.

Right, and that's what we do for the above, minus some mishandling of
non-character datatypes like bytea in the UTF-8 case.

> For any other server_encoding, the data would need to be converted from the 
> server_encoding to UTF-8, presumably using the built-in conversions before 
> passing it off to the first code path.  A similar handling would need to be 
> done for the return values, again datatype-dependent.

Yeah, that's what we *should* do.  Right now we just leave it as byte
soup for the user to decode/encode. :(
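
As a rough sketch of what "the user decodes/encodes" looks like today, in
plain Perl rather than inside a plperl function: the encoding name
iso-8859-2 is only an assumed example of a non-UTF-8 server_encoding, and
the sub name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for illustration: the server_encoding is LATIN2.
my $server_encoding = 'iso-8859-2';

# Hypothetical helper: decode the incoming octets to Perl characters,
# do the character operation, then encode back for the server.
sub upper_in_server_encoding {
    my ($octets) = @_;
    my $chars = decode($server_encoding, $octets);
    return encode($server_encoding, uc $chars);
}

# 0xB3 is "ł" in ISO-8859-2; its uppercase "Ł" is 0xA3.
printf "%02X\n", ord(upper_in_server_encoding("\xb3"));
```

Without the decode step, uc has no way to know that 0xB3 is a lowercase
letter.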

> [ correctness of perl character ops in the non utf8 case] One thought I had 
> was that we could expose the server_encoding to the plperl interpreters in a 
> special variable to make it easy to explicitly decode...

We should not need to do anything as complicated as that. We can just
encode the string to UTF-8 before we hand it off to Perl.

[...]
> $ perl -MURI::Escape -e'print 
> length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
> 28
>
> $ perl -MEncode -MURI::Escape -e'print 
> length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
> 27
[...]
> As shown above, the character length for the example should be 27, while the 
> octet length for the UTF-8 encoded version is 28.  I've reviewed the source 
> of URI::Escape, and can say definitively that: a) regular uri_escape does not 
> handle > 255 code points in the encoding, but there exists a uri_escape_utf8 
> which will convert the source string to UTF8 first and then escape the 
> encoded value, and

And why should it? Properly escaped URIs should have all those
escaped, I imagine.  Anyway, not really relevant for Postgres.

> b) uri_unescape has *no* logic in it to automatically decode from UTF8 into 
> perl's internal format (at least as far as the version that I'm looking at, 
> which came with 5.10.1).

>>> Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape().  Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> -1; if you need to decode from an octets-only encoding, it's your 
> responsibility to do so after you've unescaped it.

-1? That's basically what I said:  "... you need to do it (decode the
utf8) *after* you call uri_unescape"

>  Perhaps later versions of the URI::Escape module contain a 
> uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { 
> Encode::decode_utf8(uri_unescape(shift))}.  This is definitely not a bug in 
> uri_escape, as it is only defined to return octets.

Ahh, so -1 because I said maybe you could call it a bug in
uri_unescape(). Really, I was only saying you *might* be able to
consider it a bug, or perhaps deficiency is a better word, in
uri_unescape iff URIs are defined to have escaped characters as a
%-escaped UTF-8 sequence.  I don't know that they are, so I don't know
if it's a bug :)
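
Either way, the decode-after-unescape order can be sketched with core
modules only. percent_unescape below is a hypothetical stand-in for
uri_unescape (so the sketch does not depend on URI::Escape being
installed), and percent_unescape_utf8 is likewise an illustrative name,
not an existing function.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape::uri_unescape: turns each %XX
# escape into the single octet it names and nothing more.
sub percent_unescape {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $str;
}

# Unescape first, then decode the resulting octets as UTF-8.
sub percent_unescape_utf8 {
    my ($str) = @_;
    return decode('UTF-8', percent_unescape($str));
}

my $uri = 'comment%20passer%20le%20r%C3%A9veillon';
print length(percent_unescape($uri)), "\n";        # 28 (octets)
print length(percent_unescape_utf8($uri)), "\n";   # 27 (characters)
```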

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
