Re: [HACKERS] Careful PL/Perl Release Not Required

David E. Wheeler Fri, 11 Feb 2011 09:17:14 -0800

On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> I'd like to quibble with you over this point if I may. :-)
> Per perldoc: JSON::XS
> "utf8" flag disabled
>           When "utf8" is disabled (the default), then
> "encode"/"decode" generate and expect Unicode strings ...
> 
> So
> - If you are on < 9.1 and a utf8 database you want to pass
> utf8(false), as you have a Unicode string.


Right. That's what I realized yesterday, thanks to our exchange. I updated my 
code for that. The use of the term "Unicode string" in the JSON::XS docs is 
really confusing, though. A scalar with the utf8 flag on is not a Unicode 
string. It's Perl's representation of a string. It has no encoding (it's 
"decoded").

Like I said, the terminology is awful.

> - If you are on < 9.1 and on a non utf8 database you would want to
> pass utf8(false) as the string is *not* Unicode, its byte soup. Its in
> some _other_ encoding say EUC_JP. You would need to decode() it into
> Unicode first.

Or use utf8() or utf8(1). Then JSON::XS will decode it for you.

> - If you are on 9.1 and a utf8 database you still want to pass
> utf8(false) as the string is still unicode.
> 
> - if you are on 9.1 and a non utf8 database you want to pass
> utf8(false) as the string is _now_ unicode.

Right.

> So... it seems you always want to pass false. The only case I can
> think of where you would want to pass true is if you are on < 9.1 with
> a SQL_ASCII database and you know for a fact the string represents a
> utf8 byte sequence.
> 
> Or am I missing something obvious?

Yes: you can pass no value to utf8(), or a true value, and it will decode a 
UTF-8-encoded string for you.
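
A minimal sketch of the two modes, for illustration only. It assumes the core module JSON::PP as a stand-in for JSON::XS (the two share the utf8() interface), and the sample JSON text and variable names are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: core stand-in for JSON::XS, same utf8() interface

# The JSON text {"name":"é"} as UTF-8 encoded bytes (é is 0xC3 0xA9).
my $bytes = qq({"name":"\xc3\xa9"});

# utf8(1): decode() expects UTF-8 bytes and hands back decoded strings.
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes);

# utf8(0), the default: decode() expects an already-decoded string,
# so the bytes are decoded by hand first.
my $chars = $bytes;
utf8::decode($chars) or die "input was not valid UTF-8";
my $from_chars = JSON::PP->new->utf8(0)->decode($chars);

# Both routes end with the same one-character value.
printf "%d %d\n", length $from_bytes->{name}, length $from_chars->{name};
```

Either way the value of "name" comes out as a single character; the flag only decides who does the decoding.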

>>> If you do have to change your semantics/functions, could you post an
>>> example? I'd like to make sure its because you were hitting one of
>>> those nasty corner cases and not something new is broken.
>> 
>> I think that people who have non-utf-8 databases might be surprised.
> 
> Yeah, surprised it does the right thing and its actually usable now ;).

Yes, but they might need to change their code, is what I'm saying.

> 
>>>> This probably won't be that common, but Oleg, for example, will need to 
>>>> convert his fixed function from:
> 
>> No, he had to add the decode line, IIRC:
>> 
>> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>>  use strict;
>>  use URI::Escape;
>>  utf8::decode($_[0]);
>>  return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>> 
>> Because uri_unescape() needs its argument to be decoded to Perl's internal 
>> form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, 
>> in a latin-1 database:
> 
> Meh, no, not really. He will still need to call decode.

Why? In 9.1, won't params passed to PL/Perl functions in non-SQL_ASCII 
databases already be decoded?

> The problem is
> uri_unescape() does not assume an encoding on the URI. It could be
> UTF-16 encoded for all it knows (UTF-8 is probably standard, but thats
> not the point, it knows nothing about Unicode or encodings).

Yes, but if you don't want surprises, I think you want to pass a decoded string 
to it.

> For example, lets say you have a latin-1 accented e "é" the byte
> sequence is the one byte: 0xe9. If you were to uri_escape that you get
> the 3 byte ascii string "%E9":
> $ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
> %E9
> 
> If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: e9, len: 1
> 
> What if we want to uri_escape a UTF-16 accented e? Thats two hex bytes 0x00e9:
> $ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
> %00%E9
> 
> What happens we uri_unescape that? Do we get back a Unicode string
> that has one character? No. And why should we? How is uri_unescape
> supposed to know what %00%E9 represent? All it knows is thats 2
> separate bytes:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: 00e9, len: 2

Yeah, this is why URI::Escape needs a uri_unescape_utf8() function to 
complement uri_escape_utf8(). But to get around that, you would of course 
decode the return value yourself.
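
A rough sketch of that workaround, using only core Perl. The percent_decode() helper is hypothetical: it stands in for uri_unescape() by turning each %XX into one byte, so the example does not depend on URI::Escape being installed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): each %XX becomes one byte.
sub percent_decode {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

my $raw        = percent_decode("%C3%A9");   # two bytes: 0xC3 0xA9
my $len_before = length $raw;

# Decode the result yourself: treat those bytes as UTF-8, in place.
my $text = $raw;
utf8::decode($text) or die "result was not valid UTF-8";
my $len_after = length $text;

printf "before: %d, after: %d, code point: U+%04X\n",
    $len_before, $len_after, ord $text;
```

This assumes the escaped data really is UTF-8. The unescaping step has no way of knowing that, so the caller has to.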

> Now, lets say you want to uri_escape a utf8 accented e, thats the two
> byte sequence: 0xc3 0xa9:
> $ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
> %C3%A9
> 
> Ok, what happens when we uri_unescape those?:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: c3a9, len: 2
> 
> So, plperl will also return 2 characters here.
> 
> In the cited case he was passing "%C3%A9" to uri_unescape() and
> expecting it to return 1 character. The additional utf8::decode() will
> tell perl the string is in utf8 so it will then return 1 char. The
> point being, decode is needed and with it, the function will work pre
> and post 9.1.

Why wouldn't the string be decoded already when it's passed to the function, as 
it would be in 9.0 if the database was utf-8, and should be in 9.1 if the 
database isn't sql_ascii?

> In-fact on a latin-1 database it sure as heck better return two
> characters, it would be a bug if it only returned 1 as that would mean
> it would be treating a series of latin1 bytes as a series of utf8
> bytes!

If it's a latin-1 database, in 9.1, the argument should be passed decoded. 
That's not a utf-8 string or bytes. It's Perl's internal representation.

If I understand the patch correctly, the decode() will no longer be needed. The 
string will *already* be decoded.
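
To illustrate what "decoded" means here, a small sketch in plain Perl, outside PostgreSQL. It says nothing about what PL/Perl itself does with arguments; it only shows that once bytes are decoded from their source encoding, the resulting Perl string is the same whichever encoding it came from. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_bytes = "\xe9";        # é encoded as Latin-1 (one byte)
my $utf8_bytes   = "\xc3\xa9";    # é encoded as UTF-8 (two bytes)

# Decode each from its own encoding into Perl's internal form.
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

# Both are now the same one-character string, U+00E9.
print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";
print length($from_latin1), "\n";
```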

Best,

David

