On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> I'd like to quibble with you over this point if I may. :-)
> Per perldoc: JSON::XS
> "utf8" flag disabled
> When "utf8" is disabled (the default), then
> "encode"/"decode" generate and expect Unicode strings ...
>
> So
> - If you are on < 9.1 and a utf8 database you want to pass
> utf8(false), as you have a Unicode string.

Right. That's what I realized yesterday, thanks to our exchange. I updated my code for that. The use of the term "Unicode string" in the JSON::XS docs is really confusing, though. A scalar with the utf8 flag on is not a unicode string. It's Perl's representation of a string. It has no encoding (it's "decoded"). Like I said, the terminology is awful.

> - If you are on < 9.1 and on a non utf8 database you would want to
> pass utf8(false) as the string is *not* Unicode, its byte soup. Its in
> some _other_ encoding say EUC_JP. You would need to decode() it into
> Unicode first.

Or use utf8() or utf8(1). Then JSON::XS will decode it for you.

> - If you are on 9.1 and a utf8 database you still want to pass
> utf8(false) as the string is still unicode.
>
> - if you are on 9.1 and a non utf8 database you want to pass
> utf8(false) as the string is _now_ unicode.

Right.

> So... it seems you always want to pass false. The only case I can
> where you would want to pass true is you are on < 9.1 with a SQL_ASCII
> database and you know for a fact the string represents a utf8 byte
> sequence.
>
> Or am I missing something obvious?

Yes, that you can pass no value to utf8() or a true value and it will decode a utf-8-encoded string for you.

>>> If you do have to change your semantics/functions, could you post an
>>> example? I'd like to make sure its because you were hitting one of
>>> those nasty corner cases and not something new is broken.
>>
>> I think that people who have non-utf-8 databases might be surprised.
>
> Yeah, surprised it does the right thing and its actually usable now ;).

Yes, but they might need to change their code, is what I'm saying.
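
For what it's worth, here is a minimal sketch of the utf8() flag behavior discussed above. It uses the core JSON::PP module as a stand-in, on the assumption that its utf8() method behaves like the one in JSON::XS; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;   # core stand-in; assumed to mirror JSON::XS's utf8() semantics

# The same JSON array in two forms (illustrative values only).
my $bytes = '["' . "\xc3\xa9" . '"]';   # UTF-8 encoded octets for e-acute
my $chars = '["' . "\x{e9}"   . '"]';   # already-decoded character string

# utf8 enabled: decode() expects UTF-8 octets and decodes them itself.
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes)->[0];

# utf8 disabled (the default): decode() expects a decoded character string.
my $from_chars = JSON::PP->new->utf8(0)->decode($chars)->[0];

# Octets handed to a utf8(0) decoder stay as two separate characters.
my $mismatch = JSON::PP->new->utf8(0)->decode($bytes)->[0];

printf "utf8(1) on octets:     length %d\n", length $from_bytes;
printf "utf8(0) on characters: length %d\n", length $from_chars;
printf "utf8(0) on octets:     length %d\n", length $mismatch;
```

The first two cases each yield a single character; the third yields two, which is the kind of mismatch that would need a code change.
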
>
>>>> This probably won't be that common, but Oleg, for example, will need to
>>>> convert his fixed function from:
>
>> No, he had to add the decode line, IIRC:
>>
>> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use strict;
>> use URI::Escape;
>> utf8::decode($_[0]);
>> return uri_unescape($_[0]);
>> $$ LANGUAGE plperlu;
>>
>> Because uri_unescape() needs its argument to be decoded to Perl's internal
>> form. On 9.1, it will be, so he won't need to call utf8::decode(). That is,
>> in a latin-1 database:
>
> Meh, no, not really. He will still need to call decode.

Why? In 9.1, won't params passed to PL/Perl functions in non-SQL_ASCII databases already be decoded?

> The problem is
> uri_unescape() does not assume an encoding on the URI. It could be
> UTF-16 encoded for all it knows (UTF-8 is probably standard, but thats
> not the point, it knows nothing about Unicode or encodings).

Yes, but if you don't want surprises, I think you want to pass a decoded string to it.

> For example, lets say you have a latin-1 accented e "é" the byte
> sequence is the one byte: 0xe9. If you were to uri_escape that you get
> the 3 byte ascii string "%E9":
> $ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
> %E9
>
> If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: e9, len: 1
>
> What if we want to uri_escape a UTF-16 accented e? Thats two hex bytes 0x00e9:
> $ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
> %00%E9
>
> What happens we uri_unescape that? Do we get back a Unicode string
> that has one character? No. And why should we? How is uri_unescape
> supposed to know what %00%E9 represent?
> All it knows is thats 2 separate bytes:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: 00e9, len: 2

Yeah, this is why URI::Escape needs a uri_unescape_utf8() function to complement uri_escape_utf8(). But to get around that, you would of course decode the return value yourself.

> Now, lets say you want to uri_escape a utf8 accented e, thats the two
> byte sequence: 0xc3 0xa9:
> $ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
> %C3%A9
>
> Ok, what happens when we uri_unescape those?:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: c3a9, len: 2
>
> So, plperl will also return 2 characters here.
>
> In the cited case he was passing "%C3%A9" to uri_unescape() and
> expecting it to return 1 character. The additional utf8::decode() will
> tell perl the string is in utf8 so it will then return 1 char. The
> point being, decode is needed and with it, the function will work pre
> and post 9.1.

Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was utf-8, and should be in 9.1 if the database isn't sql_ascii?

> In-fact on a latin-1 database it sure as heck better return two
> characters, it would be a bug if it only returned 1 as that would mean
> it would be treating a series of latin1 bytes as a series of utf8
> bytes!

If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's Perl's internal representation. If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
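
As a footnote to the "decode the return value yourself" remark above, here is a rough sketch in plain Perl. percent_decode() is a hypothetical helper of mine standing in for uri_unescape(), on the assumption that unescaping just maps each %XX to one character with that ordinal; it is not URI::Escape's actual code, and it says nothing about what PL/Perl does with arguments or return values.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): each %XX becomes one character
# with that ordinal value, with no encoding applied.
sub percent_decode {
    my ($in) = @_;
    (my $out = $in) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $out;
}

# Unescaping alone yields two characters, 0xC3 and 0xA9.
my $raw = percent_decode("%C3%A9");

# Decoding the result as UTF-8 turns those two octets into one character.
my $decoded = $raw;
utf8::decode($decoded);

printf "raw:     hex %s, len %d\n", unpack("H*", $raw), length $raw;
printf "decoded: U+%04X, len %d\n", ord($decoded), length $decoded;
```

The raw result has length 2 and the decoded copy has length 1, which matches the %C3%A9 example in the quoted text.
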