On Feb 10, 2011, at 5:28 PM, Alex Hunsaker wrote:

> Hrm? For UTF-8 databases, in practice, nothing should have changed--
> we already passed strings in as utf8. What I fixed was some corner
> cases where some strings did not always have character semantics. See
> The "Unicode Bug" and "Forcing Unicode in Perl" in perldoc perlunicode
> for the problem and more or less how I fixed it.

Uh…

try=# create function is_utf8(text) returns boolean language plperl AS 
'utf8::is_utf8(shift)';
CREATE FUNCTION

try=# select is_utf8('whatever');
 is_utf8 
─────────
 t
(1 row)

try=# select is_utf8(U&'\0441\043B\043E\043D');
 is_utf8 
─────────
 t
(1 row)

Damn, I guess you're right. How did I miss that?

> The other thing that changed is non UTF-8 databases now also get
> character semantics. That is we convert from the database encoding
> into utf8 and visa versa on output. That probably should be noted
> somewhere...

Oh. I see. And Oleg's database wasn't utf-8 then, I guess. I'll have to re-read 
the JSON docs, I guess. Erm…feh. Okay. I have to pass the false value to utf8() 
*now*. Okay, at least that's more consistent.

> If you do have to change your semantics/functions, could you post an
> example? I'd like to make sure its because you were hitting one of
> those nasty corner cases and not something new is broken.

I think that people who have non-utf-8 databases might be surprised.

>> This probably won't be that common, but Oleg, for example, will need to 
>> convert his fixed function from:
>> ...
> 
> Well assuming he fixed his bug by encoding uri_unescape's output he
> should not have to do anything.  IIRC the problem was basically double
> encoded utf8, not a postgres bug.

No, the problem was that the string was passed to his pl/perl function encoded 
in utf-8. He added a line to decode it to Perl's internal form. Once he goes to 
9.1, unless the database is SQL_ASCII, he can dump the decode() line. I think.

> [ he had %3A%4A or something, uri_decode() decodes that to _two_
> characters because _it_ knows nothing about utf8. so you would need to
> call utf8::decode() on the result to turn those two bytes into a
> character ]

No, he had to add the decode line, IIRC:

CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
 use strict;
 use URI::Escape;
 utf8::decode($_[0]);
 return uri_unescape($_[0]); $$ LANGUAGE plperlu;

Because uri_unescape() needs its argument to be decoded to Perl's internal 
form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in 
a latin-1 database:

latin=# create or replace function is_utf8(text) returns boolean language 
plperl AS 'utf8::is_utf8(shift) ? 1 : 0';
CREATE FUNCTION
Time: 1.934 ms
latin=# select is_utf8('whatever');                                             
            is_utf8 
─────────
 f
(1 row)

That will change, if I understand correctly.


>> So this needs to be highlighted in the release notes: If a PL/Perl function 
>> is currently relying on a parameter passed in bytes, it will >need to be 
>> modified to deal with utf8 strings, instead.
> 
> FYI Andrew did add some docs.

Yeah, I was thinking of the release notes. Those who have non-uft-8 databases 
might be surprised if their PL/Perl functions expect strings to be passed as 
bytes.

> Thanks for keeping a sharp eye out.
> 
> [ P.S. This stuff is confusing as hell, im just glad I got a sucker to
> commit it *waves* at Andrew :-) ]

Heh, well done. Frankly, though, this stuff isn't *that* hard. It's Perl's 
terminology  that's really bad. 

Best,

David
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to