Re: [HACKERS] plperlu problem with utf8

David Christensen Sun, 19 Dec 2010 20:00:36 -0800

On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:

> On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <[email protected]> wrote:
>> ...
>> I would argue that it should output the same as the first example. That is, 
>> PL/Perl should have decoded the latin-1 before passing the text to the Perl 
>> function.
> 
> Yeah, I don't think you will find anyone who disagrees :)  PL/TCL and
> PL/Python get this right FWIW.  Anyway find attached a patch that does
> just this.


Cool, thanks for taking this on.

> With the attached we:
> - for function arguments, convert (using pg_do_encoding_conversion) to
> utf8 from the current database encoding.  We also turn on the utf8
> flag so perl knows it was given utf8.  Pre patch things only really
> worked for SQL_ASCII or PG_UTF8 databases.  In practice everything
> worked fine for single byte charsets.  However things like uc() or
> lc() on bytes with high bits set were probably broken.

How does this deal with input records of composite type?

> - for output from perl convert from perls internal format to utf8
> (using SvPVutf8()), and then convert that to the current database
> encoding. This sounds unoptimized, but in the common case SvPVutf8()
> should be a noop.  Pre patch this was "random" (dependent on how perl
> decided to represent the string internally) but it worked 99% of the
> time (at least in the single byte charset or UTF8 cases).
> 
> - fix errors so they properly encode their message to the current
> database encoding (pre patch we were doing no conversion at all,
> similar to the output case were it worked most of the time)

This sounds good; I imagine in practice most errors contain just 7-bit ascii 
which should be acceptable in any server_encoding, but in the case where 
something is returned that is unable to be represented in the server_encoding 
(thinking values defined/used in the function itself), does it degrade to the 
current behavior, as opposed to fail or eat the error message without 
outputting?

> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
> plperl). Pre patch this only happened on a UTF8 database.  That meant
> multi-byte character regexs were broken on non utf8 databases.

This sounds good in general, but I think should be skipped if 
GetDatabaseEncoding() == SQL_ASCII.

> -remove some old perl version checks for 5.6 and 5.8.  We require
> 5.8.1 so these were nothing but clutter.

+1.  Can't complain about removing clutter :-).

> Something interesting to note is when we are SQL_ASCII,
> pg_do_encoding_conversion() does nothing, yet we turn on the utf8
> flag.  This means if you pass in valid utf8 perl will treat it as
> such.  It also means on output it will hand utf8 back.  Both PL/Tcl
> and PL/Python do the same thing so I suppose its sensible to match
> their behavior (and it was the lazy thing to do).  The difference
> being with PL/Python if you return said string you get "ERROR:
> PL/Python: could not convert Python Unicode object to PostgreSQL
> server encoding".  While PL/Tcl and now Pl/perl give you back a utf8
> version.  For example:
> 
> (SQL_ASCII database)
> =# select length('☺');
> length
> --------
>      3
> 
> =# CREATE FUNCTION tcl_len(text) returns text as  $$ return [string
> length $1] $$ language pltcl;
> CREATE FUNCTION
> postgres=# SELECT tcl_len('☺');
> tcl_len
> ------------
> 1
> (1 row)
> 
> =# CREATE FUNCTION py_len(str text) returns text as  $$ return
> len(str) $$ language plpython3;
> =# SELECT py_len('☺');
> py_len
> --------
> 1
> (1 row)
> 
> I wouldn't really say thats right, but its at least consistent...

I think this could/should be adequately handled by not calling the function 
when the DatabaseEncoding is SQL_ASCII.  Since SQL_ASCII basically makes no 
assumptions about any representation other than arbitrary 8-bit encoding, this 
demonstrated behavior is more-or-less attaching incorrect semantics to values 
that are being returned, and is completely bunko IMHO.  (Not that many people 
are hopefully running SQL_ASCII at this point, but you never know...)  Also, 
I'd argue that pltcl and plpython should be fixed as well for the same reasons.

> This does *not* address the bytea issue where currently if you have
> bytea input or output we try to encode that the same as any string.  I
> think thats going to be a bit more invasive and this patch should
> stands on its own.
> <plperl_fix_enc.patch.gz>

Yeah, I'm not sure how invasive that will end up being, or if there are other 
datatypes which should skip the text processing.

I noticed you moved the declaration of perl_sys_init_done; was that an 
independent bug fix, or did something in the patch require that?

Cheers,

David
--
David Christensen
End Point Corporation
[email protected]





-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plperlu problem with utf8

Reply via email to