Re: [HACKERS] plperlu problem with utf8

David E. Wheeler Sat, 18 Dec 2010 19:37:29 -0800

On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with 
> data that is passed in/out as arguments.  In the case that the 
> server_encoding is UTF-8, this is as trivial as a few macros on the 
> underlying SVs for text-like types.  If the server_encoding is SQL_ASCII (= 
> byte soup), this is a trivial case of doing nothing with the conversion 
> regardless of data type.  For any other server_encoding, the data would need 
> to be converted from the server_encoding to UTF-8, presumably using the 
> built-in conversions before passing it off to the first code path.  A similar 
> handling would need to be done for the return values, again 
> datatype-dependent.


+1
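
The three cases described above can be sketched in plain Perl, outside the backend. This is a minimal illustration, not PL/Perl's actual marshaling code: decode_argument() and the encoding-name mapping are hypothetical, and only Encode::decode() is a real library call.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not PL/Perl's actual code): turn the raw octets of a
# text-like argument into a Perl character string, based on server_encoding.
sub decode_argument {
    my ($octets, $server_encoding) = @_;

    # SQL_ASCII is byte soup: pass the octets through untouched.
    return $octets if $server_encoding eq 'SQL_ASCII';

    # Map PostgreSQL encoding names to Encode names (illustrative subset).
    my %encode_name = (
        UTF8   => 'UTF-8',
        LATIN1 => 'ISO-8859-1',
    );
    my $name = $encode_name{$server_encoding}
        or die "no mapping for server_encoding $server_encoding\n";

    # FB_CROAK makes invalid input fatal instead of silently substituting.
    # Decode a copy, because decode() with a CHECK value may modify its source.
    my $copy = $octets;
    return Encode::decode($name, $copy, Encode::FB_CROAK());
}

my $utf8_octets   = "\xE2\x80\x9Chello\xE2\x80\x9D";  # curly-quoted hello, UTF-8
my $latin1_octets = "caf\xE9";                         # "cafe" with e-acute, Latin-1

print length(decode_argument($utf8_octets,   'UTF8')),      "\n";  # 7
print length(decode_argument($latin1_octets, 'LATIN1')),    "\n";  # 4
print length(decode_argument($utf8_octets,   'SQL_ASCII')), "\n";  # 11
```

Return values would need the mirror-image step, Encode::encode() into the server_encoding, again skipped for SQL_ASCII and bytea.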

> Recent upgrades of the Encode module included with perl 5.10+ have caused 
> issues wherein circular dependencies between Encode and Encode::Alias have 
> made it impossible to load in a Safe container without major pain.  (There 
> may be some better options than I'd had on a previous project, given that 
> we're embedding our own interpreters and accessing more through the XS guts, 
> so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of 
>> your database where you saw the issue?
> 
> I'm not sure what the current plperl runtime does as far as marshaling this, 
> but it would be fairly easy to ensure the parameters came in in perl's 
> internal format given a server_encoding of UTF8 and some type introspection 
> to identify the string-like types/text data.  (Perhaps any type which had a 
> binary cast to text would be a sufficient definition here.  Do domains 
> automatically inherit binary casts from their originating types?) 

Their labels are TEXT. I believe that the only type that should not be treated 
as text is bytea.

>>> 2) it's not utf8, so we just leave it as octets.
>> 
>> Which means Perl will assume that it's Latin-1, IIUC.
> 
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed 
> out earlier.  This would produce bogus results for any non-UTF-8, non-ASCII, 
> non latin-1 encoding, even if it did not generally bite most people in 
> general usage.

Agreed.

> This example seems bogus; wouldn't length be 3 if this is the example text 
> this was run with?  Additionally, since all ASCII is trivially UTF-8, I think 
> a better example would be using a string with hi-bit characters so if this 
> was improperly handled the lengths wouldn't match; length($all_ascii) == 
> length(encode_utf8($all_ascii)) vs length($hi_bit) < 
> length(encode_utf8($hi_bit)).  I don't see that this test shows us much with 
> the test case as given.  The is_utf8() function merely returns the state of 
> the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not 
> be set on ascii-only strings, which are still valid in the UTF-8 encoding), 
> nor does it indicate that there are no hi-bit characters in the string (e.g., 
> with encode_utf8($hi_bit_string), the source string $hi_bit_string (in 
> perl's internal format) with hi-bit characters will have the utf8 flag set, 
> but the return value of encode_utf8 will not, even though the underlying 
> data, as represented in perl, will be identical).


Sorry, I probably had a pasto there. How about this?

    CREATE OR REPLACE FUNCTION perlgets(
        TEXT
    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
       my $text = shift;
       return_next {
           length  => length $text,
           is_utf8 => utf8::is_utf8($text) ? 1 : 0
       };
    $$;

    utf8=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8 
    ────────┼─────────
          7 │ t

    latin=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8 
    ────────┼─────────
         11 │ f

(Yes, I used Latin-1 curly quotes in that last example.) I would argue that it 
should output the same as the first example. That is, PL/Perl should have 
decoded the latin-1 before passing the text to the Perl function.
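
The 11 in the second result equals the number of octets in the UTF-8 encoding of that string (3 + 5 + 3), which is what Perl reports when it is handed octets that were never decoded. A minimal sketch in plain Perl, assuming the function received those UTF-8 octets as-is (an assumption about what happened, not something verified against the backend):

```perl
use strict;
use warnings;
use Encode ();

# The octets of the curly-quoted "hello" encoded as UTF-8: 3 + 5 + 3 = 11.
my $octets = "\xE2\x80\x9Chello\xE2\x80\x9D";

# Undecoded, Perl counts one character per octet and the flag is off.
printf "raw:     length=%d is_utf8=%d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;

# Decoded, Perl sees seven characters and the flag is on.
my $chars = Encode::decode('UTF-8', $octets);
printf "decoded: length=%d is_utf8=%d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;
```

This prints length=11 is_utf8=0 for the raw octets and length=7 is_utf8=1 for the decoded string, matching the two psql results above.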

> 
>> In a latin-1 database:
>> 
>>   latin=# select * from perlgets('foo');
>>    length │ is_utf8 
>>   ────────┼─────────
>>         8 │ f
>>   (1 row)
>> 
>> I would argue that in the latter case, is_utf8 should be true, too. That is, 
>> PL/Perl should decode from Latin-1 to Perl's internal form.
> 
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1 
> data or (more precisely in this case) data that has not been decoded from the 
> server_encoding to perl's internal format, this would exactly be the 
> expectation for the state of that flag.

Right. I think that it *should* be decoded.

>> Interestingly, when I created a function that takes a bytea argument, utf8 
>> was *still* enabled in the utf-8 database. That doesn't seem right to me.
> 
> I'm not sure what you mean here, but I do think that if bytea is identifiable 
> as one of the input types, we should do no encoding on the data itself, which 
> would indicate that the utf8 flag for that variable would be unset.  

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea 
> should just be an array of bytes with no character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a 
bytea string passed to a PL/Perl function in a UTF-8 database.

> As shown above, the character length for the example should be 27, while the 
> octet length for the UTF-8 encoded version is 28.  I've reviewed the source 
> of URI::Escape, and can say definitively that: a) regular uri_escape does not 
> handle > 255 code points in the encoding, but there exists a uri_escape_utf8 
> which will convert the source string to UTF8 first and then escape the 
> encoded value, and b) uri_unescape has *no* logic in it to automatically 
> decode from UTF8 into perl's internal format (at least as far as the version 
> that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your 
> responsibility to do so after you've unescaped it.  Perhaps later versions of 
> the URI::Escape module contain a uri_unescape_utf8() function, but it's 
> trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. 
>  This is definitely not a bug in uri_escape, as it is only defined to return 
> octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a 
uri_unescape_utf8() though, as it might prevent some confusion.
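
For illustration, here is roughly what such a function could look like. To keep the sketch free of CPAN dependencies, unescape_octets() is a hypothetical stand-in for URI::Escape's uri_unescape(), and uri_unescape_utf8() is the hypothetical name from this thread, not an existing URI::Escape function as far as this thread establishes.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for URI::Escape's uri_unescape so the sketch has no CPAN
# dependency: turn each %XX escape back into a single octet.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $str;
}

# Hypothetical uri_unescape_utf8: unescape to octets, then decode as UTF-8.
sub uri_unescape_utf8 {
    return Encode::decode('UTF-8', unescape_octets(shift));
}

my $escaped = '%E2%80%9Chello%E2%80%9D';
print length(unescape_octets($escaped)),   "\n";   # 11 octets
print length(uri_unescape_utf8($escaped)), "\n";   # 7 characters
```

The plain unescape returns octets only, as it is defined to; the decode step is what turns them into characters.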

>>> Yeah, the patch addresses this part.  Right now we just spit out
>>> whatever the internal format happens to be.
>> 
>> Ah, excellent.
> 
> I agree with the sentiments that: data (server_encoding) -> function 
> parameters (-> perl internal) -> function return (-> server_encoding).  This 
> should be for any character-type data insofar as it is feasible, but ISTR 
> there is already datatype-specific marshaling occurring.

Dunno about that.
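
Whatever the current code does, the round trip described in the quoted paragraph is easy to sketch in plain Perl. This assumes a LATIN1 server_encoding purely for illustration; the variable names are made up, and uc() stands in for the body of a PL/Perl function.

```perl
use strict;
use warnings;
use Encode ();

my $server_encoding = 'ISO-8859-1';   # assumed for this example
my $from_server     = "caf\xE9";      # "cafe" with e-acute, as Latin-1 octets

# server_encoding -> Perl's internal character string
my $arg = Encode::decode($server_encoding, $from_server);

# The function body works on characters, so uc() sees U+00E9.
my $result = uc $arg;

# Perl's internal character string -> server_encoding
my $to_server = Encode::encode($server_encoding, $result);

print unpack('H*', $to_server), "\n";   # 434146c9
```

The last octet comes back as C9, the Latin-1 code for the uppercase E-acute, so the character semantics survived the trip in both directions.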

> There is definitely a lot of confusion surrounding perl's handling of 
> character data; I hope this was able to clear a few things up.

Yes, it helped, thanks!

David


