On Dec 17, 2010, at 9:32 PM, David Christensen wrote:
> +1 on the original sentiment, but only for the case that we're dealing with
> data that is passed in/out as arguments. In the case that the
> server_encoding is UTF-8, this is as trivial as a few macros on the
> underlying SVs for text-like types. If the server_encoding is SQL_ASCII (=
> byte soup), this is a trivial case of doing nothing with the conversion
> regardless of data type. For any other server_encoding, the data would need
> to be converted from the server_encoding to UTF-8, presumably using the
> built-in conversions before passing it off to the first code path. A similar
> handling would need to be done for the return values, again
> datatype-dependent.
+1
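
To make the three cases concrete, here is a rough sketch in plain Perl. It is only an illustration: the real work would happen in C inside PL/Perl, presumably with the server's own conversion routines, and Encode is used here purely as a stand-in. The names arg_to_perl, result_from_perl and the %ENCODE_NAME mapping are made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical mapping from server_encoding names to Encode names;
# only the entries needed for this illustration.
my %ENCODE_NAME = (
    UTF8   => 'UTF-8',
    LATIN1 => 'iso-8859-1',
);

# Argument path: octets in server_encoding -> Perl character string.
sub arg_to_perl {
    my ($server_encoding, $octets, $is_bytea) = @_;
    # bytea and SQL_ASCII are byte soup: pass them through untouched.
    return $octets if $is_bytea || $server_encoding eq 'SQL_ASCII';
    my $name = $ENCODE_NAME{$server_encoding}
        or die "no mapping for $server_encoding";
    return Encode::decode($name, $octets);
}

# Return path: Perl character string -> octets in server_encoding.
sub result_from_perl {
    my ($server_encoding, $string, $is_bytea) = @_;
    return $string if $is_bytea || $server_encoding eq 'SQL_ASCII';
    my $name = $ENCODE_NAME{$server_encoding}
        or die "no mapping for $server_encoding";
    return Encode::encode($name, $string);
}

# "cafe" with an e-acute, as it would arrive from each kind of database.
print length(arg_to_perl('LATIN1',    "caf\xE9")),     "\n";  # 4 characters
print length(arg_to_perl('UTF8',      "caf\xC3\xA9")), "\n";  # 4 characters
print length(arg_to_perl('SQL_ASCII', "caf\xC3\xA9")), "\n";  # 5 octets, untouched
```

The point is that the function body sees the same four characters regardless of the database encoding, and only SQL_ASCII and bytea are left as raw octets.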
> Recent upgrades of the Encode module included with perl 5.10+ have caused
> issues wherein circular dependencies between Encode and Encode::Alias have
> made it impossible to load in a Safe container without major pain. (There
> may be some better options than I'd had on a previous project, given that
> we're embedding our own interpreters and accessing more through the XS guts,
> so I'm not ruling out this possibility completely).
Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.
>> Well that works for me. I always use UTF8. Oleg, what was the encoding of
>> your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this,
> but it would be fairly easy to ensure the parameters came in in perl's
> internal format given a server_encoding of UTF8 and some type introspection
> to identify the string-like types/text data. (Perhaps any type which had a
> binary cast to text would be a sufficient definition here. Do domains
> automatically inherit binary casts from their originating types?)
Their labels are TEXT. I believe that the only type that should not be treated
as text is bytea.
>>> 2) it's not utf8, so we just leave it as octets.
>>
>> Which means Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed
> out earlier. This would produce bogus results for any non-UTF-8, non-ASCII,
> non-Latin-1 encoding, even if it did not bite most people in general usage.
Agreed.
> This example seems bogus; wouldn't length be 3 if this is the example text
> this was run with? Additionally, since all ASCII is trivially UTF-8, I think
> a better example would be using a string with hi-bit characters so if this
> was improperly handled the lengths wouldn't match; length($all_ascii) ==
> length(encode_utf8($all_ascii)) vs length($hi_bit) <
> length(encode_utf8($hi_bit)). I don't see that this test shows us much with
> the test case as given. The is_utf8() function merely returns the state of
> the SvUTF8 flag, which doesn't speak to UTF-8 validity (i.e., it need not
> be set on ASCII-only strings, which are still valid in the UTF-8 encoding),
> nor does it indicate whether there are hi-bit characters in the string. For
> example, with encode_utf8($hi_bit_string), the source string $hi_bit_string
> (in perl's internal format) with hi-bit characters will have the utf8 flag
> set, but the return value of encode_utf8 will not, even though the
> underlying data, as represented in perl, will be identical.
Sorry, I probably had a pasto there. How about this?
CREATE OR REPLACE FUNCTION perlgets(
    TEXT
) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
    my $text = shift;
    return_next {
        length  => length $text,
        is_utf8 => utf8::is_utf8($text) ? 1 : 0
    };
$$;
utf8=# SELECT * FROM perlgets('“hello”');
length │ is_utf8
────────┼─────────
7 │ t
latin=# SELECT * FROM perlgets('“hello”');
length │ is_utf8
────────┼─────────
11 │ f
(Yes, I used Latin-1 curly quotes in that last example.) I would argue that it
should output the same as the first example. That is, PL/Perl should have
decoded the Latin-1 before passing the text to the Perl function.
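
For what it's worth, the 7 and 11 above are what plain Perl reports for the character length and the UTF-8 octet length of that string. A minimal sketch outside the database, using Encode (it says nothing about what PL/Perl does internally; the variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Curly quotes written as code points, so the source file encoding is irrelevant.
my $chars   = "\x{201C}hello\x{201D}";  # character string in Perl's internal form
my $octets  = encode_utf8($chars);      # the same text as UTF-8 octets
my $decoded = decode_utf8($octets);     # decoded back into characters

for my $pair ([chars => $chars], [octets => $octets], [decoded => $decoded]) {
    my ($label, $value) = @$pair;
    printf "%-8s length=%d is_utf8=%d\n",
        $label, length($value), utf8::is_utf8($value) ? 1 : 0;
}
# chars    length=7 is_utf8=1
# octets   length=11 is_utf8=0
# decoded  length=7 is_utf8=1
```

The second line matches what the function sees in the latin database, and the third is what I would expect it to see once the input is decoded.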
>
>> In a latin-1 database:
>>
>> latin=# select * from perlgets('foo');
>> length │ is_utf8
>> ────────┼─────────
>> 8 │ f
>> (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is,
>> PL/Perl should decode from Latin-1 to Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1
> data or (more precisely in this case) data that has not been decoded from the
> server_encoding to perl's internal format, this would exactly be the
> expectation for the state of that flag.
Right. I think that it *should* be decoded.
>> Interestingly, when I created a function that takes a bytea argument, utf8
>> was *still* enabled in the utf-8 database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable
> as one of the input types, we should do no encoding on the data itself, which
> would indicate that the utf8 flag for that variable would be unset.
Right.
> If this is not currently handled this way, I'd be a bit surprised, as bytea
> should just be an array of bytes with no character semantics attached to it.
It looks as though it is not handled that way. The utf8 flag *is* set on a
bytea string passed to a PL/Perl function in a UTF-8 database.
> As shown above, the character length for the example should be 27, while the
> octet length for the UTF-8 encoded version is 28. I've reviewed the source
> of URI::Escape, and can say definitively that: a) regular uri_escape does not
> handle > 255 code points in the encoding, but there exists a uri_escape_utf8
> which will convert the source string to UTF8 first and then escape the
> encoded value, and b) uri_unescape has *no* logic in it to automatically
> decode from UTF8 into perl's internal format (at least as far as the version
> that I'm looking at, which came with 5.10.1).
Right.
> -1; if you need to decode from an octets-only encoding, it's your
> responsibility to do so after you've unescaped it. Perhaps later versions of
> the URI::Escape module contain a uri_unescape_utf8() function, but it's
> trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}.
> This is definitely not a bug in uri_escape, as it is only defined to return
> octets.
Right, I think we're agreed on that count. I wouldn't mind seeing a
uri_unescape_utf8() though, as it might prevent some confusion.
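
Something along these lines, following the one-liner quoted above. This is only a sketch: uri_unescape_utf8 is a hypothetical helper, and unescape_octets is a hand-rolled stand-in for URI::Escape's uri_unescape so that the example runs with core modules only.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for URI::Escape::uri_unescape: turn each %XX into the octet it names.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Hypothetical helper: unescape to octets, then decode those octets as UTF-8.
sub uri_unescape_utf8 {
    return Encode::decode_utf8(unescape_octets(shift));
}

my $escaped = '%E2%80%9Chello%E2%80%9D';    # curly-quoted "hello", UTF-8 escaped
printf "%d\n", length(unescape_octets($escaped));    # 11 octets
printf "%d\n", length(uri_unescape_utf8($escaped));  # 7 characters
```

Unescaping alone returns octets, as it is defined to; the decode step is what gets you back to characters.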
>>> Yeah, the patch addresses this part. Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
>
> I agree with the sentiments that: data (server_encoding) -> function
> parameters (-> perl internal) -> function return (-> server_encoding). This
> should be for any character-type data insofar as it is feasible, but ISTR
> there is already datatype-specific marshaling occurring.
Dunno about that.
> There is definitely a lot of confusion surrounding perl's handling of
> character data; I hope this was able to clear a few things up.
Yes, it helped, thanks!
David
--
Sent via pgsql-hackers mailing list