Re: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

David E. Wheeler Thu, 14 Jul 2011 09:24:21 -0700

On Jul 14, 2011, at 6:23 AM, Greg Sabino Mullane wrote:

>> I find this description confusing. What is the default value for that
>> setting? I mean, how can one know that?
> 
> There is no default: it's computed on the fly at connection time, based 
> on the server_encoding and the client_encoding.


Yeah, that's what I meant. It's difficult to comprehend how it calculates a 
value if you don't specify one.

> As the client_encoding 
> defaults to the server_encoding, the only way it can be different is 
> in the rare case that someone has set it inside of postgresql.conf. In 
> which case, we respect that and don't do any transformations at all.

It can also differ because of the PGCLIENTENCODING environment variable, which
sets client_encoding without touching postgresql.conf:
http://www.postgresql.org/docs/9.0/static/multibyte.html#AEN30737

>> But we strongly recommend you set it explicitly to avoid confusion. And 
>> really, setting it to 1 is strongly recommended for proper and transparent 
>> handling of multibyte characters.
> 
> Yes, or some wording along the lines of "this is an expert knob, and you
> really ought to leave it alone unless you really know what you are doing".

Maybe. I'm not convinced, because if you don't set it yourself, the thing it 
decides to do may or may not be what you expect, and it would be hard to figure 
out why.

>> +DWC suggested a DBD::db attribute handle, suggested to be called
>> +"encoding" which when set would effectively pass-thru to the
>> +underlying: "SET client_encoding = $blah" and *disable* the
>> +pg_internal flag.  Specifically, by setting the encoding attribute,
>> +you are effectively indicating that you want the data from PostgreSQL
>> +back
> 
>> I like this *so* much better.
> 
> Better than? This is in addition to the above, to be clear. This is 
> basically a shortcut for someone setting pg_unicode false and issuing 
> a "SET client_encoding = 'foo'".

Unless I set it to "utf8", in which case pg_unicode would be true and 
client_encoding would be set to "UTF-8". Right?

> I'm still on the fence about making 
> such a shortcut into a formal call. The advantage is that it removes 
> the case where someone sets client_encoding manually but forgets to 
> switch pg_unicode off.

From the user's perspective, I think it makes much more sense. It says, "Here 
is what I want the encoding to be," which is easier to understand than "Should 
we or should we not convert the incoming data to Perl's internal form." Most 
people won't know WTF that means.

>> Seems to me that with pg_encoding you don't need pg_internal at all. You 
>> just have a default value for pg_encoding, which would be:
>> 
>> * If "client_encoding" is not set to its default value, DBD::Pg assumes that 
>> the choice is explicit, so use that.
>> * Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
>> * Else use "utf-8".
> 
> We still need a flag to know if we are unicoding or not. We cannot tell just 
> from a stored client_encoding.

Why not? That's what pg_unicode was figuring out on its own if you didn't set 
it.
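
To make the three rules quoted above concrete, here is a rough sketch. The
function name and the $client_is_default argument are made up for
illustration; nothing like this exists in DBD::Pg today, and how the driver
would detect an explicitly set client_encoding is left open.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a default for the proposed pg_encoding attribute.
# $client and $server are the client_encoding and server_encoding values.
# $client_is_default is true when nobody set client_encoding explicitly
# (via postgresql.conf, PGCLIENTENCODING, or a SET command).
sub default_pg_encoding {
    my ($client, $server, $client_is_default) = @_;

    # An explicit client_encoding is treated as the user's choice.
    return $client unless $client_is_default;

    # SQL_ASCII means "just bytes", so leave it alone.
    return 'SQL_ASCII' if uc($server) eq 'SQL_ASCII';

    # Otherwise default to UTF-8.
    return 'utf-8';
}
```

With something like this, pg_encoding always has a value, whether or not the
user supplied one.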

>> +Behavior changes if pg_internal is set
>> +--------------------------------------
> 
>> Or if pg_encoding eq 'utf-8'.
> 
> No: what if someone changes the encoding later? In that case, we do *not* 
> want to unicodalize (yep, making up words left and right here) the strings 
> coming back from the database.

Yes we do, unless that encoding is SQL_ASCII. If, however, someone does *not* 
want the data decoded (or encoded when sending to the database), then yes, I 
can see where we would then need pg_unicode. But I think that pg_unicode should 
have a default value based on the setting of pg_encoding, and if pg_encoding is 
not set, it should respect the client encoding setting.

> Yeah: I'm not keen on checking the client_encoding every single time we 
> get a resultset back from the server, no matter how cheap the result. 
> As David W implies, people should use the encoding interface or suffer 
> the consequences.

Word, yo.

>> + - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
>> +   do nothing; the underlying (char*) will already be in utf-8 data.
> 
>> Maybe. utf8 ne UTF-8, quite.
> 
> Right, but it is the best we can do.

Well, no, it's not. We can encode it with Perl's API for encoding strings. 
Internally it might do nothing, but we should use that API if it's there.
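
Something along these lines is what I have in mind, shown at the Perl level
rather than in the driver's C code. The helper name is made up; Encode::encode
and its CHECK constants are the real core API. This is only a sketch of the
idea, not a proposal for the actual implementation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a Perl character string into strict UTF-8 bytes.
sub to_strict_utf8 {
    my ($string) = @_;

    # 'UTF-8' (with the hyphen) is the strict encoding; 'utf8' is Perl's
    # lax internal form. FB_CROAK dies on characters that cannot be encoded,
    # and LEAVE_SRC keeps encode() from modifying $string in place.
    return encode('UTF-8', $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $bytes = to_strict_utf8("caf\x{e9}");    # five bytes: c a f 0xC3 0xA9
```

For well-formed input the resulting bytes match what is already in the SV's
buffer when the UTF8 flag is on, but going through the API means malformed
data gets caught instead of passed along.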

>> +  - treat as latin-1/perl raw.  This may be a good default choice,
>> +    but I'm not 100% convinced; in any case we would need to
>> +    convert from raw to utf-8 using utf8::upgrade.
> 
>> I think this is basically what Perl assumes, so it's probably pretty 
>> safe. It would also be the reasonable thing to do if pg_encoding 
>> is set to something other than utf-8: you assume the user knows what 
>> she's doing and passing things in the proper encoding.
> 
> Agree with the first, but not with the second: once the user sets
> pg_encoding, we stop messing with their data, both incoming and outgoing,
> in the expectation that they have entered expert mode and want to handle
> things themselves.

I disagree. I think the value of pg_encoding should be respected and things 
encoded and decoded appropriately (unless it's SQL_ASCII or pg_unicode is off).
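
Roughly, and again only as a Perl-level sketch with made-up names: I am
assuming an options hash whose pg_encoding value is a name the Encode module
understands (mapping PostgreSQL names such as LATIN1 to Encode names is left
out) and whose pg_unicode value is the on/off flag.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical: convert only when pg_unicode is on and the encoding is
# something other than SQL_ASCII.
sub wants_conversion {
    my ($opt) = @_;
    return 0 unless $opt->{pg_unicode};
    return 0 if uc($opt->{pg_encoding}) eq 'SQL_ASCII';
    return 1;
}

# Outgoing: Perl characters -> bytes in the client encoding.
sub to_server {
    my ($opt, $string) = @_;
    return $string unless wants_conversion($opt);
    return encode($opt->{pg_encoding}, $string);
}

# Incoming: bytes from the server -> Perl characters.
sub from_server {
    my ($opt, $bytes) = @_;
    return $bytes unless wants_conversion($opt);
    return decode($opt->{pg_encoding}, $bytes);
}
```

So a user who sets pg_encoding to, say, iso-8859-1 still gets character
strings back, and only SQL_ASCII or pg_unicode off means hands off.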

> Or at the very least, we have separate flags for incoming and outgoing 
> tweaking.

Oy. Let's not go there yet.

>> +       a) switch client_encoding for query to the original
>> +          client_encoding, while somehow still retaining the utf-8
>> +          client encoding for result set retrieval, or,
> 
> I can't see this one working out.
> 
>> +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
>> +retrieved/created SV which does not require it;
> 
> GSM feels just as strongly we should set it on everything.

I agree.

Best,

David
