Re: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

David E. Wheeler Sun, 17 Jul 2011 18:16:43 -0700

On Jul 17, 2011, at 11:11 AM, Greg Sabino Mullane wrote:

> Well, it will set it to UTF-8, unless there is a really good reason not to. 
> And the only exceptions are SQL_ASCII and if they went out of their way to 
> set the client encoding themselves, in which case it would be rude of us 
> to change it back on them. :)


Okay, put that way I understand it. I think that should be the introductory 
paragraph, followed by a bulleted list explaining the situations in which it 
would be off.
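
For concreteness, the default Greg describes above could be sketched as a small decision function. This is only an illustration of the rule as stated in this thread: the helper name should_force_utf8 and its two arguments are hypothetical and are not part of DBD::Pg.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT a DBD::Pg function): decide whether the driver
# should switch client_encoding to UTF-8 when connecting.
#   $server_encoding  - the database encoding, e.g. 'UTF8' or 'SQL_ASCII'
#   $user_set_client  - true if the user explicitly chose a client_encoding
sub should_force_utf8 {
    my ($server_encoding, $user_set_client) = @_;

    # Exception 1: SQL_ASCII databases are left alone.
    return 0 if uc($server_encoding) eq 'SQL_ASCII';

    # Exception 2: the user set client_encoding themselves, so
    # changing it back would be rude.
    return 0 if $user_set_client;

    # Otherwise, default to UTF-8.
    return 1;
}

# Walk through the combinations discussed above.
for my $case (['UTF8', 0], ['LATIN1', 0], ['SQL_ASCII', 0], ['LATIN1', 1]) {
    my ($enc, $user_set) = @$case;
    printf "%-9s user_set=%d -> %s\n", $enc, $user_set,
        should_force_utf8($enc, $user_set) ? 'set UTF-8' : 'leave alone';
}
```

The two early returns correspond one-to-one to the entries such a bulleted list would need.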

>>> Better than? This is in addition to the above, to be clear. This is 
>>> basically a shortcut for someone setting pg_unicode false and issuing 
>>> a "SET client_encoding = 'foo'".
> 
>> Unless I set it to "utf8", in which case pg_unicode would be true and 
>> client_encoding would be set to "UTF-8". Right?
> 
> Right. Although in most cases that will be a no-op as those will already 
> be set that way. Although a weak case could be argued that setting it 
> to UTF-8 via the interface should turn pg_unicode *off*, to be consistent.
> But I think that's all the more reason for a separate knob, and one of the 
> reasons I'm only lukewarm to the whole $h->{encoding} thing.

I think that setting pg_encoding should always turn pg_unicode *on*.

>> From the user's perspective, I think it makes much more sense. It says, 
>> "Here is what I want the encoding to be," which is easier to understand 
>> than "Should we or should we not convert the incoming data to Perl's 
>> internal form." Most people won't know WTF that means.
> 
> Yeah, that's true. On the other hand, even the encoding setting is meant 
> as sort of an expert knob.

Maybe. I think a lot of existing installations may find they need to turn it 
off, unless they had been using pg_enable_utf8 before.

>>> We still need a flag to know if we are unicoding or not. We cannot tell 
>>> just from a stored client_encoding.
> 
>> Why not? That's what pg_unicode was figuring out on its own if you didn't 
>> set it.
> 
> Yes, but once we call $h->{encoding}, we need to track both the encoding and 
> the fact that we are decoding or not. Which could be either way. Which raises 
> a point: if we need a way to get things back to "normal" after the user 
> sets $h->{encoding} to something weird, presumably they would then call 
> $h->{encoding} = UTF-8. So perhaps that answers the above: we turn pg_unicode 
> *on* in that case. But it still means that there is no way for someone to 
> want a UTF-8 client_encoding but do NOT want us to decode things. Sigh.

I think that setting pg_encoding should turn on pg_unicode, unless it's set to 
:raw or something. Then someone could always explicitly set both to make it do 
what they mean.
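
A minimal sketch of that proposal follows, using a plain hash in place of a real database handle. The setter name set_pg_encoding is hypothetical, and ':raw' is only the placeholder value mentioned above, not a settled spelling; none of this is existing DBD::Pg behavior.

```perl
use strict;
use warnings;

# Hypothetical model of the proposal (NOT the DBD::Pg implementation):
# setting pg_encoding turns pg_unicode on, unless the encoding is ':raw'.
sub set_pg_encoding {
    my ($h, $encoding) = @_;
    $h->{pg_encoding} = $encoding;
    $h->{pg_unicode}  = $encoding eq ':raw' ? 0 : 1;
    return $h;
}

# Setting an encoding implies decoding.
my $decoded = set_pg_encoding({}, 'UTF-8');

# ':raw' leaves the data untouched.
my $raw = set_pg_encoding({}, ':raw');

# Setting both explicitly covers the case of a UTF-8 client_encoding
# without decoding: the later assignment wins.
my $both = set_pg_encoding({}, 'UTF-8');
$both->{pg_unicode} = 0;

for my $h ($decoded, $raw, $both) {
    printf "pg_encoding=%-6s pg_unicode=%d\n", $h->{pg_encoding}, $h->{pg_unicode};
}
```

Under this model the order of assignment matters: pg_unicode has to be set after pg_encoding, or the setter would overwrite it.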

> (some more of the same arguments trimmed from your reply)

Yeah, sorry. :-)

>>> Or at the very least, we have separate flags for incoming and outgoing 
>>> tweaking.
> 
>> Oy. Let's not go there yet.
> 
> How about now? :) The problem is that people have existing scripts that we 
> don't want to fail, and are trying to shove who-knows-what into the 
> database, so we definitely want to clean up their mess as it comes in, but 
> give them the option not to mess with it in case that is what they need. I 
> think that should be a separate knob from the stuff coming back from the 
> database. To put another way, I'm happy linking the two together for most 
> things but providing an expert knob just in case they need it that can 
> de-couple them.

Oh I agree, I just think it's worth putting off until this other stuff gets 
sorted out.

> I'm trying to make this as bulletproof as possible so that we break as few 
> existing scripts as possible on the first release, and allow as much 
> fine-tuning as needed from the get-go, since we cannot know what will really 
> break or the strange combinations people will want until this is released 
> in the wild.

The truth is, unless we pay attention to what pg_enable_utf8 was set to in such 
scripts -- and if it was set -- then suddenly having stuff be encoded and 
decoded when it wasn't before may surprise some folks. It *shouldn't*, but it 
will be different than what it was doing before.

Have you asked Tim Bunce about any of this stuff? I know he has thought about 
adding encoding knobs to the DBI core, but I don't know how far along he got 
in thinking about a design.

Best,

David
