Re: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

David E. Wheeler Thu, 07 Jul 2011 22:39:28 -0700

David & Greg,

Apologies for the delayed reply here. I wanted a chance to really read through 
this stuff carefully.


On Jun 28, 2011, at 3:31 PM, [email protected] wrote:

> Committed by David Christensen <[email protected]>
> 
> Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
> 
> ---
> TODO.utf8 |  161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 161 insertions(+), 0 deletions(-)
> 
> diff --git a/TODO.utf8 b/TODO.utf8
> new file mode 100644
> index 0000000..5260bac
> --- /dev/null
> +++ b/TODO.utf8
> @@ -0,0 +1,161 @@
> +Summary of design changes from discussions with GSM and DWC re: utf-8 in 
> DBD::Pg
> +================================================================================
> +
> +Behavior of the pg_unicode/pg_utf8_strings connection attribute
> +---------------------------------------------------------------
> +We will utilize a connect attribute (enabled by default) to enable the
> +use of an immediate SET client_encoding.  The current name of this is
> +"pg_utf8_strings", but DWC prefers something non-encoding specific;
> +examples wanted, but "pg_unicode" or "pg_internal" seem best.

pg_decode_strings. Or pg_encode_strings, depending on how you look at it.

> +If the "pg_internal" attribute is explicitly provided in the DBI
> +connect attributes it will be one of (0, 1), to enable/disable the
> +pg_internal behavior explicitly.  If not provided, we check the
> +initial "server_encoding" and "client_encoding" settings.
> +
> +The logic for setting "pg_internal" when unspecified is:
> +
> + - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
> +
> + - If "client_encoding" <> "server_encoding", or perhaps better yet if
> +   the pg_setting("client_encoding") returns a different value than
> +   the default version for that setting, then we assuming that the
> +   client encoding choice is *explicit* and the user will be wanting
> +   to get raw octets back from DBI, thus set pg_internal to 0.

I find this description confusing. What is the default value for that setting? 
I mean, how can one know that?

Assuming one can, I suggest alternate phrasing:

- If "client_encoding" is not set to its default value, DBD::Pg assumes that 
the choice is explicit, so pg_internal is false.

> + - Otherwise set pg_internal to 1.

But we strongly recommend you set it explicitly to avoid confusion. And really, 
setting it to 1 is strongly recommended for proper and transparent handling of 
multibyte characters.

> +
> +Immediately after the connection initialization completes, we will
> +check for the set pg_internal flag; if set, we issue a "SET
> +client_encoding TO 'utf-8'" and commit.

Sounds sensible.

> +
> +
> +Proposal for an "encoding" DBD attribute interface
> +--------------------------------------------------
> +
> +DWC suggested a DBD::db attribute handle, suggested to be called
> +"encoding" which when set would effectively pass-thru to the
> +underlying: "SET client_encoding = $blah" and *disable* the
> +pg_internal flag.  Specifically, by setting the encoding attribute,
> +you are effectively indicating that you want the data from PostgreSQL
> +back

I like this *so* much better.

> +
> +If such a mechanism *was* instituted, we could utilize `pg_encoding =>
> +'blah'` as the connection-level attribute and just tie the underlying
> +implementation of the pg_internal mechanism to this, by having a
> +keyword ('internal') as the special-case encoding, which could be
> +enabled/disabled via $dbh->{pg_encoding} = 'internal';

WTF is internal?

Seems to me that with pg_encoding you don't need pg_internal at all. You just 
have a default value for pg_encoding, which would be:

* If "client_encoding" is not set to its default value, DBD::Pg assumes that 
the choice is explicit, so use that.
* Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
* Else use "utf-8".

> +
> +This would allow us to pass-through utf-8 *without* setting the SvUTF8
> +flag by setting $dbh->{pg_encoding} = 'utf-8'.

+1. And the fewer of these options the better, IMHO.

> +Behavior changes if pg_internal is set
> +--------------------------------------

Or if pg_encoding eq 'utf-8'.

> +There will be two distinct changes that need to take place,
> +specifically input and output.
> +
> +When processing the result sets returned by the server, if pg_internal
> +is set, we can either fiat that the "client_encoding" is set to UTF-8
> +as it was originally when we switched it on connection, or verify that
> +the libpq's result set charset/encoding is equal to UTF-8.  I believe
> +this is available as an int, which could be cached when we do the
> +original "SET client_encoding" and/or initial setup tests, which
> +should prevent accidental corruption.

Or just strongly recommend that if you want to change it, set pg_encoding 
instead of executing SET CLIENT_ENCODING.

> + - if we decide to go this route and detect the charset change, we can
> +   issue a notice/warning from DBD::Pg that the client_encoding has
> +   changed and then turn off the pg_internal flag.

But only if pg_internal was not explicitly set by the user, right?

> + - if everything checks out, we use the usual dequote_* methods and
> +   set the SvUTF8 flag on either text-based bytes, or set only on the
> +   ASCII datums.
> +
> + - a possible option to benchmark would be to directly use the
> +   "utf8::upgrade" method from the perl internals (or some Sv-creation
> +   method based on (char*)) to take advantage of any perl-specific
> +   enhancements already in place.  This may be just as fast since perl
> +   already needs to copy the (char*) contents into the SV, and may
> +   already have fast-tracked code-paths for this type of operation,
> +   since we know the data will be valid UTF8.
> +
> +When processing data coming *in* from the user i.e., (SV*) we consider
> +the following:
> +
> + - if pg_internal is 0, pass through the normal methods unabashed.
> +
> + - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
> +   do nothing; the underlying (char*) will already be in utf-8 data.

Maybe. utf8 ne UTF-8, quite.

> + - if pg_internal is 1 and incoming SV's UTF8 flag is 0, we need
> +   special consideration for hi-bit characters; since we've
> +   effectively co-opted the expected client_encoding and forced UTF8,
> +   we need to treat the raw data as octets.  We have a couple choices:
> +
> +     - treat as latin-1/perl raw.  This may be a good default choice,
> +       but I'm not 100% convinced; in any case we would need to
> +       convert from raw to utf-8 using utf8::upgrade.

I think this is basically what Perl assumes, so it's probably pretty safe. It 
would also be the reasonable thing to do if pg_encoding is set to something 
other than utf-8: you assume the user knows what she's doing and passing things 
in the proper encoding.

> +
> +     - treat as original client_encoding.  This may be the least
> +       changed expectation as far as the user is concerned, but
> +       requires us to either:
> +
> +       a) switch client_encoding for query to the original
> +          client_encoding, while somehow still retaining the utf-8
> +          client encoding for result set retrieval, or,
> +
> +       b) actually use Encode to transcode from the original
> +          client_encoding to UTF8.  I think GSM is particularly
> +          against bringing Encode into the picture just due to
> +          additional complexity issues.

To me, this is just more reason to use pg_encoding and not have pg_internal at 
all.

> +
> +Implementation considerations/ideas
> +-----------------------------------
> +
> +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
> +retrieved/created SV which does not require it;

Why?

> as such, an operation
> +that can quickly check whether there are any hi-bit characters in a
> +given (char*) would need to be weighed against the possible
> +inconvenience of *always* setting the SvUTF8 flag on eligible strings,
> +regardless of whether it is full ASCII.

Yeah, needs benchmarking. And if it's slow and you still want it, maybe give us 
a knob to turn it off.

Best,

David

Re: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

Reply via email to