David & Greg,

Apologies for the delayed reply here. I wanted a chance to really read
through this stuff carefully.

On Jun 28, 2011, at 3:31 PM, [email protected] wrote:

> Committed by David Christensen <[email protected]>
>
> Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
>
> ---
> TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 161 insertions(+), 0 deletions(-)
>
> diff --git a/TODO.utf8 b/TODO.utf8
> new file mode 100644
> index 0000000..5260bac
> --- /dev/null
> +++ b/TODO.utf8
> @@ -0,0 +1,161 @@
> +Summary of design changes from discussions with GSM and DWC re: utf-8 in
> DBD::Pg
> +================================================================================
> +
> +Behavior of the pg_unicode/pg_utf8_strings connection attribute
> +---------------------------------------------------------------
> +We will utilize a connect attribute (enabled by default) to enable the
> +use of an immediate SET client_encoding. The current name of this is
> +"pg_utf8_strings", but DWC prefers something non-encoding specific;
> +examples wanted, but "pg_unicode" or "pg_internal" seem best.

pg_decode_strings. Or pg_encode_strings, depending on how you look at it.

> +If the "pg_internal" attribute is explicitly provided in the DBI
> +connect attributes it will be one of (0, 1), to enable/disable the
> +pg_internal behavior explicitly. If not provided, we check the
> +initial "server_encoding" and "client_encoding" settings.
> +
> +The logic for setting "pg_internal" when unspecified is:
> +
> +  - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
> +
> +  - If "client_encoding" <> "server_encoding", or perhaps better yet if
> +    the pg_setting("client_encoding") returns a different value than
> +    the default version for that setting, then we assuming that the
> +    client encoding choice is *explicit* and the user will be wanting
> +    to get raw octets back from DBI, thus set pg_internal to 0.

I find this description confusing. What is the default value for that
setting? I mean, how can one know that?
Assuming one can, I suggest alternate phrasing:

  - If "client_encoding" is not set to its default value, DBD::Pg
    assumes that the choice is explicit, so pg_internal is false.

> +  - Otherwise set pg_internal to 1.

But we strongly recommend you set it explicitly to avoid confusion. And
really, setting it to 1 is strongly recommended for proper and
transparent handling of multibyte characters.

> +
> +Immediately after the connection initialization completes, we will
> +check for the set pg_internal flag; if set, we issue a "SET
> +client_encoding TO 'utf-8'" and commit.

Sounds sensible.

> +
> +
> +Proposal for an "encoding" DBD attribute interface
> +--------------------------------------------------
> +
> +DWC suggested a DBD::db attribute handle, suggested to be called
> +"encoding" which when set would effectively pass-thru to the
> +underlying: "SET client_encoding = $blah" and *disable* the
> +pg_internal flag. Specifically, by setting the encoding attribute,
> +you are effectively indicating that you want the data from PostgreSQL
> +back

I like this *so* much better.

> +
> +If such a mechanism *was* instituted, we could utilize `pg_encoding =>
> +'blah'` as the connection-level attribute and just tie the underlying
> +implementation of the pg_internal mechanism to this, by having a
> +keyword ('internal') as the special-case encoding, which could be
> +enabled/disabled via $dbh->{pg_encoding} = 'internal';

WTF is internal? Seems to me that with pg_encoding you don't need
pg_internal at all. You just have a default value for pg_encoding, which
would be:

* If "client_encoding" is not set to its default value, DBD::Pg assumes
  that the choice is explicit, so use that.
* Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
* Else use "utf-8".

> +
> +This would allow us to pass-through utf-8 *without* setting the SvUTF8
> +flag by setting $dbh->{pg_encoding} = 'utf-8'.

+1. And the fewer of these options the better, IMHO.
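
For what it's worth, the three default rules I listed for pg_encoding
could be sketched roughly like this. The helper name and its arguments
are made up for illustration and are not DBD::Pg API, and it assumes the
caller has somehow obtained the default for client_encoding, which is
the open question I raised above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DBD::Pg): choose the default value for
# the proposed pg_encoding attribute. All three arguments are assumed to
# have been fetched by the caller; how to learn client_encoding's default
# is deliberately left open.
sub default_pg_encoding {
    my (%arg) = @_;

    # Rule 1: a non-default client_encoding is taken as an explicit choice.
    return $arg{client_encoding}
        if lc($arg{client_encoding}) ne lc($arg{client_encoding_default});

    # Rule 2: an SQL_ASCII server stays SQL_ASCII.
    return 'SQL_ASCII' if uc($arg{server_encoding}) eq 'SQL_ASCII';

    # Rule 3: everything else defaults to utf-8.
    return 'utf-8';
}

print default_pg_encoding(
    client_encoding         => 'LATIN1',
    client_encoding_default => 'UTF8',
    server_encoding         => 'UTF8',
), "\n";    # prints LATIN1
```

With the SQL_ASCII case folded into the same attribute, I don't see
anything left for a separate pg_internal flag to do.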

> +Behavior changes if pg_internal is set
> +--------------------------------------

Or if pg_encoding eq 'utf-8'.

> +There will be two distinct changes that need to take place,
> +specifically input and output.
> +
> +When processing the result sets returned by the server, if pg_internal
> +is set, we can either fiat that the "client_encoding" is set to UTF-8
> +as it was originally when we switched it on connection, or verify that
> +the libpq's result set charset/encoding is equal to UTF-8. I believe
> +this is available as an int, which could be cached when we do the
> +original "SET client_encoding" and/or initial setup tests, which
> +should prevent accidental corruption.

Or just strongly recommend that if you want to change it, set
pg_encoding instead of executing SET CLIENT_ENCODING.

> +  - if we decide to go this route and detect the charset change, we can
> +    issue a notice/warning from DBD::Pg that the client_encoding has
> +    changed and then turn off the pg_internal flag.

But only if pg_internal was not explicitly set by the user, right?

> +  - if everything checks out, we use the usual dequote_* methods and
> +    set the SvUTF8 flag on either text-based bytes, or set only on the
> +    ASCII datums.
> +
> +  - a possible option to benchmark would be to directly use the
> +    "utf8::upgrade" method from the perl internals (or some Sv-creation
> +    method based on (char*)) to take advantage of any perl-specific
> +    enhancements already in place. This may be just as fast since perl
> +    already needs to copy the (char*) contents into the SV, and may
> +    already have fast-tracked code-paths for this type of operation,
> +    since we know the data will be valid UTF8.
> +
> +When processing data coming *in* from the user i.e., (SV*) we consider
> +the following:
> +
> +  - if pg_internal is 0, pass through the normal methods unabashed.
> +
> +  - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
> +    do nothing; the underlying (char*) will already be in utf-8 data.

Maybe. utf8 ne UTF-8, quite.

> +  - if pg_internal is 1 and incoming SV's UTF8 flag is 0, we need
> +    special consideration for hi-bit characters; since we've
> +    effectively co-opted the expected client_encoding and forced UTF8,
> +    we need to treat the raw data as octets. We have a couple choices:
> +
> +    - treat as latin-1/perl raw. This may be a good default choice,
> +      but I'm not 100% convinced; in any case we would need to
> +      convert from raw to utf-8 using utf8::upgrade.

I think this is basically what Perl assumes, so it's probably pretty
safe. It would also be the reasonable thing to do if pg_encoding is set
to something other than utf-8: you assume the user knows what she's
doing and passing things in the proper encoding.

> +
> +    - treat as original client_encoding. This may be the least
> +      changed expectation as far as the user is concerned, but
> +      requires us to either:
> +
> +      a) switch client_encoding for query to the original
> +         client_encoding, while somehow still retaining the utf-8
> +         client encoding for result set retrieval, or,
> +
> +      b) actually use Encode to transcode from the original
> +         client_encoding to UTF8. I think GSM is particularly
> +         against bringing Encode into the picture just due to
> +         additional complexity issues.

To me, this is just more reason to use pg_encoding and not have
pg_internal at all.

> +
> +Implementation considerations/ideas
> +-----------------------------------
> +
> +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
> +retrieved/created SV which does not require it;

Why?

> as such, an operation
> +that can quickly check whether there are any hi-bit characters in a
> +given (char*) would need to be weighed against the possible
> +inconvenience of *always* setting the SvUTF8 flag on eligible strings,
> +regardless of whether it is full ASCII.

Yeah, needs benchmarking. And if it's slow and you still want it, maybe
give us a knob to turn it off.

Best,

David
