Re: Add Unicode Support to the DBI

David E. Wheeler Mon, 03 Oct 2011 10:17:16 -0700

On Oct 2, 2011, at 8:49 PM, Greg Sabino Mullane wrote:

> DEW> I assume you also mean to say that data sent *to* the database 
> DEW> has the flag turned off, yes?
> 
> No: that is undefined. I don't see it as the DBDs job to massage data 
> going into the database. Or at least, I cannot imagine a DBI interface 
> for that.


Uh, say what? Just as I need to

   binmode STDOUT, ':utf8';

before sending stuff to STDOUT (that is, turn off the flag), I would expect 
DBDs to do the same before sending data to the database. Unless, of course, it 
"just works".

> DEW> Yeah, maybe should be utf8_flag instead.
> 
> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

Yeah, better, though it just perpetuates Perl's unfortunate use of the term 
"utf8" for "internal string representation." Though I suppose that ship has 
sunk already.

> Yeah, that last one is the current Postgres plan. Which I think should 
> be best practice and a default DBI expectation.

Agreed.

> DEW> DBDs will decode the data as needed.
> DEW> I don't understand this sentence. If the flag is 
> DEW> flipped, why will it decode?
> 
> Because it may still need to convert things. See the ODBC discussion.

Oh, so you're saying it will decode and encode between Perl's internal form and 
UTF-8, rather than just flip the flag on and off?
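A minimal sketch of the distinction, using only the core Encode module and no DBI or DBD code. The sample strings are made up for illustration. Decoding validates and converts the octets, while flipping the flag only relabels the existing buffer, which gives the same result solely when the octets already are valid UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "café", as a driver might receive them from the server.
my $octets = "caf\xc3\xa9";              # 5 bytes

# Real decoding: converts the octets into a 4-character string.
my $decoded = decode('UTF-8', $octets);

# Flag flipping: relabel a copy of the same buffer without checking it.
my $flagged = $octets;
Encode::_utf8_on($flagged);

# For valid UTF-8 input the two approaches agree.
my $same = $decoded eq $flagged;

# For any other encoding only real decoding works, e.g. Latin-1 octets.
my $from_latin1 = decode('ISO-8859-1', "caf\xe9");

# On the way back to the database, encode characters into octets again.
my $round_trip = encode('UTF-8', $decoded);
```

Here `$from_latin1` equals `$decoded` even though the bytes on the wire differ, which is the conversion that flag flipping alone cannot do.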

> GSM>> If this is set off, the utf8 flag will never be set, and no 
> GSM>> decoding will be done on data coming back from the database.
> 
> DEW> What if the data coming back from the database 
> DEW> is Big5 and I want to decode it?
> 
> Eh? You just asked above why would we ever decode it?

Yes, because you were only talking about utf8 and UTF-8, not any other 
encodings. Unless I missed something. If the data coming back from the DB is 
Big5, I may well want to have some way to decode it (and to encode it for write 
statements).
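As a rough illustration of what that means in plain Perl today, here is a sketch using Encode's `big5-eten` encoding (shipped with Encode::TW). It stands in for whatever a driver would do internally; nothing here is an existing DBI interface:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two CJK characters (U+4E2D, U+6587) as a Perl character string.
my $chars = "\x{4e2d}\x{6587}";

# Encode for a write statement against a Big5 database.
my $big5 = encode('big5-eten', $chars);

# Decode what comes back; croak instead of silently substituting characters.
my $back = decode('big5-eten', $big5, Encode::FB_CROAK() | Encode::LEAVE_SRC());
```

Each of the two characters takes two bytes in Big5, so `$big5` is four bytes long and `$back` compares equal to `$chars`.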

> DEW> You mean never allow it to be flipped when the 
> DEW> database encoding is SQL_ASCII?
> 
> Yes, basically. But perhaps it does not matter too much. SQL_ASCII 
> is such a bad idea anyway, I feel no need to coddle people using it. :)

+1

> MJE> So is the problem that sometimes a DBD does not know what to encode data 
> MJE> being sent to the database or how/whether to decode data coming back from 
> MJE> the database? and if that is the case do we need some settings in DBI 
> MJE> to tell a DBD?
> 
> I think that's one of the things that is being argued for, here.

Yes.

> MJE> I think this was my point above, i.e., why utf8? databases accept and 
> MJE> supply a number of encodings so why have a flag called utf8? are we 
> MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more 
> MJE> sense to have a flag where you can set the encoding in the same form 
> MJE> Encode uses.
> 
> Well, because utf-8 is pretty much a de facto encoding, or at least way, way 
> more popular than things like ucs2. Also, the Perl utf8 flag encourages 
> us to put everything into UTF-8.

Yeah, but again, that might be some reason to call it something else, like 
"perl_native" or something. The fact that it happens to be UTF-8 should be 
irrelevant. Er, except, I guess, you still have to know the encoding of the 
database.

> MJE> and what about when the DBD knows you are wrong because the database 
> MJE> says it is returning data in encoding X but you ask for Y.
> 
> I would assume that the DBD should attempt to convert it to Y if that 
> is what the user wants.

And throw exceptions as appropriate (encoding/decoding failure).
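A sketch of what such a failure looks like with Encode alone, assuming the driver would pass a strict CHECK value. The input string is invented: Latin-1 octets handed to a UTF-8 decoder.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Latin-1 octets; the lone \xe9 is not valid UTF-8.
my $octets = "caf\xe9 au lait";

# FB_CROAK makes decode() die on malformed input instead of substituting
# U+FFFD; LEAVE_SRC keeps $octets unmodified.
my $result = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

print "decoding failed: $error" if $error;
```

Without a CHECK argument, `decode` would quietly replace the bad byte, which is the behavior an exception-throwing driver would want to avoid.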

> MJE> (examples of DBD flags)
> 
> Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so 
> perhaps we should start by focusing on that, or at least decoupling that 
> entirely from decoding? If we assume that the default DBI behavior, or more 
> specifically the default behavior for a random DBD someone picks up is 
> "flip the flag on if the data is known to be UTF-8", then we can propose a 
> DBI attribute, call it utf8_flag, that has three states:
> 
> * 'A': the default, it means the DBD should do the best thing, which in most 
> cases means setting SvUTF8_on if the data coming back is UTF-8.
> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned 
> data, even if it thinks it may not be UTF-8.
> * 'C': (off). The DBD should not call SvUTF8_on, regardless of what it 
> thinks the data is.

I still prefer an encoding attribute that you can set as follows:

* undef: Default; same as your A.
* ':utf8': Same as your B
* ':raw': Same as your C
* $encoding: Encode/decode to/from $encoding
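
To make the four cases concrete, here is a rough sketch of how a driver might treat one fetched value. The helper `decode_fetched` and its `$known_utf8` argument are hypothetical, not part of DBI or any DBD, and the ':utf8' branch uses a lenient `decode` as a stand-in for setting SvUTF8_on directly:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, for illustration only.
#   $attr       - value of the proposed encoding attribute
#   $octets     - raw bytes as received from the database
#   $known_utf8 - true if the driver knows the server sent UTF-8
sub decode_fetched {
    my ($attr, $octets, $known_utf8) = @_;

    # undef: do the best thing, i.e. decode only when known to be UTF-8.
    if (!defined $attr) {
        return $known_utf8 ? decode('UTF-8', $octets) : $octets;
    }

    # ':raw': hand the bytes back untouched.
    return $octets if $attr eq ':raw';

    # ':utf8': treat the data as UTF-8 even if the driver is unsure.
    return decode('UTF-8', $octets) if $attr eq ':utf8';

    # Anything else: an Encode encoding name; die on bad data.
    return decode($attr, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $utf8_octets   = "caf\xc3\xa9";
my $latin1_octets = "caf\xe9";

my $default = decode_fetched(undef,        $utf8_octets,   1);
my $raw     = decode_fetched(':raw',       $utf8_octets,   1);
my $forced  = decode_fetched(':utf8',      $utf8_octets,   0);
my $named   = decode_fetched('ISO-8859-1', $latin1_octets, 0);
```

The write path would mirror this with `encode`. The point of the sketch is only that a single attribute covers the three flag states plus arbitrary encodings.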

> I presume the other half would be an encoding, such that
> $h->{encoding} would basically ask the DBD to make any returned 
> data into that encoding, by hook or by crook.

With an encoding attribute, you don't need the utf8_flag at all.

Best,

David
