On Oct 2, 2011, at 8:49 PM, Greg Sabino Mullane wrote:

> DEW> I assume you also mean to say that data sent *to* the database
> DEW> has the flag turned off, yes?
>
> No: that is undefined. I don't see it as the DBD's job to massage data
> going into the database. Or at least, I cannot imagine a DBI interface
> for that.

Uh, say what? Just as I need to

    binmode STDOUT, ':utf8';

before sending stuff to STDOUT (that is, turn off the flag), I would expect DBDs to do the same before sending data to the database. Unless, of course, it "just works".

> DEW> Yeah, maybe should be utf8_flag instead.
>
> Yes, very bad example. Let's call it utf8. Forget 'unicode' entirely.

Yeah, better, though it just perpetuates Perl's unfortunate use of the term "utf8" for "internal string representation." Though I suppose that ship has sunk already.

> Yeah, that last one is the current Postgres plan. Which I think should
> be best practice and a default DBI expectation.

Agreed.

> DEW> DBDs will decode the data as needed.
> DEW> I don't understand this sentence. If the flag is
> DEW> flipped, why will it decode?
>
> Because it may still need to convert things. See the ODBC discussion.

Oh, so you're saying it will decode and encode between Perl's internal form and UTF-8, rather than just flip the flag on and off?

> GSM>> If this is set off, the utf8 flag will never be set, and no
> GSM>> decoding will be done on data coming back from the database.
>
> DEW> What if the data coming back from the database
> DEW> is Big5 and I want to decode it?
>
> Eh? You just asked above why would we ever decode it?

Yes, because you were only talking about utf8 and UTF-8, not any other encodings. Unless I missed something. If the data coming back from the DB is Big5, I may well want to have some way to decode it (and to encode it for write statements).

> DEW> You mean never allow it to be flipped when the
> DEW> database encoding is SQL_ASCII?
>
> Yes, basically. But perhaps it does not matter too much. SQL_ASCII
> is such a bad idea anyway, I feel no need to coddle people using it. :)

+1

> MJE> So is the problem that sometimes a DBD does not know what to encode data
> MJE> being sent to the database or how/whether to decode data coming back from
> MJE> the database?
> MJE> and if that is the case do we need some settings in DBI
> MJE> to tell a DBD?
>
> I think that's one of the things that is being argued for, here.

Yes.

> MJE> I think this was my point above, i.e., why utf8? databases accept and
> MJE> supply a number of encodings so why have a flag called utf8? are we
> MJE> going to have ucs2, utf16, utf32 flags as well. Surely, it makes more
> MJE> sense to have a flag where you can set the encoding in the same form
> MJE> Encode uses.
>
> Well, because utf-8 is pretty much a de facto encoding, or at least way, way
> more popular than things like ucs2. Also, the Perl utf8 flag encourages
> us to put everything into UTF-8.

Yeah, but again, that might be some reason to call it something else, like "perl_native" or something. The fact that it happens to be UTF-8 should be irrelevant. Er, except, I guess, you still have to know the encoding of the database.

> MJE> and what about when the DBD knows you are wrong because the database
> MJE> says it is returning data in encoding X but you ask for Y.
>
> I would assume that the DBD should attempt to convert it to Y if that
> is what the user wants.

And throw exceptions as appropriate (encoding/decoding failure).

> MJE> (examples of DBD flags)
>
> Almost all the examples from DBDs seem to be focusing on the SvUTF8 flag, so
> perhaps we should start by focusing on that, or at least decoupling that
> entirely from decoding? If we assume that the default DBI behavior, or more
> specifically the default behavior for a random DBD someone picks up is
> "flip the flag on if the data is known to be UTF-8", then we can propose a
> DBI attribute, call it utf8_flag, that has three states:
>
> * 'A': the default, it means the DBD should do the best thing, which in most
>   cases means setting SvUTF8_on if the data coming back is UTF-8.
> * 'B': (on). The DBD should make every effort to set SvUTF8_on for returned
>   data, even if it thinks it may not be UTF-8.
> * 'C': (off).
>   The DBD should not call SvUTF8_on, regardless of what it
>   thinks the data is.

I still prefer an encoding attribute that you can set as follows:

* undef: Default; same as your A.
* ':utf8': Same as your B.
* ':raw': Same as your C.
* $encoding: Encode/decode to/from $encoding.

> I presume the other half would be an encoding, such that
> $h->{encoding} would basically ask the DBD to make any returned
> data into that encoding, by hook or by crook.

With an encoding attribute, you don't need the utf8_flag at all.

Best,

David
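
To make the four settings of the encoding attribute concrete, here is a rough sketch of what a DBD might do with each value it fetches. This is only an illustration under stated assumptions: no such DBI attribute exists today, the helper name apply_encoding_attr and its $db_is_utf8 argument are made up for the example, and a real driver would do this work in C rather than in Perl. Only the Encode calls are real.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of DBI or any DBD): apply the proposed
# "encoding" attribute to one value fetched from the database.
#   $bytes      - the octets the database returned (undef for NULL)
#   $attr       - undef, ':utf8', ':raw', or an Encode encoding name
#   $db_is_utf8 - assumed to be what the driver knows about the database
sub apply_encoding_attr {
    my ($bytes, $attr, $db_is_utf8) = @_;
    return undef unless defined $bytes;    # NULL stays undef

    if (!defined $attr) {
        # Default (state 'A'): flip the flag on only when the data is
        # known to be UTF-8.
        Encode::_utf8_on($bytes) if $db_is_utf8;
        return $bytes;
    }
    if ($attr eq ':raw') {
        # State 'C': never touch the flag, hand back the octets.
        return $bytes;
    }
    if ($attr eq ':utf8') {
        # State 'B': flip the flag on without validating the data.
        Encode::_utf8_on($bytes);
        return $bytes;
    }
    # Anything else is treated as an encoding name in the form Encode
    # uses. Malformed input throws an exception instead of being
    # silently replaced.
    return Encode::decode($attr, $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $octets = "caf\xC3\xA9";    # "café" as five UTF-8 octets
for my $attr (':raw', ':utf8', 'UTF-8') {
    printf "%-6s => %d characters\n", $attr,
        length(apply_encoding_attr($octets, $attr));
}
# :raw   => 5 characters
# :utf8  => 4 characters
# UTF-8  => 4 characters
```

The write direction would presumably mirror this with Encode::encode, and the Big5 case from earlier in the thread falls out of the last branch by passing 'big5' as the attribute value.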