Re: Add Unicode Support to the DBI

David E. Wheeler Thu, 22 Sep 2011 09:36:39 -0700

On Sep 22, 2011, at 2:26 AM, Martin J. Evans wrote:

> There is more than one way to encode unicode - not everyone uses UTF-8; 
> although some encodings don't support all of unicode.


Yeah, maybe it should be utf8_flag instead.

> unicode is not encoded as UTF-8 in ODBC using the wide APIs.
> 
> Using the wide ODBC APIs returns data in UCS2 encoding and DBD::ODBC decodes 
> it. Using the ANSI APIs data is returned as octets and is whatever it is - it 
> may be ASCII, it may be UTF-8 encoded (only in 2 cases I know and I believe 
> they are flawed anyway) it may be something else in which case the 
> application needs to know what it is. In the case of octets which are UTF-8 
> encoded DBD::ODBC has no idea that is the case unless you tell it and it will 
> then set the UTF-8 flag (but see later).

Right. There needs to be a way to tell the DBI what encoding the server sends 
and expects to be sent. If it's not UTF-8, then the utf8_flag option is kind of 
useless.
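
A rough sketch of what "telling the driver the wire encoding" could mean, in 
Python purely for illustration. WireEncoding and its methods are made-up 
names, not an existing DBI or DBD attribute, and "utf-16-le" is only a 
stand-in for the UCS2 data the wide ODBC APIs return:

```python
from dataclasses import dataclass


@dataclass
class WireEncoding:
    """Hypothetical per-connection setting; not an actual DBI attribute."""

    name: str = "utf-8"

    def from_server(self, octets: bytes) -> str:
        # Decode octets received from the server into a text string.
        return octets.decode(self.name)

    def to_server(self, text: str) -> bytes:
        # Encode a text string into the octets the server expects.
        return text.encode(self.name)


# Assumption: UTF-16LE approximates the UCS2 data of the wide ODBC APIs.
odbc_wide = WireEncoding("utf-16-le")
utf8 = WireEncoding("utf-8")

text = "na\u00efve"
wide_octets = odbc_wide.to_server(text)
utf8_octets = utf8.to_server(text)
```

The same text yields different octets per connection, so a single UTF-8-only 
switch cannot describe both cases.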

> I'm not that familiar with Postgres (I've used a few times and not to any 
> great degree) and I used MySQL for a while years ago. I occasionally use 
> SQLite. I do use DBD::Oracle and DBD::ODBC all the time. I'm still struggling 
> to see the problem that needs fixing. Is it just that some people would like 
> a DBI flag which tells the DBD:
> 
> 1) decode any data coming back from the database strictly such that if it is 
> invalid you die
> 2) decode any data coming back from the database loosely (e.g., utf-8 vs 
> UTF-8)
> 3) don't decode the data from the database at all
> 4) don't decode the data, the DBD knows it is say UTF-8 encoded and simply 
> sets the UTF-8 flag (which from what I read is horribly flawed but seems to 
> work for me).
> 
> and the reverse.

Yes, with one API for all drivers, if possible, and guidelines for how it 
should work (when to encode and decode, what to encode and decode, when to just 
flip the utf8 flag on and off, etc.).
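
To make options 1-3 concrete, here is a minimal sketch, again in Python only 
for illustration. DecodePolicy and fetch_value are hypothetical names, not a 
proposed DBI API. The "surrogatepass" handler is just one possible reading of 
"loose" decoding. Option 4 is left out because it depends on Perl's internal 
utf8 flag, which Python strings do not have:

```python
from enum import Enum


class DecodePolicy(Enum):
    """Hypothetical policy names; none of this is part of the DBI."""

    STRICT = "strict"  # option 1: fail on invalid input
    LOOSE = "loose"    # option 2: tolerate some ill-formed sequences
    RAW = "raw"        # option 3: return the octets untouched


def fetch_value(octets: bytes, policy: DecodePolicy):
    """Apply one decoding policy to UTF-8 octets received from the server."""
    if policy is DecodePolicy.RAW:
        return octets
    if policy is DecodePolicy.STRICT:
        # Raises UnicodeDecodeError on ill-formed input.
        return octets.decode("utf-8", errors="strict")
    # Loose: accept encoded surrogates, which strict UTF-8 rejects.
    return octets.decode("utf-8", errors="surrogatepass")


good = "caf\u00e9".encode("utf-8")
bad = b"\xed\xa0\x80"  # UTF-8-style encoding of the lone surrogate U+D800

strict_text = fetch_value(good, DecodePolicy.STRICT)
loose_text = fetch_value(bad, DecodePolicy.LOOSE)
raw_octets = fetch_value(bad, DecodePolicy.RAW)
```

Under the strict policy the `bad` octets raise an exception instead of 
returning a value, which is the "if it is invalid you die" behaviour.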

> DBD::Oracle does 1 some of the time and it does 4 the rest of the time e.g. 
> error messages are fully decoded from UTF-8 IF Oracle is sending UTF-8 and it 
> does 4 on most of the column data IF Oracle is sending UTF-8.

Yeah, but to enable it *you set a bloody environment variable*. WHAT?

> My point being, doesn't the DBD know how the data is encoded when it gets it 
> from the database? and it would hopefully know what the database needs when 
> sending data. Perhaps in some conditions the DBD does not know this and needs 
> to be told (I could imagine SQLite reading/writing straight to files for 
> instance might want to know to open the file with UTF-8 layer).

Or to turn it off, so you can just pass the encoded UTF-8 through to the file 
without the decode/encode round-trip.
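
A small sketch of the difference, in Python for illustration only. The two 
function names are invented for this example and do not describe how 
DBD::SQLite or any other driver is implemented:

```python
import os
import tempfile


def store_passthrough(path: str, octets: bytes) -> None:
    # Write the already-encoded UTF-8 octets as they are.
    with open(path, "wb") as fh:
        fh.write(octets)


def store_roundtrip(path: str, octets: bytes) -> None:
    # Decode to text, then let a UTF-8 text layer encode it again on write.
    text = octets.decode("utf-8")
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(text)


octets = "d\u00e9j\u00e0 vu\n".encode("utf-8")
tmpdir = tempfile.mkdtemp()
direct_path = os.path.join(tmpdir, "direct.txt")
layered_path = os.path.join(tmpdir, "layered.txt")

store_passthrough(direct_path, octets)
store_roundtrip(layered_path, octets)

with open(direct_path, "rb") as fh:
    direct_octets = fh.read()
with open(layered_path, "rb") as fh:
    layered_octets = fh.read()
```

For valid UTF-8 both files end up with identical octets, so the round-trip 
only adds work. It differs only in that it rejects invalid input.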

> So is the problem that sometimes a DBD does not know what to encode data 
> being sent to the database or how/whether to decode data coming back from the 
> database? and if that is the case do we need some settings in DBI to tell a 
> DBD?

That's an issue, yes, but the main issue is that all the drivers do it 
differently, sometimes with different semantics, and none of them offers all 
the functionality one might want (e.g., your examples 1-4).

Best,

David
