Re: What to do with UTF-8 data?
Hi, Steve, [...] The problem is: How do I trap all input/output to/from DBI to do these conversions? [...] I've asked about this on the dbi-users mailing list, and the answer (from Tim Bunce, no less) was that it is really the responsibility of the DBD driver to perform such conversions if the data in question is UTF-8. after letting my thoughts settle I come to the conclusion that I do not agree completely. I think that DBI should do 80% of the job and leave about 20% to the driver authors. It is right, that the driver author possibly knows how to detect the encoding of columns in the database. However, this is *only* possibly, because doing a SHOW COLUMNS FROM mytable with any prepare statement is not an option, IMO. There has to be at least a possibility to say this column is encoded in ISO-8859-1, but please be so kind to convert into the UTF-8 which we are using in nowadays Perl. Whether MySQL 4.1 does return such flag or not, doesn't matter: The driver still has to work with elder versions, where encoding matters quite the same. IMO a good approach would go like follows: - DBI's got to know about encodings. It is the task of the generic layer to decide on the character encoding of the input data, aka SQL statements and placeholder values. In short: DBI must decide on What is input data and how is it encoded? - DBI should also know about the *desired* encodings of input data. Of course, it would be nice, if driver authors can provide the ability to determine these value automatically. However, IMO the DBI user should be able to override. Also, we should not forget that people may use binary columns to store text data, in which case the driver author will never be able to give suitable information. - Based on the above facts, DBI would be able to decide that we need to convert from encoding1 to encoding2. If they are the same, DBI could suppress the following. Otherwise DBI could invoke a method doing the conversion. That method ought to be overridable by the driver author. And it ought to be implementable in Perl or C. - Likewise for the output. In short: Let DBI decide when to convert. And give it suitable hooks that allow the driver authors to provide information. Jochen
Re: What to do with UTF-8 data?
On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote: Hi, Steve, The problem is: How do I trap all input/output to/from DBI to do these conversions? I've asked about this on the dbi-users mailing list, and the answer (from Tim Bunce, no less) was that it is really the responsibility of the DBD driver to perform such conversions if the data in question is UTF-8. That's not quite right. I wasn't talking about any _conversions_ at all. after letting my thoughts settle I come to the conclusion that I do not agree completely. I think that DBI should do 80% of the job and leave about 20% to the driver authors. For a full solution yes, I agree - and I've written about this in the past. For now I'm just talking about the specific but fairly common situation of fetching data that is utf8 encoded but it doesn't get flagged as such by the driver. For that case the driver just needs to know when to do a SvUTF8_on(sv). Tim.
Re: What to do with UTF-8 data?
Tim Bunce wrote: On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote: Hi, Steve, The problem is: How do I trap all input/output to/from DBI to do these conversions? I've asked about this on the dbi-users mailing list, and the answer (from Tim Bunce, no less) was that it is really the responsibility of the DBD driver to perform such conversions if the data in question is UTF-8. That's not quite right. I wasn't talking about any _conversions_ at all. I'm sorry if I mis-quoted you. I meant setting the UTF-8 flag on an octet sequence that can be interpreted as UTF-8, rather than leaving it unflagged and treated as Latin-1. Thus, the data is in some sense converted from Latin-1 to UTF-8. after letting my thoughts settle I come to the conclusion that I do not agree completely. I think that DBI should do 80% of the job and leave about 20% to the driver authors. For a full solution yes, I agree - and I've written about this in the past. For now I'm just talking about the specific but fairly common situation of fetching data that is utf8 encoded but it doesn't get flagged as such by the driver. For that case the driver just needs to know when to do a SvUTF8_on(sv). Exactly. What about data going _into_ the database? In my examples of doing the conversion manually with Encode::{en|de}code_utf8(), I was converting the Perl strings to octet sequences that could later be interpreted as UTF-8 before insertion into the database. That way I could guarantee that all data retrieved from the database can be converted to UTF-8, in fact (as you pointed out) by simply turning the UTF-8 flag on. If all the data that I insert really is UTF-8 then I guess it will just get serialised as a sequence of octets, and everything will be OK. But what if the data I'm inserting isn't all UTF-8? The problem is: 1. Perl's internal format isn't just UTF-8 -- it defaults to Latin-1 (or whatever) for strings in which every character can be represented in Latin-1; 2. The 8-bit characters of Latin-1 are represented as two-byte characters in UTF-8. So, if I have the string Copyright © Fred Bloggs in Perl then it will not be UTF-8: the © is stored as one byte, not two not, and the UTF-8 flag is off. If I insert that straight into the database without running it through Encode::encode_utf8() first, then © itself, rather than its two-byte UTF-8 representation gets stored in the database, so when it gets retrieved from the database later you can't just turn the UTF-8 flag on -- you would need to run it through Encode::decode_utf8(). In other words, just having the driver switching the UTF-8 on and off will only work if I guarantee that all the strings I feed it to start with really are UTF-8, even when Perl would not normally have represented them as such. It would be cool if something akin to binmode STDOUT, ':utf8'; could be applied when sending data to the driver -- i.e. my data is in Perl's internal format, whether that be Latin-1 or UTF-8 in the case of the string at hand, and it all gets automagically upgraded to UTF-8 if necessary before insertion into the database. Then you only need to turn the flag on when retrieving it again. At least, I think that's what I want :-s - Steve
Re: What to do with UTF-8 data?
On Thu, 11 Sep 2003 12:31:52 +0100, Steve Hay wrote: It would be cool if something akin to binmode STDOUT, ':utf8'; could be applied when sending data to the driver -- i.e. my data is in Perl's internal format, whether that be Latin-1 or UTF-8 in the case of the string at hand, and it all gets automagically upgraded to UTF-8 if necessary before insertion into the database. Oh that's easy to achieve. Just concatenate the string with an UTF-8 string, and you'll get an UTF-8 string. Perl will do the upgrading for you. Just try it: $zero_length_utf8 = pack U0; # UTF8, length ==0 $string = élève; # Latin-1 $string .= $zero_length_utf8; # upgrade to UTF8 print $string; Now the reverse is much harder... :) -- Bart.
Re: What to do with UTF-8 data?
Chuck Fox wrote: Steve, I am a Sybase DBA, but in situations like this, I have declared the column on the table to be varbinary or binary and stored the data directly without conversion. Don't know if MySql supports this datatype. One can declare a column VARCHAR(n) BINARY for a similar thing to Sybase's VARBINARY(n), but unfortunately it makes no difference. If I store the data directly without conversion into such a column then, as far as I can make out, each character's bytes get stored exactly as if I had converted to bytes beforehand. And when I retrieve the data (again without conversion) I just get octet sequences into my Perl scalars - not flagged, UTF-8 character strings as I would like. This is all exactly the same as what I get with a VARCHAR(n) column. - Steve
Re: What to do with UTF-8 data?
On 2003-09-10 08:33:03 +0100, Steve Hay wrote: Chuck Fox wrote: I am a Sybase DBA, but in situations like this, I have declared the column on the table to be varbinary or binary and stored the data directly without conversion. Don't know if MySql supports this datatype. One can declare a column VARCHAR(n) BINARY for a similar thing to Sybase's VARBINARY(n), but unfortunately it makes no difference. If I store the data directly without conversion into such a column then, as far as I can make out, each character's bytes get stored exactly as if I had converted to bytes beforehand. And when I retrieve the data (again without conversion) I just get octet sequences into my Perl scalars - not flagged, UTF-8 character strings as I would like. This is all exactly the same as what I get with a VARCHAR(n) column. Yes, but the difference is that with a binary column the database knows that the data is not character data in a charset it knows. For example the MySQL manual states: Values in CHAR and VARCHAR columns are sorted and compared in case-insensitive fashion, unless the BINARY attribute was specified when the table was created. This doesn't work with UTF-8 data of course, because the individual bytes of a multi-byte-character are not whole characters and hence don't have a uppercase or lowercase equivalent. So two strings which should compare equal generally won't, and sometimes strings which should not compare equal will. If you just tell the database this is binary data, not character data it won't try to do case conversions on it and it (hopefully) will stop you doing them in SQL code (You can and must do them in perl). hp Disclaimer: I haven't actually used MySQL for some time, this is general advice, not targeted at the specific way MySQL compares strings. -- _ | Peter J. Holzer | Unser Universum wäre betrüblich |_|_) | Sysadmin WSR / LUGA | unbedeutend, hätte es nicht jeder | | | [EMAIL PROTECTED]| Generation neue Probleme bereit. __/ | http://www.hjp.at/ | -- Seneca, naturales quaestiones pgp0.pgp Description: PGP signature
Re: What to do with UTF-8 data?
On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote: And when I retrieve the data (again without conversion) I just get octet sequences into my Perl scalars - not flagged, UTF-8 character strings as I would like. If you're *sure* that this is UTF-8, only perl doesn't flag it as such, you can set the flag yourself. In perl 5.8.x, you can use the Encode module, one of the functions documented near the bottom, here the _utf8_on(STRING). That's an inplace modifying function, so you use it like _utf8_on($string_that_should_be_utf8); For perl 5.6.x (and likely for 5.8 too), you can achieve the same effect by using pack() this way: $flagged_as_utf8 = pack U0a*, $string_that_should_be_utf8; Earlier perls than 5.6 don't have this UTF8 flag, nor do they accept the U template. -- Bart.
Re: What to do with UTF-8 data?
Bart Lateur wrote: On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote: And when I retrieve the data (again without conversion) I just get octet sequences into my Perl scalars - not flagged, UTF-8 character strings as I would like. If you're *sure* that this is UTF-8, only perl doesn't flag it as such, you can set the flag yourself. In perl 5.8.x, you can use the Encode module, one of the functions documented near the bottom, here the _utf8_on(STRING). That's an inplace modifying function, so you use it like _utf8_on($string_that_should_be_utf8); Yep, that's an alternative to the $now_is_utf8 = Encode::decode_utf8($should_be_utf8) call that I described in my original posting. But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? - Steve
Re: What to do with UTF-8 data?
On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? Well, there are two options... either does the dtabase somewhere stores the flag indicating that some string is in UTF8, or you have to add that information yourself. For the latter, I don't know if it'll actually work, but it seems like an appropriate way to do it: add a BOM marker at the start of the string. http://www.unicode.org/unicode/faq/utf_bom.html#22 (and below) -- Bart.
Re: What to do with UTF-8 data?
Bart Lateur wrote: On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? Well, there are two options... either does the dtabase somewhere stores the flag indicating that some string is in UTF8, or you have to add that information yourself. For the latter, I don't know if it'll actually work, but it seems like an appropriate way to do it: add a BOM marker at the start of the string. I don't think the MySQL 3.x stores any flag to indicate that a string is UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql that would make use of it, e.g. to decode data flagged in such a way into Perl's internal format. Adding a BOM myself to the string seems to have problems of its own (see http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not aware of DBI / DBD-mysql having anything in them that would make use of such a BOM. Please correct me if I'm wrong - that could be just the sort of thing that I'm looking for here. - Steve
Re: What to do with UTF-8 data?
On 2003-09-10 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? You could subclass DBI or DBD::MySQL and replace all methods with wrappers which perform the conversion. I'm not convinced that this is a good idea, though. I'd rather try to the the conversion in application specific layer above DBI. hp -- _ | Peter J. Holzer | Unser Universum wäre betrüblich |_|_) | Sysadmin WSR / LUGA | unbedeutend, hätte es nicht jeder | | | [EMAIL PROTECTED]| Generation neue Probleme bereit. __/ | http://www.hjp.at/ | -- Seneca, naturales quaestiones pgp0.pgp Description: PGP signature
Re: What to do with UTF-8 data?
On 2003-09-10 12:14:25 +0200, Bart Lateur wrote: On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? Well, there are two options... either does the dtabase somewhere stores the flag indicating that some string is in UTF8, or you have to add that information yourself. For the latter, I don't know if it'll actually work, but it seems like an appropriate way to do it: add a BOM marker at the start of the string. That doesn't help Steve. He already knows that the data is UTF-8, he doesn't need the marker to distinguish between UTF-8 and Latin-X. His problem is that when he selects from the database, he has to manually convert from utf-8 to perl-internal: while (my ($foo, $bar) = $sth-fetchrow_array()) { $foo = decode_utf8($foo); $bar = decode_utf8($bar); # do something with foo and bar } and he wants to happen the decode step automatically. Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL 3.23 to 4.1? hp -- _ | Peter J. Holzer | Unser Universum wäre betrüblich |_|_) | Sysadmin WSR / LUGA | unbedeutend, hätte es nicht jeder | | | [EMAIL PROTECTED]| Generation neue Probleme bereit. __/ | http://www.hjp.at/ | -- Seneca, naturales quaestiones pgp0.pgp Description: PGP signature
Re: What to do with UTF-8 data?
On Wed, Sep 10, 2003 at 11:42:23AM +0100, Steve Hay wrote: Bart Lateur wrote: On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? Well, there are two options... either does the dtabase somewhere stores the flag indicating that some string is in UTF8, or you have to add that information yourself. For the latter, I don't know if it'll actually work, but it seems like an appropriate way to do it: add a BOM marker at the start of the string. I don't think the MySQL 3.x stores any flag to indicate that a string is UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql that would make use of it, e.g. to decode data flagged in such a way into Perl's internal format. Adding a BOM myself to the string seems to have problems of its own (see http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not aware of DBI / DBD-mysql having anything in them that would make use of such a BOM. Please correct me if I'm wrong - that could be just the sort of thing that I'm looking for here. Basically it should be the job of the drivers to set the uft8 flag on data being retrieved if it is utf8. I believe that the new mysql v4.1 protocol does provide information about the characterset of each colum. DBD::mysql can use that. For people stuck with older versions of mysql, a driver private option could be used to indicate that all char fields are utf8, or have some way of indicating that per-column, such as $sth-bind_col(1, undef, { mysql_charset = 'utf8' }); Tim.
Re: What to do with UTF-8 data?
Peter J. Holzer wrote: Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL 3.23 to 4.1? I might be able to upgrade to 4.0 (I fact, I really ought to...), but I don't fancy 4.1 just yet -- it's still an alpha release :-( - Steve
Re: What to do with UTF-8 data?
Peter J. Holzer wrote: On 2003-09-10 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? You could subclass DBI or DBD::MySQL and replace all methods with wrappers which perform the conversion. I'm not convinced that this is a good idea, though. I'd rather try to the the conversion in application specific layer above DBI. Sub-classing DBI wouldn't help me since I'm using Class::DBI which won't know to use my DBI sub-class. Sub-classing DBD-mysql might be more feasible, though, since I specify what driver to use. Actually, I had hoped to find some appropriate hooks in Class::DBI, but I don't see any. There are select and before_set triggers, but the latter are per-column, which would be a real pain to set up. - Steve
Re: What to do with UTF-8 data?
Tim Bunce wrote: On Wed, Sep 10, 2003 at 11:42:23AM +0100, Steve Hay wrote: Bart Lateur wrote: On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote: But the question was: How can I arrange for such conversions to be performed automatically by DBI whenever it receives or returns data? Well, there are two options... either does the dtabase somewhere stores the flag indicating that some string is in UTF8, or you have to add that information yourself. For the latter, I don't know if it'll actually work, but it seems like an appropriate way to do it: add a BOM marker at the start of the string. I don't think the MySQL 3.x stores any flag to indicate that a string is UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql that would make use of it, e.g. to decode data flagged in such a way into Perl's internal format. Adding a BOM myself to the string seems to have problems of its own (see http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not aware of DBI / DBD-mysql having anything in them that would make use of such a BOM. Please correct me if I'm wrong - that could be just the sort of thing that I'm looking for here. Basically it should be the job of the drivers to set the uft8 flag on data being retrieved if it is utf8. I believe that the new mysql v4.1 protocol does provide information about the characterset of each colum. DBD::mysql can use that. Ah. In that case, I should get onto the DBD-mysql people to look for assistance. I was thinking that DBI itself would be adding some kind of UTF-8 support. For people stuck with older versions of mysql, a driver private option could be used to indicate that all char fields are utf8, or have some way of indicating that per-column, such as $sth-bind_col(1, undef, { mysql_charset = 'utf8' }); OK, I'll pass this suggestion on to the DBD-mysql maintainer(s). Thanks, - Steve
Re: What to do with UTF-8 data?
Steve, I am a Sybase DBA, but in situations like this, I have declared the column on the table to be varbinary or binary and stored the data directly without conversion. Don't know if MySql supports this datatype. HTH, Chuck Fox Principal Database Adminstrator America Online, INC. [EMAIL PROTECTED] wrote: Hi, Having looked through the list archives a bit, I see there has been lots of discussion over what to do with UTF-8 data, but I can't find anything that really helps me out now. I'm running Perl 5.8.0 with MySQL 3.23.56 (via DBD-mysql). Since (I think) there is no native UTF-8 support in MySQL below 4.1.x, my plan was to simply store the bytes of each UTF-8 character in the database (either by explicitly converting Perl's UTF-8 strings to sequences of octets using Encode::encode_utf8(), or else just letting that happen of its own accord as it seems to do :-s), and then convert such octet sequences back to UTF-8 strings when retrieving data from the database using Encode::decode_utf8(). As long as I store *all* my data in this way, those conversions should always succeed. (encode_utf8() never fails anyway, and decode_utf8() will always work here because I'm always feeding it valid data.) The problem is: How do I trap all input/output to/from DBI to do these conversions? I can easily do it manually: $dbh-do('INSERT INTO foo (bar) VALUES (?)', undef, Encode::encode_utf8($input_utf8str)); ... my @octets_row = $dbh-selectrow_array('SELECT bar FROM foo'); my $output_utf8str2 = Encode::decode_utf8($octets_row[0]); but that's way too tedious in practice. I want to have those conversions done for me automatically, perhaps by having specified some appropriate encode/decode callbacks for certain hooks. Do any such hooks exist? If, as I suspect, there are not, then what is the currently recommended solution here? Would it be at all easy to add such hooks, or some other means of specifying that I want these conversions done, to DBI as a stop-gap solution? Thanks, - Steve