Re: What to do with UTF-8 data?

2003-09-11 Thread Jochen Wiedmann


Hi, Steve,

[...]

 The problem is: How do I trap all input/output to/from DBI to do these 
 conversions?

[...]

 I've asked about this on the dbi-users mailing list, and the answer 
 (from Tim Bunce, no less) was that it is really the responsibility of 
 the DBD driver to perform such conversions if the data in question is UTF-8.

after letting my thoughts settle I come to the conclusion that I do not
agree completely. I think that DBI should do 80% of the job and leave
about 20% to the driver authors.

It is right, that the driver author possibly knows how to detect the
encoding of columns in the database. However, this is *only* possibly,
because doing a SHOW COLUMNS FROM mytable with any prepare statement
is not an option, IMO. There has to be at least a possibility to say
this column is encoded in ISO-8859-1, but please be so kind to convert
into the UTF-8 which we are using in nowadays Perl. Whether MySQL 4.1
does return such flag or not, doesn't matter: The driver still has to
work with elder versions, where encoding matters quite the same.

IMO a good approach would go like follows:

- DBI's got to know about encodings. It is the task of the generic
  layer to decide on the character encoding of the input data, aka
  SQL statements and placeholder values.

  In short: DBI must decide on What is input data and how is it
  encoded?

- DBI should also know about the *desired* encodings of input data.
  Of course, it would be nice, if driver authors can provide the
  ability to determine these value automatically. However, IMO the
  DBI user should be able to override. Also, we should not forget that
  people may use binary columns to store text data, in which case the
  driver author will never be able to give suitable information.

- Based on the above facts, DBI would be able to decide that we need
  to convert from encoding1 to encoding2. If they are the same, DBI
  could suppress the following. Otherwise DBI could invoke a method
  doing the conversion. That method ought to be overridable by the
  driver author. And it ought to be implementable in Perl or C.

 - Likewise for the output.


In short: Let DBI decide when to convert. And give it suitable hooks
that allow the driver authors to provide information.


Jochen



Re: What to do with UTF-8 data?

2003-09-11 Thread Tim Bunce
On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:
 Hi, Steve,
 
  The problem is: How do I trap all input/output to/from DBI to do these 
  conversions?
 
  I've asked about this on the dbi-users mailing list, and the answer 
  (from Tim Bunce, no less) was that it is really the responsibility of 
  the DBD driver to perform such conversions if the data in question is UTF-8.

That's not quite right. I wasn't talking about any _conversions_ at all.

 after letting my thoughts settle I come to the conclusion that I do not
 agree completely. I think that DBI should do 80% of the job and leave
 about 20% to the driver authors.

For a full solution yes, I agree - and I've written about this in the past.

For now I'm just talking about the specific but fairly common
situation of fetching data that is utf8 encoded but it doesn't
get flagged as such by the driver.

For that case the driver just needs to know when to do a SvUTF8_on(sv).

Tim.


Re: What to do with UTF-8 data?

2003-09-11 Thread Steve Hay
Tim Bunce wrote:

On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:
 

Hi, Steve,

   

The problem is: How do I trap all input/output to/from DBI to do these 
conversions?
 

I've asked about this on the dbi-users mailing list, and the answer 
(from Tim Bunce, no less) was that it is really the responsibility of 
the DBD driver to perform such conversions if the data in question is UTF-8.
 

That's not quite right. I wasn't talking about any _conversions_ at all.

I'm sorry if I mis-quoted you.  I meant setting the UTF-8 flag on an 
octet sequence that can be interpreted as UTF-8, rather than leaving it 
unflagged and treated as Latin-1.  Thus, the data is in some sense 
converted from Latin-1 to UTF-8.

 

after letting my thoughts settle I come to the conclusion that I do not
agree completely. I think that DBI should do 80% of the job and leave
about 20% to the driver authors.
   

For a full solution yes, I agree - and I've written about this in the past.

For now I'm just talking about the specific but fairly common
situation of fetching data that is utf8 encoded but it doesn't
get flagged as such by the driver.
For that case the driver just needs to know when to do a SvUTF8_on(sv).

Exactly.

What about data going _into_ the database?  In my examples of doing the 
conversion manually with Encode::{en|de}code_utf8(), I was converting 
the Perl strings to octet sequences that could later be interpreted as 
UTF-8 before insertion into the database.  That way I could guarantee 
that all data retrieved from the database can be converted to UTF-8, in 
fact (as you pointed out) by simply turning the UTF-8 flag on.

If all the data that I insert really is UTF-8 then I guess it will just 
get serialised as a sequence of octets, and everything will be OK.

But what if the data I'm inserting isn't all UTF-8?  The problem is:

1. Perl's internal format isn't just UTF-8 -- it defaults to Latin-1 (or 
whatever) for strings in which every character can be represented in 
Latin-1;
2. The 8-bit characters of Latin-1 are represented as two-byte 
characters in UTF-8.

So, if I have the string Copyright © Fred Bloggs in Perl then it will 
not be UTF-8: the © is stored as one byte, not two not, and the UTF-8 
flag is off.  If I insert that straight into the database without 
running it through Encode::encode_utf8() first, then © itself, rather 
than its two-byte UTF-8 representation gets stored in the database, so 
when it gets retrieved from the database later you can't just turn the 
UTF-8 flag on -- you would need to run it through Encode::decode_utf8().

In other words, just having the driver switching the UTF-8 on and off 
will only work if I guarantee that all the strings I feed it to start 
with really are UTF-8, even when Perl would not normally have 
represented them as such.

It would be cool if something akin to binmode STDOUT, ':utf8'; could 
be applied when sending data to the driver -- i.e. my data is in Perl's 
internal format, whether that be Latin-1 or UTF-8 in the case of the 
string at hand, and it all gets automagically upgraded to UTF-8 if 
necessary before insertion into the database.  Then you only need to 
turn the flag on when retrieving it again.

At least, I think that's what I want :-s

- Steve



Re: What to do with UTF-8 data?

2003-09-11 Thread Bart Lateur
On Thu, 11 Sep 2003 12:31:52 +0100, Steve Hay wrote:

It would be cool if something akin to binmode STDOUT, ':utf8'; could 
be applied when sending data to the driver -- i.e. my data is in Perl's 
internal format, whether that be Latin-1 or UTF-8 in the case of the 
string at hand, and it all gets automagically upgraded to UTF-8 if 
necessary before insertion into the database. 

Oh that's easy to achieve. Just concatenate the string with an UTF-8
string, and you'll get an UTF-8 string. Perl will do the upgrading for
you.

Just try it:

$zero_length_utf8 = pack U0;  # UTF8, length ==0
$string = élève;  # Latin-1
$string .= $zero_length_utf8;   # upgrade to UTF8
print $string;

Now the reverse is much harder... :)

-- 
Bart.


Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Chuck Fox wrote:

Steve,

I am a Sybase DBA, but in situations like this, I have declared the 
column on the table to be varbinary or binary and stored the data 
directly without conversion.  Don't know if MySql supports this datatype. 
One can declare a column VARCHAR(n) BINARY for a similar thing to 
Sybase's VARBINARY(n), but unfortunately it makes no difference.  If I 
store the data directly without conversion into such a column then, as 
far as I can make out, each character's bytes get stored exactly as if I 
had converted to bytes beforehand.  And when I retrieve the data (again 
without conversion) I just get octet sequences into my Perl scalars - 
not flagged, UTF-8 character strings as I would like.

This is all exactly the same as what I get with a VARCHAR(n) column.

- Steve



Re: What to do with UTF-8 data?

2003-09-10 Thread Peter J. Holzer
On 2003-09-10 08:33:03 +0100, Steve Hay wrote:
 Chuck Fox wrote:
 I am a Sybase DBA, but in situations like this, I have declared the 
 column on the table to be varbinary or binary and stored the data 
 directly without conversion.  Don't know if MySql supports this datatype. 
 
 One can declare a column VARCHAR(n) BINARY for a similar thing to 
 Sybase's VARBINARY(n), but unfortunately it makes no difference.  If I 
 store the data directly without conversion into such a column then, as 
 far as I can make out, each character's bytes get stored exactly as if I 
 had converted to bytes beforehand.  And when I retrieve the data (again 
 without conversion) I just get octet sequences into my Perl scalars - 
 not flagged, UTF-8 character strings as I would like.
 
 This is all exactly the same as what I get with a VARCHAR(n) column.

Yes, but the difference is that with a binary column the database knows
that the data is not character data in a charset it knows. For example
the MySQL manual states:

  Values in CHAR and VARCHAR columns are sorted and compared in
  case-insensitive fashion, unless the BINARY attribute was specified
  when the table was created.

This doesn't work with UTF-8 data of course, because the individual
bytes of a multi-byte-character are not whole characters and hence don't
have a uppercase or lowercase equivalent. So two strings which should
compare equal generally won't, and sometimes strings which should not
compare equal will. If you just tell the database this is binary data,
not character data it won't try to do case conversions on it and it
(hopefully) will stop you doing them in SQL code (You can and must do
them in perl).

hp

Disclaimer: I haven't actually used MySQL for some time, this is general
advice, not targeted at the specific way MySQL compares strings.

-- 
   _  | Peter J. Holzer  | Unser Universum wäre betrüblich
|_|_) | Sysadmin WSR / LUGA  | unbedeutend, hätte es nicht jeder
| |   | [EMAIL PROTECTED]| Generation neue Probleme bereit.
__/   | http://www.hjp.at/   |  -- Seneca, naturales quaestiones


pgp0.pgp
Description: PGP signature


Re: What to do with UTF-8 data?

2003-09-10 Thread Bart Lateur
On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote:

And when I retrieve the data (again 
without conversion) I just get octet sequences into my Perl scalars - 
not flagged, UTF-8 character strings as I would like.

If you're *sure* that this is UTF-8, only perl doesn't flag it as such,
you can set the flag yourself. In perl 5.8.x, you can use the Encode
module, one of the functions documented near the bottom, here the
_utf8_on(STRING). That's an inplace modifying function, so you use it
like

_utf8_on($string_that_should_be_utf8);

For perl 5.6.x (and likely for 5.8 too), you can achieve the same effect
by using pack() this way:

$flagged_as_utf8 = pack U0a*, $string_that_should_be_utf8;


Earlier perls than 5.6 don't have this UTF8 flag, nor do they accept the
U template.

-- 
Bart.


Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Bart Lateur wrote:

On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote:

 

And when I retrieve the data (again 
without conversion) I just get octet sequences into my Perl scalars - 
not flagged, UTF-8 character strings as I would like.
   

If you're *sure* that this is UTF-8, only perl doesn't flag it as such,
you can set the flag yourself. In perl 5.8.x, you can use the Encode
module, one of the functions documented near the bottom, here the
_utf8_on(STRING). That's an inplace modifying function, so you use it
like
	_utf8_on($string_that_should_be_utf8);

Yep, that's an alternative to the $now_is_utf8 = 
Encode::decode_utf8($should_be_utf8) call that I described in my 
original posting.

But the question was: How can I arrange for such conversions to be 
performed automatically by DBI whenever it receives or returns data?

- Steve



Re: What to do with UTF-8 data?

2003-09-10 Thread Bart Lateur
On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:

But the question was: How can I arrange for such conversions to be 
performed automatically by DBI whenever it receives or returns data?

Well, there are two options... either does the dtabase somewhere stores
the flag indicating that some string is in UTF8, or you have to add that
information yourself. For the latter, I don't know if it'll actually
work, but it seems like an appropriate way to do it: add a BOM marker
at the start of the string.

http://www.unicode.org/unicode/faq/utf_bom.html#22 (and below)

-- 
Bart.


Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Bart Lateur wrote:

On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:

 

But the question was: How can I arrange for such conversions to be 
performed automatically by DBI whenever it receives or returns data?
   

Well, there are two options... either does the dtabase somewhere stores
the flag indicating that some string is in UTF8, or you have to add that
information yourself. For the latter, I don't know if it'll actually
work, but it seems like an appropriate way to do it: add a BOM marker
at the start of the string.
I don't think the MySQL 3.x stores any flag to indicate that a string is 
UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql 
that would make use of it, e.g. to decode data flagged in such a way 
into Perl's internal format.

Adding a BOM myself to the string seems to have problems of its own (see 
http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not 
aware of DBI / DBD-mysql having anything in them that would make use of 
such a BOM.  Please correct me if I'm wrong - that could be just the 
sort of thing that I'm looking for here.

- Steve




Re: What to do with UTF-8 data?

2003-09-10 Thread Peter J. Holzer
On 2003-09-10 10:40:29 +0100, Steve Hay wrote:
 But the question was: How can I arrange for such conversions to be 
 performed automatically by DBI whenever it receives or returns data?

You could subclass DBI or DBD::MySQL and replace all methods with
wrappers which perform the conversion. I'm not convinced that this is a
good idea, though. I'd rather try to the the conversion in application
specific layer above DBI.

hp

-- 
   _  | Peter J. Holzer  | Unser Universum wäre betrüblich
|_|_) | Sysadmin WSR / LUGA  | unbedeutend, hätte es nicht jeder
| |   | [EMAIL PROTECTED]| Generation neue Probleme bereit.
__/   | http://www.hjp.at/   |  -- Seneca, naturales quaestiones


pgp0.pgp
Description: PGP signature


Re: What to do with UTF-8 data?

2003-09-10 Thread Peter J. Holzer
On 2003-09-10 12:14:25 +0200, Bart Lateur wrote:
 On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:
 
 But the question was: How can I arrange for such conversions to be 
 performed automatically by DBI whenever it receives or returns data?
 
 Well, there are two options... either does the dtabase somewhere stores
 the flag indicating that some string is in UTF8, or you have to add that
 information yourself. For the latter, I don't know if it'll actually
 work, but it seems like an appropriate way to do it: add a BOM marker
 at the start of the string.

That doesn't help Steve. He already knows that the data is UTF-8, he
doesn't need the marker to distinguish between UTF-8 and Latin-X. 

His problem is that when he selects from the database, he has to
manually convert from utf-8 to perl-internal:

while (my ($foo, $bar) = $sth-fetchrow_array()) {
$foo = decode_utf8($foo);
$bar = decode_utf8($bar);

# do something with foo and bar
}

and he wants to happen the decode step automatically. 

Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL
3.23 to 4.1? 

hp

-- 
   _  | Peter J. Holzer  | Unser Universum wäre betrüblich
|_|_) | Sysadmin WSR / LUGA  | unbedeutend, hätte es nicht jeder
| |   | [EMAIL PROTECTED]| Generation neue Probleme bereit.
__/   | http://www.hjp.at/   |  -- Seneca, naturales quaestiones


pgp0.pgp
Description: PGP signature


Re: What to do with UTF-8 data?

2003-09-10 Thread Tim Bunce
On Wed, Sep 10, 2003 at 11:42:23AM +0100, Steve Hay wrote:
 Bart Lateur wrote:
 
 On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:
 
 But the question was: How can I arrange for such conversions to be 
 performed automatically by DBI whenever it receives or returns data?
 
 Well, there are two options... either does the dtabase somewhere stores
 the flag indicating that some string is in UTF8, or you have to add that
 information yourself. For the latter, I don't know if it'll actually
 work, but it seems like an appropriate way to do it: add a BOM marker
 at the start of the string.

 I don't think the MySQL 3.x stores any flag to indicate that a string is 
 UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql 
 that would make use of it, e.g. to decode data flagged in such a way 
 into Perl's internal format.
 
 Adding a BOM myself to the string seems to have problems of its own (see 
 http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not 
 aware of DBI / DBD-mysql having anything in them that would make use of 
 such a BOM.  Please correct me if I'm wrong - that could be just the 
 sort of thing that I'm looking for here.

Basically it should be the job of the drivers to set the uft8 flag on
data being retrieved if it is utf8. I believe that the new mysql v4.1
protocol does provide information about the characterset of each colum.
DBD::mysql can use that.

For people stuck with older versions of mysql, a driver private
option could be used to indicate that all char fields are utf8,
or have some way of indicating that per-column, such as

$sth-bind_col(1, undef, { mysql_charset = 'utf8' });

Tim.


Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Peter J. Holzer wrote:

Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL
3.23 to 4.1? 

I might be able to upgrade to 4.0 (I fact, I really ought to...), but I 
don't fancy 4.1 just yet -- it's still an alpha release :-(

- Steve



Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Peter J. Holzer wrote:

On 2003-09-10 10:40:29 +0100, Steve Hay wrote:
 

But the question was: How can I arrange for such conversions to be 
performed automatically by DBI whenever it receives or returns data?
   

You could subclass DBI or DBD::MySQL and replace all methods with
wrappers which perform the conversion. I'm not convinced that this is a
good idea, though. I'd rather try to the the conversion in application
specific layer above DBI.
Sub-classing DBI wouldn't help me since I'm using Class::DBI which won't 
know to use my DBI sub-class.

Sub-classing DBD-mysql might be more feasible, though, since I specify 
what driver to use.

Actually, I had hoped to find some appropriate hooks in Class::DBI, but 
I don't see any.  There are select and before_set triggers, but the 
latter are per-column, which would be a real pain to set up.

- Steve



Re: What to do with UTF-8 data?

2003-09-10 Thread Steve Hay
Tim Bunce wrote:

On Wed, Sep 10, 2003 at 11:42:23AM +0100, Steve Hay wrote:
 

Bart Lateur wrote:

   

On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:

 

But the question was: How can I arrange for such conversions to be 
performed automatically by DBI whenever it receives or returns data?
   

Well, there are two options... either does the dtabase somewhere stores
the flag indicating that some string is in UTF8, or you have to add that
information yourself. For the latter, I don't know if it'll actually
work, but it seems like an appropriate way to do it: add a BOM marker
at the start of the string.
 

I don't think the MySQL 3.x stores any flag to indicate that a string is 
UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql 
that would make use of it, e.g. to decode data flagged in such a way 
into Perl's internal format.

Adding a BOM myself to the string seems to have problems of its own (see 
http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not 
aware of DBI / DBD-mysql having anything in them that would make use of 
such a BOM.  Please correct me if I'm wrong - that could be just the 
sort of thing that I'm looking for here.
   

Basically it should be the job of the drivers to set the uft8 flag on
data being retrieved if it is utf8. I believe that the new mysql v4.1
protocol does provide information about the characterset of each colum.
DBD::mysql can use that.
Ah.  In that case, I should get onto the DBD-mysql people to look for 
assistance.  I was thinking that DBI itself would be adding some kind of 
UTF-8 support.

For people stuck with older versions of mysql, a driver private
option could be used to indicate that all char fields are utf8,
or have some way of indicating that per-column, such as
	$sth-bind_col(1, undef, { mysql_charset = 'utf8' });

OK, I'll pass this suggestion on to the DBD-mysql maintainer(s).

Thanks,
- Steve


Re: What to do with UTF-8 data?

2003-09-09 Thread Chuck Fox
Steve,

I am a Sybase DBA, but in situations like this, I have declared the 
column on the table to be varbinary or binary and stored the data 
directly without conversion.  Don't know if MySql supports this datatype.

HTH,

Chuck Fox
Principal Database Adminstrator
America Online, INC.
[EMAIL PROTECTED] wrote:

Hi,

Having looked through the list archives a bit, I see there has been 
lots of discussion over what to do with UTF-8 data, but I can't find 
anything that really helps me out now.

I'm running Perl 5.8.0 with MySQL 3.23.56 (via DBD-mysql).  Since (I 
think) there is no native UTF-8 support in MySQL below 4.1.x, my plan 
was to simply store the bytes of each UTF-8 character in the database 
(either by explicitly converting Perl's UTF-8 strings to sequences of 
octets using Encode::encode_utf8(), or else just letting that happen 
of its own accord as it seems to do :-s), and then convert such octet 
sequences back to UTF-8 strings when retrieving data from the database 
using Encode::decode_utf8().

As long as I store *all* my data in this way, those conversions should 
always succeed.  (encode_utf8() never fails anyway, and decode_utf8() 
will always work here because I'm always feeding it valid data.)

The problem is: How do I trap all input/output to/from DBI to do these 
conversions?

I can easily do it manually:

   $dbh-do('INSERT INTO foo (bar) VALUES (?)', undef, 
Encode::encode_utf8($input_utf8str));
   ...
   my @octets_row = $dbh-selectrow_array('SELECT bar FROM foo');
   my $output_utf8str2 = Encode::decode_utf8($octets_row[0]);

but that's way too tedious in practice.  I want to have those 
conversions done for me automatically, perhaps by having specified 
some appropriate encode/decode callbacks for certain hooks.  Do any 
such hooks exist?

If, as I suspect, there are not, then what is the currently 
recommended solution here?

Would it be at all easy to add such hooks, or some other means of 
specifying that I want these conversions done, to DBI as a stop-gap 
solution?

Thanks,
- Steve