Tim Bunce <[EMAIL PROTECTED]> writes:

> But can someone summarise the causes/issues into something we can
> all understand? [I don't have time to try to do that for myself.]

Since we went through the utf8 story and I believe we now understand it
fully, I ought to at least try.

As I see it, the core of the matter is the fundamental change introduced
in Perl strings around 5.6.0. Prior to that, strings were essentially
sequences of _bytes_. Now, they are sequences of _characters_, a
character being a number between 0 and 2**32-1 (or perhaps even bigger,
can't remember off-hand).
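
A minimal sketch of what character semantics means in practice,
assuming Perl 5.6 or later (the variable names are only illustrative):

```perl
use strict;
use warnings;

# One character with code point 0x240, well above the single-byte range.
my $s = "\x{240}";

my $chars = length($s);   # counted in characters, not bytes
my $code  = ord($s);      # code point of the first (and only) character

printf "length=%d code=0x%X\n", $chars, $code;   # prints "length=1 code=0x240"
```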

The problem is that a lot of code (and programmers) still think of
strings as just sequences of bytes. Also, prior to Perl 5.8 things were
not really stable, e.g. "man utf8" in Perl 5.6.1 says:

     WARNING: The implementation of Unicode support in Perl is
     incomplete.  See the perlunicode manpage for the exact
     details.

Since strings are now sequences of characters, the burden is on the
programmer to decide what character encoding to assume when reading
strings (e.g. from files or Perl source), and when writing strings.

Reading by default assumes a null encoding (every byte is mapped to the
character with the same value), which is usually what you want, except
if you are reading text in a multibyte encoding. In the latter case, it
is necessary to decode the input into the correct characters.
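
To illustrate the difference between the null encoding and explicit
decoding, here is a small sketch. It assumes the Encode module that
ships with Perl 5.8 (it is not available in 5.6.1), and that the input
bytes are UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes 0xC9 0x80 are the UTF-8 encoding of character 0x240.
my $raw = "\xC9\x80";

# Null encoding: each byte is mapped to the character with the same value.
my $undecoded_length = length($raw);

# Explicit decoding: the two bytes become one character.
my $text           = decode('UTF-8', $raw);
my $decoded_length = length($text);

printf "undecoded=%d decoded=%d code=0x%X\n",
    $undecoded_length, $decoded_length, ord($text);
```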

Writing is where 5.6.1 IMO is really bad (5.8 is hopefully better,
haven't tried it yet). In Perl <= 5.6.1, outputting a string simply
dumps the internal representation. Thus output will be _different_ for
the "normal" single-byte representation and the "new" utf8
representation.
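
The two internal representations can be made visible with a sketch
like the following. It assumes Perl 5.8 or later, where utf8::upgrade
and bytes::length are available; it only inspects the stored bytes and
does not reproduce the 5.6.1 output behaviour itself:

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

my $single = "\xE9";      # character 233, stored as one byte
my $wide   = "\xE9";
utf8::upgrade($wide);     # same character, now stored internally as UTF-8

# As Perl strings the two are equal ...
my $same = ($single eq $wide) ? 1 : 0;

# ... but the internal representations have different sizes.
my $stored_single = bytes::length($single);
my $stored_wide   = bytes::length($wide);

print "same=$same single=$stored_single wide=$stored_wide\n";
```

Going by the description above, an IO layer that simply dumps the
internal representation would write one byte for the first string and
two for the second, even though they compare equal.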

This is really Perl 5.6.1 IO being broken, since the internal
representation of a string should not normally be exposed to the
programmer. The problem typically occurs because some module (like
XML::Parser for example) is Unicode-enabled and uses the utf8 internal
representation in the strings it returns. An unsuspecting programmer
will suddenly see her output in UTF-8 encoding, where she expected
single-byte encoding.

Put another way, depending on the exact code path a Perl string takes,
outputting it will somewhat arbitrarily, and pretty confusingly, use
single-byte or UTF-8 encoding.

> Obviously the current situation is not good. But I need to have a
> better understanding of *exactly* what's going on at all levels
> (app code and driver) before I can think much about what needs
> fixing or documenting, and how.

Drivers should be made aware of character semantics in Perl strings: by
default in Perl >= 5.8, and possibly only when explicitly turned on in
5.6.*.

Text data in a multi-byte encoding (characters > 255) should be returned
in character semantics, i.e. using the utf8 internal encoding. Binary
data (BLOB ...) should use byte semantics.

So, inserting and retrieving the Perl string "\x{240}" in a TEXT column
should return a string, of length one, containing a single character
with code 0x240 (576) (assuming such a character is valid for the
database character set). This requires using the utf8 internal representation.
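
As a rough sketch of the driver-side behaviour described here:
column_value is a hypothetical helper, not part of DBI or of any
existing driver, and it assumes that the database hands text to the
driver as UTF-8 bytes and that Encode (Perl 5.8) is available.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes received from the database
# into the Perl string handed back to the application.
sub column_value {
    my ($raw, $is_text) = @_;
    # Text column: decode into characters (assumes UTF-8 on the wire).
    return decode('UTF-8', $raw) if $is_text;
    # Binary column (BLOB ...): return the bytes untouched.
    return $raw;
}

my $text = column_value("\xC9\x80", 1);   # one character, code 0x240
my $blob = column_value("\xC9\x80", 0);   # two bytes, 0xC9 and 0x80

printf "text length=%d, blob length=%d\n", length($text), length($blob);
```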

Strings containing only characters <= 255 can in principle use either
single-byte or utf8 internal encoding, but in practice I think it is
best to use the single-byte internal representation when all characters
are <= 127, and the utf8 internal representation otherwise.
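
One way such a rule could be written down, as a sketch only:
normalise_representation is a made-up name, and utf8::upgrade,
utf8::downgrade and utf8::is_utf8 assume Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the internal representation by content.
sub normalise_representation {
    my ($s) = @_;
    if ($s =~ /[^\x00-\x7F]/) {
        utf8::upgrade($s);     # some character > 127: use utf8 internally
    } else {
        utf8::downgrade($s);   # all characters <= 127: use single bytes
    }
    return $s;
}

my $ascii = "abc";
utf8::upgrade($ascii);                             # force the utf8 form first
my $plain  = normalise_representation($ascii);     # back to single-byte
my $accent = normalise_representation("\xE9");     # switched to utf8

printf "plain is utf8: %d, accent is utf8: %d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($accent) ? 1 : 0;
```

The string value is the same before and after; only the storage
changes.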


I hope this explanation helps somewhat; the issue is admittedly
complex (and the above is only IMHO, of course).

 - Kristian.

-- 
Kristian Nielsen   [EMAIL PROTECTED]
Development Manager, Sifira A/S
