Re: [PATCH] utf8 flag support on perl lib

Peter J. Holzer Thu, 10 Jan 2008 08:30:11 -0800

On 2008-01-10 15:56:08 +0300, Tomash Brechko wrote:
> On Thu, Jan 10, 2008 at 12:49:28 +0100, Peter J. Holzer wrote:
> > > Cache::Memcached::Fast doesn't preserve UTF-8 and tainted flags
> > > either.
> > 
> > Losing the utf8 flag changes the value,
> 
> Not really.  If you had an UTF-8 string, and reset the UTF-8 flag, you
> still have the same byte sequence.


The same byte sequence, but not the same value. In C (on many systems)
the single-precision floating point number 3.1415927 and the integer
1078530011 have the same byte sequence (0xdb 0x0f 0x49 0x40 on
little-endian systems), but they hardly have the same value.
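
The reinterpretation can be reproduced as a minimal sketch in Perl
rather than C. It assumes perl 5.10 or later (for the "<" byte-order
modifier in pack templates); the variable names are only illustrative.

```perl
use strict;
use warnings;

# Pack 3.1415927 as a little-endian single-precision float, then read
# the very same four bytes back as a little-endian signed 32-bit integer.
my $bytes = pack('f<', 3.1415927);
my $hex   = join ' ', map { sprintf '0x%02x', $_ } unpack('C4', $bytes);
my $int   = unpack('l<', $bytes);

print "$hex\n";   # prints: 0xdb 0x0f 0x49 0x40
print "$int\n";   # prints: 1078530011
```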

Similarly in Perl, the byte sequence 0x4B 0xC3 0xA4 0x73 0x65
can represent the four-character string <LATIN CAPITAL LETTER K>
<LATIN SMALL LETTER A WITH DIAERESIS> <LATIN SMALL LETTER S>
<LATIN SMALL LETTER E>, but the same byte sequence can represent the
five-byte string <LATIN CAPITAL LETTER K> <unspecified byte value 0xC3>
<unspecified byte value 0xA4> <LATIN SMALL LETTER S> <LATIN SMALL LETTER
E>. They don't compare equal and they act differently in most
situations. As a Perl programmer, you shouldn't even know that they are
represented by the same byte sequence - that's an implementation detail
which might change.
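
A small sketch of the difference, assuming only the core Encode module.
The source is kept pure ASCII on purpose, and the variable names are
made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\x4B\xC3\xA4\x73\x65";     # byte string: five bytes
my $chars = decode('UTF-8', $bytes);    # character string: K, a-diaeresis, s, e

printf "%d %d\n", length($bytes), length($chars);     # prints: 5 4
print $bytes eq $chars ? "equal\n" : "different\n";   # prints: different
```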


> It's only now Perl would treat these bytes as such, thus length($str)
> and regexp classes won't work character-wise.  You may call
> Encode::_utf8_on($str), and everything would get to normal.

_utf8_on changes the type of a variable without changing the bits stored
in the variable. It's roughly similar to accessing a variable of one
type via a pointer of another type in C. You should not do that. There's
a reason this function starts with an underscore and is marked as
"[INTERNAL]".

If you need to convert a UTF-8 encoded byte string to a character
string, use decode_utf8.
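
One practical difference is that decoding can validate its input, which
flipping the flag never does. A hedged sketch using Encode's decode with
the FB_CROAK and LEAVE_SRC check flags; the sample strings and variable
names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $good = "K\xC3\xA4se";   # valid UTF-8 encoding of a four-character word
my $bad  = "K\xE4se";       # Latin-1 bytes, not valid UTF-8

# LEAVE_SRC keeps decode from consuming the source string in place.
my $chars = decode('UTF-8', $good, FB_CROAK | LEAVE_SRC);
print length($chars), "\n";                    # prints: 4

# With FB_CROAK, malformed input raises an exception instead of being
# silently turned into a broken character string.
my $ok = eval { decode('UTF-8', $bad, FB_CROAK | LEAVE_SRC); 1 };
print $ok ? "accepted\n" : "rejected\n";       # prints: rejected
```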

(Yes, I've seen that the patch uses _utf8_on. One can argue that the
effects of send and _utf8_on are known for any given perl version and
that there are test cases to guard against implementation changes, but I
wouldn't do it this way unless benchmarking shows that there's a
significant performance improvement)


> The reason this "feature" of C::M is seldom noticed is that most
> scripts just pass data back and forth, not performing any
> character-wise manipulations on it.

The reason is probably that they pass byte strings. Otherwise they
would notice that the value changes even without doing character
manipulations.

> The scripts that do care may call utf8::upgrade() (or
> Encode::_utf8_on() when sure).

You do know that these functions do very different things?
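
To make the difference concrete, here is a short sketch (variable names
are illustrative only): utf8::upgrade changes the internal
representation but keeps the value, while Encode::_utf8_on keeps the
internal bytes and thereby changes the value.

```perl
use strict;
use warnings;
use Encode ();

my $upgraded = "K\xC3\xA4se";   # five bytes
my $flagged  = $upgraded;       # independent copy of the same five bytes

utf8::upgrade($upgraded);       # same five characters, new internal form
Encode::_utf8_on($flagged);     # same internal bytes, now read as four characters

printf "%d %d\n", length($upgraded), length($flagged);   # prints: 5 4
print $upgraded eq "K\xC3\xA4se" ? "unchanged\n" : "changed\n";   # prints: unchanged
print $flagged  eq "K\xC3\xA4se" ? "unchanged\n" : "changed\n";   # prints: changed
```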

> Note that it's actually dangerous to enable UTF-8 flag in scripts that
> do not "use utf8;" or "use encoding 'utf-8';".

That's wrong.  "use utf8" declares that the source code is UTF-8. It
doesn't have anything to do with how the script works at run-time.
"use encoding 'utf-8'" additionally has some run-time effects (e.g., it
sets I/O layers on STDIN and STDOUT). Neither is necessary
for the handling of character strings. You may have them confused with
"use bytes".
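
As a rough illustration, the following script has neither pragma and
still handles a character string character-wise. The sample string is
made up, and the source is plain ASCII with escapes for the non-ASCII
characters.

```perl
use strict;
use warnings;
# Deliberately no "use utf8" and no "use encoding" here.

my $str = "K\x{E4}se \x{20AC}5";   # a-diaeresis and a Euro sign via escapes

print length($str), "\n";                                     # prints: 7
print uc($str) eq "K\x{C4}SE \x{20AC}5" ? "ok\n" : "not ok\n";   # prints: ok
print $str =~ /\x{20AC}(\d)/ ? "$1\n" : "no match\n";         # prints: 5
```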


> There actually may be a mix of scripts each using it's own encoding
> (most often case is when the script does not use any encoding and
> treat everything as bytes).  That's why DBI enables UTF-8 only when
> requested.

AFAIK DBI does no such thing. You may think of some specific DBD driver.


> > > Besides, more often than not you use memcached client together with
> > > some other means to get the data if it's missing from the cache.
> > > While you may fix C::M(::F), not every other backend preserves these
> > > flags automatically.
> > 
> > These are the primary means for storing your data. If they can't handle
> > your data you've got a problem :-). When you design your application you
> > know (or should know) what data you want to store and choose your data
> > model and storage system accordingly. If MySQL can't do it, use Oracle
> > (or vice versa); if a varchar column can't do it, use a blob; etc.
> 
> Being UTF-8 or not is not a property of the data, but a property of
> how you work with this data,

That's exactly why I don't like that all these perl functions which deal
with character strings have "utf8" in the name. It gives people the idea
that "UTF-8" has anything to do conceptually with perl character
strings. It doesn't. The fact that perl stores character strings
internally as UTF-8 is an implementation detail. A character is just a
character and it doesn't matter whether it's stored in 1, 2, 3 or 4
bytes.

> so MySQL or Oracle has nothing to do with this.

It does. If I want to store some data in a database I need to choose a
column type which can represent the data. If I want to store an "ä" I
need to choose a character set for that column (or the whole database)
which includes that character. I do not have to know (or at least I
shouldn't have to know) the actual bit pattern used to store that
character. That's the job of the database engine and the DBD driver. But
I do have to know the limits (e.g., that I cannot store a Euro sign in a
Latin-1 varchar column, or a string 30000 characters long in an
Oracle varchar2 column, or the number 10000000000 in a MySQL int
column).

Of course I have to know the limits for Cache::Memcached also. But the
only documented limit is the maximum length of 1 MB. Every Perl data
structure shorter than this is supposed to be storable and retrievable.


> Just enable the flag if you want to work with characters, do nothing
> if bytes are fine.

No. Don't ever "just enable the flag", unless you are doing really
low-level stuff (i.e., messing with perl internals). Think of "character
strings" and "byte strings" as two different types. Normally you want
encode and decode to get one from the other, rarely upgrade and
downgrade.
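
A sketch of what that looks like around a store that only keeps bytes.
The %store hash and the store_text/fetch_text helpers are hypothetical
stand-ins for illustration, not part of Cache::Memcached or of the
patch.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my %store;   # hypothetical stand-in for a byte-oriented cache

# Character string in, byte string stored.
sub store_text {
    my ($key, $chars) = @_;
    $store{$key} = encode('UTF-8', $chars);
    return;
}

# Byte string fetched, character string out (undef on a miss).
sub fetch_text {
    my ($key) = @_;
    my $bytes = $store{$key};
    return defined $bytes ? decode('UTF-8', $bytes) : undef;
}

my $word = "K\x{E4}se";
store_text('cheese', $word);
my $back = fetch_text('cheese');

print length($store{cheese}), "\n";                   # prints: 5
print $back eq $word ? "same value\n" : "changed\n";  # prints: same value
```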

> But it's up to the script to decide which one is desired.

Right. You should know whether you are dealing with bytes or characters
at any point in your script.

        hp

-- 
   _  | Peter J. Holzer    | It took a genius to create [TeX],
|_|_) | Sysadmin WSR       | and it takes a genius to maintain it.
| |   | [EMAIL PROTECTED]         | That's not engineering, that's art.
__/   | http://www.hjp.at/ |    -- David Kastrup in comp.text.tex
