On 2008-01-10 20:03:13 +0300, Tomash Brechko wrote:
> On Thu, Jan 10, 2008 at 17:29:42 +0100, Peter J. Holzer wrote:
> > The same byte sequence, but not the same value. In C (on many systems)
> > the single precision floating point number 3.1415927 and the integer
> > 1078530011 have the same byte sequence (0xdb 0xf 0x49 0x40 on little
> > endian systems), but they hardly have the same value.
>
> OK, I've got your point, though it's more a question of a terminology.
>
> Let me put it another way: my opinion is that C::M (and C::M::F)
> itself should not save/restore UTF-8 flag. Instead, it should work
> the same way other Perl data streams work. If you write a string to a
> file, no magic flags are stored somewhere. Instead, when you _read_
> it back you say, "alright, please set an UTF-8 flag on the data if it
> looks like UTF-8 string".

Nope. When you read a file, you specify the encoding the file should be
in. For all the encodings except raw, the stream is then actually
decoded and converted to perl character strings: If the file doesn't
match the specified encoding, an exception is thrown; there is no "if it
looks like" involved, and the utf8 flag is always set.
(Perl is more forgiving on output: If the current character cannot be
represented in the output encoding, the I/O layer substitutes it instead
of throwing an exception)
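
To make the "no guessing" point concrete, here is a minimal sketch of
strict decoding. It uses Encode directly with FB_CROAK instead of an I/O
layer, because how a layer reacts to malformed input depends on its
fallback settings. decode_strict is a hypothetical helper written for
this example, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets in the *stated* encoding or fail.
# Returns (1, $characters) on success and (0, $error_message) on failure.
sub decode_strict {
    my ($encoding, $octets) = @_;
    my $chars = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $octets.
        decode($encoding, $octets,
               Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? (1, $chars) : (0, $@);
}

# Six octets that are valid UTF-8 for a four-character string.
my ($ok, $text) = decode_strict('UTF-8', "gr\xc3\xbc\xc3\x9f");
printf "ok=%d length=%d\n", $ok, length($text);

# The same word as Latin-1 octets: not valid UTF-8, so this is rejected
# instead of being passed through because it "doesn't look like" UTF-8.
my ($ok2, $err) = decode_strict('UTF-8', "gr\xfc\xdf");
printf "ok=%d\n", $ok2;
```

The caller names the encoding up front, and data that doesn't match it
is an error, not a hint to fall back to byte semantics.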
> DBI works the same way (yes, DBD backends
> actually, thanks for pointing that, but this doesn't make much
> difference).
I hope not. DBD::Oracle converts from and to perl character strings if
the local character set (in NLS_LANG) is some variation of UTF-8.
Otherwise it converts from and to the local character set and expects
and delivers byte strings. No guessing involved. I think DBD::mysql has
some flag to control character vs. byte strings, but I haven't used that
lately. (We are talking only about varchar and clob types here - of
course a blob must never be converted)
> Actually, it's possible to store this flag in memcached, and _when
> asked_ to set UTF-8 back, no string scan would be necessary to see if
> the string is really in UTF-8.
I really think that:

    my $var = "some arbitrary string";
    $memcached->set("key", $var);
    my $var2 = $memcached->get("key");
    is($var, $var2);

should always succeed. That can be done by always encoding and decoding
it or by storing the flag and always honoring it if it is present.
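
A rough sketch of the second option (store a flag and always honor it),
with a plain hash standing in for the memcached server. cache_set,
cache_get and the F_CHARS flag bit are made up for this illustration;
they are not the C::M API.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed flag bit meaning "the stored octets are an encoded character string".
use constant F_CHARS => 1;

# Hypothetical in-memory stand-in for the server: key => [octets, flags].
my %store;

sub cache_set {
    my ($key, $value) = @_;
    if (utf8::is_utf8($value)) {
        # Character string: store UTF-8 octets and remember that we did.
        $store{$key} = [ encode('UTF-8', $value), F_CHARS ];
    }
    else {
        # Byte string: store the octets unchanged.
        $store{$key} = [ $value, 0 ];
    }
    return 1;
}

sub cache_get {
    my ($key) = @_;
    my $entry = $store{$key} or return;
    my ($octets, $flags) = @$entry;
    # Honor the flag unconditionally; never guess from the content.
    return ($flags & F_CHARS)
        ? decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC())
        : $octets;
}

my $var = "smiley: \x{263a}";
cache_set("key", $var);
my $var2 = cache_get("key");
print $var eq $var2 ? "round trip ok\n" : "round trip FAILED\n";
```

The application never sees the flag; it gets back a string that compares
equal to the one it stored, whether that was a byte string or a
character string.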
> However, I think such optimization is
> not worth the risk of missing some UTF-8 data that was uploaded though
> some other memcached client that doesn't set any special flag, or of
> setting UTF-8 flag on the string that was messed with append/prepend.
Right. Requiring the programmer to do something special is prone to
errors. C::M should handle all perl data types transparently.
Of course if you have different clients accessing the data you need to
specify the exact format anyway - a Python client probably can't decode
Perl's Storable format.
Append/prepend may be a problem. But that can be left to the
application - appending a byte string to a character string is a type
error, just like adding a length to an area is - perl lets you get away
with both, but the result won't be sensible.
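
A small sketch of that type error, using only Encode. perl silently
treats the octets as Latin-1 characters when the two kinds of string are
concatenated:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{263a} ";                  # character string, 2 characters
my $bytes = encode('UTF-8', "\x{e9}");    # byte string, 2 octets (0xC3 0xA9)

# Type error: perl lets you do it, but each octet becomes its own
# character, so the result has 4 elements and ends in mojibake.
my $mixed = $chars . $bytes;

# Decoding first gives the 3 characters that were meant.
my $fixed = $chars . decode('UTF-8', $bytes);

printf "mixed=%d fixed=%d\n", length($mixed), length($fixed);
```

An append on the server side has the same effect if one client sent
encoded characters and another sent raw bytes; only the application
knows which of the two it meant.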
> You correctly pointed that this flag is part of Perl's internals, so
> it's better not to set it without additional precautions.
I think there are two layers:

1) Perl knows two types of strings: byte strings and character strings.
   The elements of byte strings can be mapped to 8-bit numbers, the
   elements of character strings can be mapped to 32-bit numbers.
   There are also some differences related to character classes, etc.
   Whether a given scalar is a character string or a byte string can be
   determined with the (badly named) utf8::is_utf8 function.

   This is the conceptual model.

2) perl character strings are actually stored in UTF-8 format, and there
   is a flag in the PV structure to distinguish character strings from
   byte strings, which can be manipulated.

   These are implementation details.

I think it is perfectly ok (and even necessary) to take 1) into account,
but one shouldn't rely on 2) unless necessary.
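
One way to see why 2) is an implementation detail: a sketch in which the
storage format of a string is changed with utf8::upgrade while its value,
as seen from Perl code, stays the same. Only core functions are used.

```perl
use strict;
use warnings;

my $native   = "caf\xe9";     # 4 elements, stored one octet each
my $upgraded = $native;
utf8::upgrade($upgraded);     # same 4 elements, now stored as UTF-8

printf "equal=%d lengths=%d/%d flags=%d/%d\n",
    ($native eq $upgraded) ? 1 : 0,
    length($native), length($upgraded),
    utf8::is_utf8($native)   ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0;
```

The two scalars compare equal and have the same length; only the flag
differs. Code that flips the flag by hand without also converting the
stored octets changes the value, which is why that is better left to
encode/decode.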
hp
--
_ | Peter J. Holzer | It took a genius to create [TeX],
|_|_) | Sysadmin WSR | and it takes a genius to maintain it.
| | | [EMAIL PROTECTED] | That's not engineering, that's art.
__/ | http://www.hjp.at/ | -- David Kastrup in comp.text.tex
