Re: [PATCH] utf8 flag support on perl lib

Peter J. Holzer Sat, 12 Jan 2008 15:25:05 -0800

On 2008-01-12 19:09:08 +0300, Tomash Brechko wrote:
> On Sat, Jan 12, 2008 at 13:38:38 +0100, Peter J. Holzer wrote:
> > * A floating point value: Same as the integers, except that on at least
> >   some perl implementations the FP -> string -> FP conversion may lose
> >   some precision (but that's a bug in perl).
> 
> It's not a bug in Perl, but common user misunderstanding.


I think it is a bug in perl - a rather simple off-by-one error to boot.
In some circumstances the FP to string conversion stops one digit too
soon - if it included the next digit in the result, the error would
be less than eps/2, and the string to FP conversion would result in the
same value.
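
The round-trip property in question can be checked with a small sketch
like the following. It assumes a perl whose NV is an IEEE 754 double;
`roundtrips` is a helper invented for this illustration, and the digit
counts 15 and 17 are chosen for demonstration, not taken from perl's
source.

```perl
use strict;
use warnings;

# Format $x with $digits significant digits, convert the string back to a
# number, and report whether the original value was recovered.
sub roundtrips {
    my ($x, $digits) = @_;
    my $s = sprintf("%.*g", $digits, $x);
    return ($s + 0) == $x ? 1 : 0;
}

my $x = 0.1 + 0.2;
for my $digits (15, 17) {
    printf "%2d digits: %s -> %s\n", $digits, sprintf("%.*g", $digits, $x),
        roundtrips($x, $digits) ? "same value" : "different value";
}
```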

> Being dynamically typed language is different from having automatic
> conversion into strings.

I didn't mean to imply that these concepts have anything to do with each
other. They are orthogonal, Perl just happens to have both.

> When you store floats to file you can either stringify it, or pack()
> it, depending on what you are trying to achieve.  _Stringification_ is
> simply not a substitute for _serialization_.

"Stringification" is a rather vague term which can mean very different
things. In the case of stringification of numeric types in Perl it is a
type of serialization - not the only type available, but the only type
which is built directly into the language and which is portable between
different perl implementations (pack("D") isn't). So it should work
correctly. But this is off-topic for this list, so I won't discuss this
further - I shouldn't even have mentioned it.

> C::M could just use Storable in all cases, but there
> are actually two APIs in it: one is raw, byte-oriented, Perl-unaware
> (passing scalar _buffers_), and one is Perl aware (passing references,
> including references to scalars, they are not an exception).  It's a
> mere coincidence that some scalar types are automatically converted to
> such "octet buffers".

This is not coincidence. It is a basic feature of the perl programming
language.

> And that's why proposed solution has to call Perl internal
> functions---

It doesn't, actually. It can be done with "official" functions (and at
least in 5.10.0, they aren't marked as "experimental" any more either).

> it solves the wrong problem in a wrong way.
> 
> 
> But this misunderstanding is very common in Perl world, and I myself
> put $memd->set('key', 123); in example section of C::M::F (thankfully
> it isn't 1.23 :)).

Right. The difference between your "two APIs" just isn't visible to the
user - it's the same function, and the documentation treats it the same.
So how should the user know there's a difference if even the author
seems to be unaware of it?


> And since the reality is how one observes it, there's no point in
> trying to change this.  If there are users who demand the
> functionality, let's have it, wrong as it is ;).  As long as it is
> disabled by default everyone should be satisfied.
> 
> I'm adding the following to my TODO list for C::M::F (can't help with
> C::M, sorry):
> 
>   Add constructor parameter
> 
>     encoding => 'preserve' | 'force' (default: undef, i.e. neither)
> 
> 
> 'preserve' would mean "save on store and restore on fetch",

How do you plan to save this information? In a flag or inline? If the
latter, how can this be recognized reliably?

> 'force' would mean "forcefully make Perl think it's a text string"

I think this is a bad idea. If an application knows that some value is
stored as a UTF-8 sequence it should call decode_utf8 itself. It should
not rely on some heuristics.

> (this would be needed for the scenarios I outlined earlier, i.e. when
> the side that does the store doesn't set any flag but fetching side is
> confident that fetched data is the text in the right encoding).

I think this is very bad application design. I see no reason to encourage
this. But it's your library, you can implement what you want.


> Peter, could you please enlighten me with the expert opinion if I should
> use Encode::is_utf8 and Encode::_utf8_on like Tatsuki-san did, or have
> I use utf8::is_utf8(), utf8::upgrade(), utf8::downgrade(), to get
> _any_ encoding work, not just UTF-8?

I'm not sure if I understand what you mean by getting any encoding to
work. If an application wants to store strings in arbitrary encodings in
memcached and restore them including the encoding information, it should
just store a tuple with the two values:

$memcache->set($key, [ "iso-8859-1", "K\x{E4}se" ]);

The knowledge that a byte string is text in a specific encoding resides
only in the application - perl knows nothing about it, so
Cache::Memcached cannot act on it - the application must explicitly
store the info.
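
A minimal sketch of that idea, with the application doing the encoding
and decoding itself at the cache boundary. The `%cache` hash is only a
stand-in for a memcached client, and `store_text` and `fetch_text` are
hypothetical helper names. Only `Encode::encode` and `Encode::decode`
are real library calls.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my %cache;    # hypothetical stand-in for a memcached client

# Store a character string as [ encoding name, encoded octets ].
sub store_text {
    my ($key, $encoding, $text) = @_;
    $cache{$key} = [ $encoding, encode($encoding, $text) ];
    return;
}

# Fetch the pair and decode the octets with the recorded encoding.
sub fetch_text {
    my ($key) = @_;
    my ($encoding, $octets) = @{ $cache{$key} };
    return decode($encoding, $octets);
}

store_text("cheese-latin1", "iso-8859-1", "K\x{E4}se");
store_text("cheese-utf8",   "UTF-8",      "K\x{E4}se");
print length($cache{"cheese-latin1"}[1]), " octets vs ",
      length($cache{"cheese-utf8"}[1]), " octets\n";
```

Either entry decodes back to the same four-character string. The
encoding name travels with the data instead of being guessed on fetch.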

OTOH, Perl does have "character strings". We aren't supposed to know how
they are stored internally, but we know that we can encode them as UTF-8
and decode them again without loss of information.

So we can do something like this:

in set:
    if (utf8::is_utf8($val)) {
        utf8::encode($val);
        $flags |= F_UTF8;
    }

in get:
    if ($flags & F_UTF8) {
        utf8::decode($val) or die "oops!";
    }

TIMTOWTDI, of course.
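
For illustration, the two fragments above can be put into a
self-contained script. This is a sketch under assumptions, not the
Cache::Memcached implementation: the value 4 for F_UTF8 is an arbitrary
flag bit picked here, `%store` stands in for the server, and
`set_value` and `get_value` are hypothetical names.

```perl
use strict;
use warnings;

use constant F_UTF8 => 4;    # hypothetical flag bit, chosen for this sketch

my %store;                   # hypothetical stand-in for the memcached server

sub set_value {
    my ($key, $val) = @_;    # $val is a copy, the caller's string is untouched
    my $flags = 0;
    if (utf8::is_utf8($val)) {
        utf8::encode($val);  # character string -> UTF-8 octets, in place
        $flags |= F_UTF8;
    }
    $store{$key} = [ $flags, $val ];
    return;
}

sub get_value {
    my ($key) = @_;
    my ($flags, $val) = @{ $store{$key} };
    if ($flags & F_UTF8) {
        utf8::decode($val) or die "value for '$key' is not valid UTF-8";
    }
    return $val;
}

set_value("text",  "\x{263A} smile");    # character string, flag gets set
set_value("bytes", "\xFF\xFE\x00");      # octet string, stored as is

printf "%s: flags=%d, %d octets stored\n",
    $_, $store{$_}[0], length $store{$_}[1] for sort keys %store;
```

The octet string is never touched, so binary data that happens not to
be valid UTF-8 passes through unchanged.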

I would prefer utf8::is_utf8 to Encode::is_utf8, because the latter is
marked as internal, while the former is not (OTOH, Encode::is_utf8 was
introduced in 5.8.0, and utf8::is_utf8 only in 5.8.1, so Encode::is_utf8
is currently more portable than utf8::is_utf8).

For the conversion, I think it's a matter of taste whether you use
utf8::encode or Encode::encode_utf8 (or even  Encode::encode('UTF-8',
...)). I would avoid relying on Perl internals - so don't expect send to
convert to UTF-8 and don't use Encode::_utf8_on (unless you can
demonstrate that doing so really avoids a performance bottleneck).

I don't see how utf8::upgrade() and utf8::downgrade() could be used here
- they do something different.
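
A short sketch of the difference: utf8::upgrade() changes only how perl
stores the string internally, while utf8::encode() changes the string
itself. The variable names are for this example only.

```perl
use strict;
use warnings;

my $s = "K\x{E4}se";    # 4 characters, all with code points below 256

my $up = $s;
utf8::upgrade($up);     # internal representation changes, the value does not

my $enc = $s;
utf8::encode($enc);     # the value changes: now the UTF-8 octets of $s

printf "upgraded: length %d, equal to original: %s\n",
    length $up, ($up eq $s ? "yes" : "no");
printf "encoded:  length %d, equal to original: %s\n",
    length $enc, ($enc eq $s ? "yes" : "no");
```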

        hp

-- 
   _  | Peter J. Holzer    | It took a genius to create [TeX],
|_|_) | Sysadmin WSR       | and it takes a genius to maintain it.
| |   | [EMAIL PROTECTED]         | That's not engineering, that's art.
__/   | http://www.hjp.at/ |    -- David Kastrup in comp.text.tex
