Re: need help with utf-8

Peter J. Holzer Thu, 19 Dec 2024 06:44:17 -0800

On 2024-12-18 09:41:13 -0500, Shaomei Liu wrote:
> if you happen to have an example to show the new behavior is more
> "correct",

I haven't been on any of the main Perl mailing-lists or newsgroups for a
long time, so this may be outdated, but the general idea is that the
dichotomy between byte strings and character strings was a mistake and
that two strings which compare equal should behave the same whenever
possible. The difference is just too subtle and error-prone.

In particular, the string you created in your test script was a byte
string with three bytes ("\xe2\x80\x9C"). That string has length 3 and
it will compare equal to the string with the three characters 
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX, U+0080 PADDING CHARACTER,
U+009C STRING TERMINATOR. So it stands to reason that it should be
treated the same as that 3 character string, and the varchar stored in
the database should also be 3 characters long and not just a single
character, just because it happens to be a byte sequence which happens
to match that character's UTF-8 encoding.
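
A minimal sketch of the point above, using only core Perl and the Encode
module. The variable names are made up for illustration, and this is not
the original test script or the database driver's code; it only shows how
the two strings relate at the Perl level.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte string from the test script: three bytes.
my $bytes = "\xe2\x80\x9c";

# The same three code points (U+00E2, U+0080, U+009C), but forced into
# Perl's internal UTF-8 representation.
my $chars = "\xe2\x80\x9c";
utf8::upgrade($chars);

printf "length(bytes) = %d\n", length $bytes;
printf "length(chars) = %d\n", length $chars;
printf "equal: %s\n", ($bytes eq $chars ? "yes" : "no");

# Only an explicit decode turns the three bytes into a single character.
my $decoded = decode('UTF-8', $bytes);
printf "length(decoded) = %d, U+%04X\n", length $decoded, ord $decoded;
```

Both strings have length 3 and compare equal even though their internal
representations differ. Only the explicitly decoded string is the single
character U+201C.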

> On Wed, Dec 18, 2024 at 8:53 AM Felipe Gasper <fel...@felipegasper.com> wrote:
> >
> > Do we know, in fact, why this changed?
> >
> > The new behaviour may be “more correct”, but it’ll still subtly
> > break a bunch of stuff that worked fine before.

True.

But it should probably also be noted that RHEL 7 was released in 2014 and
RHEL 8 in 2019. So the "new behaviour" is now between 5 and 10 years
old.

I'm too lazy to track down the release which introduced the change
(especially since there seems to be a huge gap in the history on CPAN),
but I would expect that to be mentioned in the release notes at the
time.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
