Tom Christiansen <tchr...@perl.com> added the comment:

David Murray <rep...@bugs.python.org> wrote:
> Tom, note that nobody is arguing that what you are requesting is a bad
> thing :)

There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re,
even things that to my jaded eye seem like actual bugs.

There are bugs, and then there are bugs. In my survey of Unicode support
across 7 programming languages for OSCON
http://training.perl.com/OSCON/index.html I came across a lot of
weirdnesses, especially at first when the learning curve was high. Sure,
I found it odd that unlike Java, Perl, and Ruby, Python didn't offer
regular casemapping on strings, only the simple character-based mapping.
But that doesn't make it a bug, which is why I filed it as a
feature/enhancement request/wish, not as a bug.

I always count as bugs not handling Unicode text the way Unicode says it
must be handled. Such things would be:

 * Emitting CESU-8 when told to emit UTF-8.

 * Violating the rule that UTF-8 must be in the shortest possible
   encoding.

 * Not treating a code point as a letter when the supported version of
   the UCD says it is. (This can happen if internal rules get out of
   sync with the current UCD.)

 * Claiming one does "the expected thing on Unicode" for
   case-insensitive matches when not doing what Unicode says you must
   minimally do: use at least the simple casefolds, if not in fact the
   full ones.

 * Saying \w matches Unicode word characters when one's definition of
   word characters differs from that of the supported version of the
   UCD.

Supporting Unicode vX.Y.Z is more than adding more characters. All the
behaviors specified in the UCD have to be updated too, or else you are
just ISO 10646. I believe some of Python's Unicode bugs happened because
folks weren't aware which things in Python were defined by the UCD or by
various UTS reports yet were not being directly tracked that way. That's
why it's important to always fully state which version of these things
you follow.
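To make the simple-versus-full casefold distinction concrete, here is a
minimal Python 3 sketch. It assumes str.casefold (added in Python 3.3),
which applies the full Unicode case folding; str.lower applies only the
per-character mapping, which is the minimal behavior described above:

    # Full case folding expands some characters; simple mapping does not.
    s = "Straße"
    print(s.lower())     # straße  -- per-character mapping leaves ß alone
    print(s.casefold())  # strasse -- full fold expands ß to "ss"

    # A case-insensitive comparison done per the UCD compares casefolds:
    print("STRASSE".casefold() == "Straße".casefold())  # True

A regex engine that does case-insensitive matching with only the simple
mapping will say "STRASSE" and "Straße" differ, which is exactly the
kind of divergence from the Standard at issue here.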
Other bugs, many actually, are a result of the narrow/wide-build
untransparency.

There is wiggle room in some of these. For example, in the one that
applies to re, you could -- in a sense -- remove the bug by no longer
claiming to do case-insensitive matches on Unicode. I do not find that
very useful. Javascript works this way: it doesn't do Unicode
casefolding. In Java you have to ask nicely with the extra UNICODE_CASE
flag, aka "(?u)", used with CASE_INSENSITIVE, aka "(?i)".

Sometimes languages provide different but equivalent interfaces to the
same functionality. For example, you may not support the Unicode
property \p{NAME=foobar} in patterns but instead support \N{foobar} in
patterns and hopefully also in strings. That's just fine.

On slightly shakier ground but still I think defensible is how one
approaches support for the standard UCD properties:

    Case_Folding          Simple_Case_Folding
    Titlecase_Mapping     Simple_Titlecase_Mapping
    Uppercase_Mapping     Simple_Uppercase_Mapping
    Lowercase_Mapping     Simple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to directly
support a Case_Folding property like \p{Case_Folding=s}, since "(?i)s"
should be the same thing as "\p{Case_Folding=s}".

> As far as I know, Matthew is the only one currently working on the
> regex support in Python. (Other developers will commit small fixes if
> someone proposes a patch, but no one that I've seen other than Matthew
> is working on the deeper issues.) If you want to help out that would
> be great.

Yes, I actually would. At least as I find time for it. I'm a competent C
programmer and Matthew's C code is very well documented, but reading it
carefully is very time consuming. For bang-for-buck, I do best on test
and doc work, making sure things are actually working the way they say
they do. I was pretty surprised and disappointed by how much trouble I
had with Unicode work in Python.
A bit of that is learning curve, a bit of it is suboptimal defaults, but
quite a bit of it is that things either don't work the way Unicode says
they must, or that something is altogether missing. I'd like to help at
least make the Python documentation clearer about what it is or is not
doing in this regard.

But be warned: one reason that Java 1.7 handles Unicode more according
to the published Unicode Standard in its Character, String, and Pattern
classes is that when they said they'd be supporting Unicode 6.0.0, I
went through those classes, and every time I found something in
violation of that Standard, I filed a bug report that included a
documentation patch explaining what they weren't doing right. Rather
than apply my rather embarrassing doc patches, they instead fixed the
code. :)

> And as far as this particular issue goes, yes the difference between
> the narrow and wide build has been a known issue for a long time, but
> has become less and less ignorable as unicode adoption has grown.

Little would make me happier than for Python to move to a logical
character model the way Go and Perl treat them. I find getting bogged
down by code units to be abhorrently low level, and it doesn't matter
whether these are 8-bit code units as in PHP or 16-bit code units as in
Java. The Nth character of a string should always be its Nth logical
code point, not its Nth physical code unit. In regular expressions, this
is a clearly stated requirement in the Unicode Standard (see tr18 RL1.1
and RL1.7).

However, it is more than that. The entire character processing model
really should be logical, not physical. That's because you need to have
whole code points, not broken-up code units, before you can build the
still higher-level components needed to meet user expectations.
These include user-visible graphemes (like an E with both a combining
acute and a combining up tack below) and linguistic collating elements
(like the letter <ch> in Spanish, <dzs> in Hungarian, or <dd> in Welsh).

> Martin's PEP that Matthew references is the only proposed fix that I
> know of. There is a GSoC project working on it, but I have no idea
> what the status is. Other possible fixes are using UTF-8 or UTF-32.

One reason I don't like that PEP is that if you are really that
concerned with storage space, it is too prone to spoilers. Neither UTF-8
nor UTF-32 has any spoilers, but that PEP does. What I mean is that just
one lone code point in a huge string is enough to change the
representation of everything in that string. I think of these as
spoilers; strings that are mostly non-spoilers with just a bit of
spoiler in them are super, super common. Often it's all ASCII plus just
a few ISO-8859-1 or other non-ASCII Latin characters. Or it's all Latin
with a couple of non-BMP mathematical alphanumerics thrown in. That kind
of thing.

Consider this mail message. It contains exactly six non-ASCII code
points.

    % uniwc `mhpath cur +drafts`
    Paras    Lines    Words   Graphs    Chars    Bytes File
       79      345     2796    16899    16899    16920 /home/tchrist/Mail/drafts/1183

Because it is in UTF-8, its memory profile in bytes grows only very
slightly over its character count. However, if you adopt the PEP, then
you pay and pay and pay, very nearly quadrupling the memory profile for
six particular characters. Now it takes 67596 bytes instead of 16920,
just for the sake of six code points. Ouch!!

Why would you want to do that? You say you are worried about memory, but
then you would do this sort of thing. I just don't understand.
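The spoiler effect is easy to measure on any interpreter that implements
the PEP's flexible representation (as CPython eventually did with PEP
393). A sketch, with the caveat that exact object sizes vary by Python
version and platform:

    import sys

    ascii_text = "a" * 10_000               # pure ASCII: 1 byte per code point
    spoiled = ascii_text + "\U0001D518"     # one astral "spoiler" appended

    print(sys.getsizeof(ascii_text))  # roughly 10 KB of payload
    print(sys.getsizeof(spoiled))     # roughly 40 KB: 4 bytes per character

    # One lone code point has changed the representation of the whole string.
    print(sys.getsizeof(spoiled) > 3 * sys.getsizeof(ascii_text))  # True

Under a fixed UTF-8 or UTF-32 representation, appending one character
would change the size by a few bytes either way; under the PEP, it
reclassifies every character in the string.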
I may be wrong here, not least because I can think of possible
extenuating circumstances, but it is my impression that there is an
underlying assumption in the Python community and many others that being
able to access the Nth character in a string in constant time for
arbitrary N is the most important of all possible considerations.

I don't believe that makes as much sense as people think, because I
don't believe character strings really are accessed in that fashion very
often at all. Sure, if you have a 2D matrix of strings where a given
row-column pair yields one character and you're taking the diagonal, you
might want that, but how often does that actually happen? Virtually
never: these are strings, not matrices we're running FFTs on, after all.
We don't need to be able to load them into vector registers or anything
the way the number-crunching people do.

That's because strings are a sequence of characters: they're text.
Whether reading text left to right, right to left, or even
boustrophedonically, you're always going one past the character you're
currently at. You aren't going to the Nth character forward or back for
arbitrary N. That isn't how people deal with text. Sometimes they do
look at the end, or a few in from the far end, but even that can be
handled in other fashions.

I need to see firm use-case data justifying this overwhelming need for
O(1) access to the Nth character before I will believe it. I think it is
too rare to be as concerned with as so many people bizarrely appear to
be.

This attachment has serious consequences. It is because of this
attachment that the whole narrow/wide build thing occurs, where people
are willing to discard a clean, uniform processing model in search of
what I do not believe is a reasonable or realistic goal. Even if they
*were* correct, *far* more bugs are caused by unreliability than by
performance.

 * If you were truly concerned with memory use, you would simply use
   UTF-8.
 * If you were truly concerned with O(1) access time, you would always
   use UTF-32.

 * Anything that isn't one of these two is some sort of messy
   compromise.

I promise that nobody ever had a BMP vs non-BMP bug who was working with
either UTF-8 or UTF-32. This only happens with UTF-16 and UCS-2, which
have all the disadvantages of both UTF-8 and UTF-32 combined yet none of
the advantages of either. It's the worst of both worlds.

Because you're using UTF-16, you're already paying quite a bit of memory
for text processing in Python compared to doing so in Go or in Perl,
which are both UTF-8 languages. Since you're already used to paying
extra, what's so wrong with going to purely UTF-32? That too would solve
things.

UTF-8 is not the memory pig people allege it is on Asian text. Consider:
I saved the Tokyo Wikipedia page for each of these languages as NFC text
and generated the following table comparing them. I've grouped the
languages into Western Latin, Western non-Latin, and Eastern.

    Paras Lines Words  Graphs  Chars  UTF16   UTF8  8:16 16:8  Language

      519  1525  6300   43120  43147  86296  44023   51% 196%  English
      343   728  1202    8623   8650  17302   9173   53% 189%  Welsh
      541  1722  9013   57377  57404 114810  59345   52% 193%  Spanish
      529  1712  9690   63871  63898 127798  67016   52% 191%  French
      321   837  2442   18999  19026  38054  21148   56% 180%  Hungarian

      202   464   976    7140   7167  14336  11848   83% 121%  Greek
      348   937  2938   21439  21467  42936  36585   85% 117%  Russian

      355   788   613    6439   6466  12934  13754  106%  94%  Chinese, simplified
      209   419   243    2163   2190   4382   3331   76% 132%  Chinese, traditional
      461  1127  1030   25341  25368  50738  65636  129%  77%  Japanese
      410   925  2955   13942  13969  27940  29561  106%  95%  Korean

Where:

 * Paras is the number of blankline-separated text spans.
 * Lines is the number of linebreak-separated text spans.
 * Words is the number of whitespace-separated text spans.
 * Graphs is the number of Unicode extended grapheme clusters.
 * Chars is the number of Unicode code points.
 * UTF16 is how many bytes it takes up stored as UTF-16.
 * UTF8 is how many bytes it takes up stored as UTF-8.
 * 8:16 is the ratio of UTF-8 size to UTF-16 size as a percentage.
 * 16:8 is the ratio of UTF-16 size to UTF-8 size as a percentage.
 * Language is which version of the Tokyo page we're talking about here.

Here are my observations:

 * Western languages that use the Latin script suffer terribly upon
   conversion from UTF-8 to UTF-16, with English suffering the most by
   expanding by 96% and Hungarian the least by expanding by 80%. All are
   huge.

 * Western languages that do not use the Latin script still suffer, but
   only 15-20%.

 * Eastern languages DO NOT SUFFER in UTF-8 the way everyone claims that
   they do!

To expand on the last point:

 * In Korean and in (simplified) Chinese, you get only 6% bigger in
   UTF-8 than in UTF-16.

 * In Japanese, you get only 29% bigger in UTF-8 than in UTF-16.

 * The traditional Chinese actually got smaller in UTF-8 than in UTF-16!
   In fact, it costs 32% to use UTF-16 over UTF-8 for this sample. If
   you look at the Lines and Words columns, it looks like this might be
   due to whitespace usage.

So UTF-8 isn't even too bad on Asian languages. But you howl that it's
variable width. So? You're already using a variable-width encoding in
Python on narrow builds. I know you think otherwise, but I'll prove this
below.

Variable width isn't as bad as people claim, partly because fixed width
is not as good as they claim. Think of the kind of operations that you
normally do on strings. You want to go to the next user-visible
grapheme, or to the end of the current word, or to the start of the next
line. UTF-32 helps you with none of those, and UTF-8 does not hurt them.
You cannot instantly go to a particular address in memory for any of
those unless you build up a table of offsets, as some text editors
sometimes do, especially for lines. You simply have to parse it out as
you go.

Here's why I say that Python uses UTF-16, not UCS-2, on its narrow
builds.
Perhaps someone could tell me why the Python documentation says it uses
UCS-2 on a narrow build. I believe this to be an error, because
otherwise I cannot explain how you can have non-BMP code points in your
UTF-8 literals in your source code. And you clearly can:

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    super = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"
    print(super)

This is with a narrow build on Darwin:

    % python3.2 -c 'import sys; print(sys.maxunicode)'
    65535
    % export PYTHONIOENCODING=UTF-8 PERLUNICODE=S
    % python3.2 supertest | uniwc
    Paras    Lines    Words   Graphs    Chars    Bytes File
        0        1        1        8        8       29 standard input
    % python3.2 supertest | uniquote -x
    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}

Observations and conclusion:

 * You are emitting 8 code points, 7 in the SMP not in the BMP.
 * You clearly understand code points above your alleged maxunicode
   value.
 * If you were actually using UCS-2, those would not be possible.
 * I submit that this proves you are actually using UTF-16. Q.E.D.

Yet you are telling people you are using UCS-2. Why is that?

Since you are already using a variable-width encoding, why the
supercilious attitude toward UTF-8? UTF-16 has the same properties but
costs you a lot more. As I said before, UTF-16 puts you in the worst of
all worlds:

 * If you were truly concerned with memory use, you would simply use
   UTF-8.
 * If you were truly concerned with O(1) access time, you would always
   use UTF-32.
 * Anything that isn't one of these two is some sort of messy
   compromise.

But even with UTF-16 you *can* present to the user a view of logical
characters that doesn't care about the underlying clunkish
representation. The Java regex engine proves that, since "." always
matches a single code point no matter whether it is in the BMP or not.
Similarly, ICU's classes operate on logical characters -- code points,
not units -- even though they use UTF-16 internally. The Nth code point
does not care and should not care how many units it takes to get there.
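A short sketch makes the code-unit versus code-point distinction
concrete. It shows the surrogate pair a narrow build (or Java, or any
UTF-16 representation) stores for one SMP character; the final len()
result assumes a wide build, where a narrow build would instead report
2:

    s = "\U0001D518"        # MATHEMATICAL FRAKTUR CAPITAL U, one code point

    # As UTF-16, this single code point becomes two 16-bit code units,
    # a high and a low surrogate -- what a narrow build stores internally.
    units = s.encode("utf-16-be")
    print(len(units) // 2)  # 2 code units
    print(units.hex())      # d835dd18 -- high surrogate, low surrogate

    # A logical-character model counts code points, not units:
    print(len(s))           # 1 on a wide build; a narrow build says 2

That last line is precisely the difference between a code-point
interface and a code-unit interface leaking through to the user.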
It is fine to have both a byte interface *and* a character interface,
but I don't believe having something that falls in between those two is
of any use whatsoever. And if you don't have a code point interface, you
don't have a character interface. This is my biggest underlying
complaint about Python's string model, but I believe it fixable, even if
doing so exceeds my own personal background.

--tom

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________