On 2012-12-20 19:19, wxjmfa...@gmail.com wrote:
Fact.
In order to work comfortably and efficiently with a "scheme for
the coding of the characters", be it Unicode or any other coding
scheme, one has to take two things into account: 1) work with a
single set of characters and 2) work with a contiguous block of
code points.

At this point, it should be noted that I have not even written
about the actual coding, only about characters and code points.

Now, let's take a look at what happens when one breaks the rules
above, namely when one attempts to work with multiple character
sets or artificially divides the whole range of Unicode code
points into chunks.

The first (and quite obvious) consequence is that you create
bloated, unnecessary and useless code. I will simplify the
flexible string representation (FSR) and use an "ascii" /
"non-ascii" model and terminology.

If you are an "ascii" user, a FSR model has no sense. An
"ascii" user will use, per definition, only "ascii characters".

If you are a "non-ascii" user, the FSR model is also a non
sense, because you are per definition a n"on-ascii" user of
"non-ascii" character. Any optimisation for "ascii" user just
become irrelevant.

In a sense, to escape from this you would have to be both a
non-"ascii" user and a non-"non-ascii" user at the same time.
Impossible. In both cases the FSR model is useless, and in both
cases you are forced to use bloated and unnecessary code.

The rule is to treat every character of the single character set
of a coding scheme in, so to speak, an "equal way". The problem
can also be seen the other way around: every coding scheme has
been built to work with a single set of characters, otherwise it
does not work properly!

[snip]
It's true that in an ideal world you would treat all codepoints the
same. However, this is a case where "practicality beats purity".
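
In fact, at the Python level the FSR is invisible and every codepoint
really is treated the same; only the internal storage width changes.
A minimal sketch (assuming CPython 3.3 or later):

# len() counts codepoints and indexing yields codepoints, for ASCII,
# BMP and astral-plane text alike, whatever width the FSR picks
# internally.
for s in ["abc", "caf\u00e9", "\U0001F600bc"]:
    print(len(s), s[0], [hex(ord(c)) for c in s])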

In order to accommodate every codepoint you need 3 bytes per codepoint
(although for pragmatic reasons it's 4 bytes per codepoint).
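
The arithmetic behind that, as a quick sanity check in plain Python:

import math

# Codepoints run from 0x0 to 0x10FFFF, which needs 21 bits, i.e. 3
# bytes; implementations round up to 4 for alignment and O(1)
# indexing (as UTF-32 does).
bits = (0x10FFFF).bit_length()
print(bits, math.ceil(bits / 8))    # -> 21 3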

But not all codepoints are used equally. Those in the "astral plane",
for example, are used rarely, so the vast majority of the time you
would be using twice as much memory as strictly necessary. There are
also, in reality, many cases in which strings contain only ASCII-range
codepoints, even if they are not visible to the average user: the
names of functions and attributes in program code, for instance, or
tags and attributes in HTML and XML.
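
That imbalance is easy to see from CPython itself; a minimal sketch
(assuming CPython 3.3 or later; exact figures depend on the build, so
compare the per-character differences rather than the absolute sizes):

import sys

# The same number of codepoints costs 1, 2 or 4 bytes each depending
# on the widest codepoint present in the string.
ascii_s  = "a" * 1000           # Latin-1 range: 1 byte per codepoint
bmp_s    = "\u20ac" * 1000      # EURO SIGN, BMP: 2 bytes per codepoint
astral_s = "\U0001F600" * 1000  # emoji, astral plane: 4 bytes per codepoint

for label, s in [("ascii", ascii_s), ("bmp", bmp_s), ("astral", astral_s)]:
    print(label, sys.getsizeof(s))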

The FSR is a pragmatic way of dealing with limited resources.

Would you prefer there to be a switch that makes strings always use 4
bytes per codepoint for those users and systems where memory is no
object?
