On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:
Glenn Linderman writes:

  >   From comments Guido has made, he is not interested in changing the
  >  efficiency or access methods of the str type to raise the level of
  >  support of Unicode to the composed character, or grapheme cluster
  >  concepts.

IMO, that would be a bad idea,

OK, you agree with Guido.

as higher-level Unicode support should
either be a wrapper around full implementations such as ICU (or
platform support in .NET or Java), or written in pure Python at first.
Thus there is a need for an efficient array of code units type.  PEP
393 allows this to go to the level of code points, but evidently that
is inappropriate for Jython and IronPython.

  >  The str type itself can presently be used to process other
  >  character encodings:

Not really.  Remember, on input codecs always decode to Unicode and on
output they always encode from Unicode.  How do you propose to get
other encodings into the array of code units?

Here are two ways (there may be more): custom codecs, and direct assignment.
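
To sketch what I mean (purely illustrative; the "raw-units" codec name
and the helper functions below are made up, and the codec does no
validation):

    import codecs

    # Direct assignment: build a str whose "characters" are really raw
    # 16-bit code units of some other encoding (chr() accepts any value
    # up to 0x10FFFF, including lone surrogates).
    units = [0x0041, 0xDC00, 0x20AC]
    s = ''.join(chr(u) for u in units)   # str used as an array of code units

    # A custom codec that passes bytes through one-for-one as code units
    # instead of interpreting them as a Unicode encoding.
    def _raw_decode(data, errors='strict'):
        return ''.join(chr(b) for b in data), len(data)

    def _raw_encode(text, errors='strict'):
        return bytes(ord(c) for c in text), len(text)  # only units < 256

    def _search(name):
        if name == 'raw-units':
            return codecs.CodecInfo(_raw_encode, _raw_decode, name='raw-units')
        return None

    codecs.register(_search)
    print(b'\x41\xff'.decode('raw-units'))  # 'A\xff' -- bytes kept as code units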

  >  [A "true Unicode" type] could be based on extensions to the
  >  existing str type, or it could be based on the array type, or it
  >  could based on the bytes type.  It could use an internal format of
  >  32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
  >  16-bit codeunits.

In theory yes, but in practice all of the string methods and libraries
like re operate on str (and often but not always bytes; in particular,
codecs always decode from bytes and encode to bytes).

Why bother with anything except arrays of code points at the start?
PEP 393 makes that time-efficient and reasonably space-efficient as a
starting point and allows starting with re or MRAB's regex to get
basic RE functionality or good UTS #18 functionality respectively.
Plus str already has all the usual string operations (.startswith(),
.join(), etc), and we have modules for dealing with the Unicode
Character Database.  Why waste effort reintegrating with all that until
we have common use cases that need a more efficient representation?

String methods could be reimplemented on any appropriate type, of course. Rejecting alternatives too soon might make one miss the best design.
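
For example, a composed-character view could already be layered over str
in pure Python using nothing but unicodedata; a rough sketch (the class
name and its simplistic clustering rule are mine, and it ignores most of
UAX #29):

    import unicodedata

    class GraphemeView:
        """Very rough composed-character indexing over a plain str.

        Clusters are approximated as a base code point plus any following
        combining marks; real grapheme segmentation (UAX #29) is more
        involved.
        """
        def __init__(self, s):
            self._s = s
            self._starts = [i for i, ch in enumerate(s)
                            if i == 0 or unicodedata.combining(ch) == 0]
            self._starts.append(len(s))

        def __len__(self):
            return len(self._starts) - 1

        def __getitem__(self, i):
            return self._s[self._starts[i]:self._starts[i + 1]]

    v = GraphemeView('re\u0301sume\u0301')    # "resume" with combining accents
    print(len('re\u0301sume\u0301'), len(v))  # 8 code points, 6 composed characters
    print(v[1])                               # 'e\u0301'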

There would be some issue in coming up with an appropriate UTF-16 to
code point API for Jython and IronPython, but Terry Reedy has a rather
efficient library for that already.

Yes, Terry's implementation is interesting and inspiring, and the concept could be extended to a variety of interesting techniques: codepoint access on top of code unit representations, and multi-codepoint character access on top of either code unit or codepoint representations.
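
Roughly the idea as I understand it (my own sketch, not Terry's actual
code, and the names are made up): record which code point positions are
occupied by surrogate pairs, then let bisect translate a code point index
into a code unit index.

    from bisect import bisect_left

    def astral_positions(units):
        """Code point indexes of characters stored as surrogate pairs in
        a sequence of UTF-16 code units."""
        result, cp, i = [], 0, 0
        while i < len(units):
            if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                result.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        return result

    def unit_index(astral, cp_index):
        """Code unit index of the code point at cp_index.

        Every astral character before cp_index occupies one extra code
        unit; bisect counts them in O(log n)."""
        return cp_index + bisect_left(astral, cp_index)

    units = [ord('A'), 0xD83D, 0xDE00, ord('B')]  # 'A', U+1F600, 'B'
    table = astral_positions(units)               # [1]
    print(unit_index(table, 2))                   # 3 -- 'B' starts at code unit 3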

So this discussion of alternative representations, including use of
high bits to represent properties, is premature optimization
... especially since we don't even have a proto-PEP specifying how
much conformance we want of this new "true Unicode" type in the first
place.

We need to focus on that before optimizing anything.

You may call it premature optimization if you like, or you can ignore the concepts and emails altogether. I call it brainstorming for ideas, looking for non-obvious solutions to the problem of representation of Unicode.

I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, to be rather inspiring. Just because Unicode defines streams of code units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when processes communicate, and for storage (which is one way processes communicate), that doesn't imply that the internal representation of character strings in a programming language must use exactly that representation. While there are efficiencies in using the same representation as the communications streams, there are also inefficiencies.

I'm unaware of any current Python implementation that has chosen UTF-8 as the internal representation of character strings (though I'm aware that Perl has made that choice), yet UTF-8 is one of the commonly recommended character representations on the Linux platform, from what I read. So in that sense, Python has already rejected the idea of using the "native" or "OS-configured" representation as its internal representation. Why, then, must one choose from the repertoire of Unicode-defined stream representations at all, if they don't meet the goal of efficient length, indexing, or slicing operations on actual characters?
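
To make the indexing cost concrete: with a UTF-8 internal representation,
finding the i-th code point means scanning lead bytes from the start of
the string, whereas an array of code points (or PEP 393's layout) can
index directly. A toy illustration (mine, just to show the scan):

    def utf8_codepoint_at(data, index):
        """Return the code point at position `index` in UTF-8 bytes.

        UTF-8 is a variable-width stream, so reaching the i-th character
        requires counting lead bytes from the beginning: O(n)."""
        count = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:          # a lead byte, not a continuation
                if count == index:
                    end = pos + 1
                    while end < len(data) and data[end] & 0xC0 == 0x80:
                        end += 1
                    return data[pos:end].decode('utf-8')
                count += 1
        raise IndexError(index)

    s = '\u03b1\u03b2\u03b3'                        # Greek alpha, beta, gamma
    print(utf8_codepoint_at(s.encode('utf-8'), 1))  # beta, found by an O(n) scan
    print(s[1])                                     # beta, O(1) on a code point array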