On 8/31/2011 5:21 AM, Stephen J. Turnbull wrote:
Glenn Linderman writes:

  >   From comments Guido has made, he is not interested in changing the
  >  efficiency or access methods of the str type to raise the level of
  >  support of Unicode to the composed character, or grapheme cluster
  >  concepts.

IMO, that would be a bad idea,

OK, you agree with Guido.

as higher-level Unicode support should
either be a wrapper around full implementations such as ICU (or
platform support in .NET or Java), or written in pure Python at first.
Thus there is a need for an efficient array of code units type.  PEP
393 allows this to go to the level of code points, but evidently that
is inappropriate for Jython and IronPython.

  >  The str type itself can presently be used to process other
  >  character encodings:

Not really.  Remember, on input codecs always decode to Unicode and on
output they always encode from Unicode.  How do you propose to get
other encodings into the array of code units?

Here are two ways (there may be more): custom codecs, and direct assignment.
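
To sketch what I mean (purely illustrative; the "raw-units" codec name
and the helper functions below are made up, and the codec does no
validation):

    import codecs

    # Direct assignment: build a str whose "characters" are really raw
    # 16-bit code units of some other encoding (chr() accepts any value
    # up to 0x10FFFF, including lone surrogates).
    units = [0x0041, 0xDC00, 0x20AC]
    s = ''.join(chr(u) for u in units)   # str used as an array of code units

    # A custom codec that passes bytes through one-for-one as code units
    # instead of interpreting them as a Unicode encoding.
    def _raw_decode(data, errors='strict'):
        return ''.join(chr(b) for b in data), len(data)

    def _raw_encode(text, errors='strict'):
        return bytes(ord(c) for c in text), len(text)  # only units < 256

    def _search(name):
        if name == 'raw-units':
            return codecs.CodecInfo(_raw_encode, _raw_decode, name='raw-units')
        return None

    codecs.register(_search)
    print(b'\x41\xff'.decode('raw-units'))  # 'A\xff' -- bytes kept as code units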

  >  [A "true Unicode" type] could be based on extensions to the
  >  existing str type, or it could be based on the array type, or it
  >  could based on the bytes type.  It could use an internal format of
  >  32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
  >  16-bit codeunits.

In theory yes, but in practice all of the string methods and libraries
like re operate on str (and often but not always bytes; in particular,
codecs always decode from bytes and encode to bytes).

Why bother with anything except arrays of code points at the start?
PEP 393 makes that time-efficient and reasonably space-efficient as a
starting point and allows starting with re or MRAB's regex to get
basic RE functionality or good UTS #18 functionality respectively.
Plus str already has all the usual string operations (.startswith(),
.join(), etc), and we have modules for dealing with the Unicode
Character Database.  Why waste effort reintegrating with all that until
we have common use cases that need a more efficient representation?

String methods could be reimplemented on any appropriate type, of course. Rejecting alternatives too soon might make one miss the best design.
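
For example, a composed-character view could already be layered over str
in pure Python using nothing but unicodedata; a rough sketch (the class
name and its simplistic clustering rule are mine, and it ignores most of
UAX #29):

    import unicodedata

    class GraphemeView:
        """Very rough composed-character indexing over a plain str.

        Clusters are approximated as a base code point plus any following
        combining marks; real grapheme segmentation (UAX #29) is more
        involved.
        """
        def __init__(self, s):
            self._s = s
            self._starts = [i for i, ch in enumerate(s)
                            if i == 0 or unicodedata.combining(ch) == 0]
            self._starts.append(len(s))

        def __len__(self):
            return len(self._starts) - 1

        def __getitem__(self, i):
            return self._s[self._starts[i]:self._starts[i + 1]]

    v = GraphemeView('re\u0301sume\u0301')    # "resume" with combining accents
    print(len('re\u0301sume\u0301'), len(v))  # 8 code points, 6 composed characters
    print(v[1])                               # 'e\u0301'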

There would be some issue in coming up with an appropriate UTF-16 to
code point API for Jython and IronPython, but Terry Reedy has a rather
efficient library for that already.

Yes, Terry's implementation is interesting and inspiring, and the concept could be extended to a variety of interesting techniques: codepoint access on top of code unit representations, and multi-codepoint character access on top of either code unit or codepoint representations.
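
Roughly the idea as I understand it (my own sketch, not Terry's actual
code, and the names are made up): record which code point positions are
occupied by surrogate pairs, then let bisect translate a code point index
into a code unit index.

    from bisect import bisect_left

    def astral_positions(units):
        """Code point indexes of characters stored as surrogate pairs in
        a sequence of UTF-16 code units."""
        result, cp, i = [], 0, 0
        while i < len(units):
            if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                result.append(cp)
                i += 2
            else:
                i += 1
            cp += 1
        return result

    def unit_index(astral, cp_index):
        """Code unit index of the code point at cp_index.

        Every astral character before cp_index occupies one extra code
        unit; bisect counts them in O(log n)."""
        return cp_index + bisect_left(astral, cp_index)

    units = [ord('A'), 0xD83D, 0xDE00, ord('B')]  # 'A', U+1F600, 'B'
    table = astral_positions(units)               # [1]
    print(unit_index(table, 2))                   # 3 -- 'B' starts at code unit 3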

So this discussion of alternative representations, including use of
high bits to represent properties, is premature optimization
... especially since we don't even have a proto-PEP specifying how
much conformance we want of this new "true Unicode" type in the first
place.

We need to focus on that before optimizing anything.

You may call it premature optimization if you like, or you can ignore the concepts and emails altogether. I call it brainstorming for ideas, looking for non-obvious solutions to the problem of representation of Unicode.

I found your discussion of streams versus arrays, as separate concepts related to Unicode, along with Terry's bisect indexing implementation, to be rather inspiring. Just because Unicode defines streams of code units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when processes communicate, and for storage (which is one way processes communicate), that doesn't imply that the internal representation of character strings in a programming language must use exactly that representation. While there are efficiencies in using the same representation as the communications streams, there are also inefficiencies.

I'm unaware of any current Python implementation that has chosen UTF-8 as the internal representation of character strings (though I'm aware that Perl has made that choice), yet UTF-8 is one of the commonly recommended character representations on the Linux platform, from what I read. So in that sense, Python has already rejected the idea of using the "native" or "OS-configured" representation as its internal representation. Why, then, must one choose from the repertoire of Unicode-defined stream representations at all, if they don't meet the goal of efficient length, indexing, or slicing operations on actual characters?
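
To make the indexing cost concrete: with a UTF-8 internal representation,
finding the i-th code point means scanning lead bytes from the start of
the string, whereas an array of code points (or PEP 393's layout) can
index directly. A toy illustration (mine, just to show the scan):

    def utf8_codepoint_at(data, index):
        """Return the code point at position `index` in UTF-8 bytes.

        UTF-8 is a variable-width stream, so reaching the i-th character
        requires counting lead bytes from the beginning: O(n)."""
        count = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:          # a lead byte, not a continuation
                if count == index:
                    end = pos + 1
                    while end < len(data) and data[end] & 0xC0 == 0x80:
                        end += 1
                    return data[pos:end].decode('utf-8')
                count += 1
        raise IndexError(index)

    s = '\u03b1\u03b2\u03b3'                        # Greek alpha, beta, gamma
    print(utf8_codepoint_at(s.encode('utf-8'), 1))  # beta, found by an O(n) scan
    print(s[1])                                     # beta, O(1) on a code point array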