On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:
Glenn Linderman writes:

  >  We can either artificially constrain ourselves to minor tweaks of
  >  the legal conforming bytestreams,

It's not artificial.  Having the internal representation be the same
as a standard encoding is very useful for a large number of minor
usages (urgently saving buffers in a text editor that knows its
internal state is inconsistent, viewing strings in the debugger, PEP
393-style space optimization is simpler if text properties are
out-of-band, etc).

Saving buffers urgently when the internal state is inconsistent sounds like carefully preserving a bug. Windows 7 64-bit on one of my computers happily crashes several times a day when it detects inconsistent internal state... under the theory, I guess, that losing work is better than saving bad work. You seem to take the opposite view.

I'm actually very grateful that Firefox and emacs recover gracefully from Windows crashes, and I lose very little data from them, but I cannot recommend Windows 7 for stability (this machine being my only experience with it).

In any case, the operations you mention still require the data to be processed, if ever so slightly, and I'll admit that a more complex representation would require a bit more processing. It's not clear that the overhead would be large or problematic for these cases.

Except that I'm not sure how the PEP 393 space optimization fits with the other operations. It may even be that an application-wide cache of complex graphemes would save significant space, although if it used high bits in the string representation to reference the cache, PEP 393 would immediately jump to something over 16 bits per grapheme... though it likely would anyway, if complex graphemes appear in the data stream.
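
To make that cache idea concrete, here is a very rough sketch (all the names are hypothetical, mine for illustration only): graphemes needing more than one code point get interned in a process-wide table, and a 32-bit string unit with its high bit set is an index into that table rather than a code point.

    CACHE_BIT = 1 << 31      # spare high bit in a 32-bit unit
    _cache = []              # process-wide table of complex graphemes
    _index = {}              # grapheme -> slot, so entries are shared

    def intern_grapheme(g):
        """Return a 32-bit unit representing the grapheme g."""
        if len(g) == 1:
            return ord(g)            # single code point: store it directly
        slot = _index.get(g)
        if slot is None:
            slot = len(_cache)
            _cache.append(g)
            _index[g] = slot
        return CACHE_BIT | slot      # reference into the shared cache

    def unit_to_grapheme(u):
        """Recover the grapheme that a 32-bit unit stands for."""
        return _cache[u & 0x7FFFFFFF] if u & CACHE_BIT else chr(u)

Each distinct complex grapheme would then be stored once per process, however many strings contain it; the price, as I said, is that any string containing one needs more than 16 bits per unit under PEP 393.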

  >  or we can invent a representation (whether called str or something
  >  else) that is useful and efficient in practice.

Bring on the practice, then.  You say that a bit to identify lone
surrogates might be useful or efficient.  In what application?  How
much time or space does it save?

I didn't attribute any efficiency to flagging lone surrogates (BI-5). Since Windows uses a non-validated UCS-2 or UTF-16 character type, any Python program that obtains data from Windows APIs may be confronted with lone surrogates or inappropriate combining characters at any time. Round-tripping that data seems useful, even though the data itself may not be as useful as validated Unicode characters would be.

Accidentally combining the characters due to slicing and dicing the data, doing normalizations, or whatnot would likely be inappropriate. However, returning modified forms of the data to Windows as UCS-2 or UTF-16 may still cause other applications to combine the characters later, if the modifications juxtaposed things so that they look reasonable, even if only by accident. If the combination were intentional, of course, the bit could be turned off.

This exact sort of problem with non-validated UTF-8 bytes has already been addressed in Python (PEP 383's surrogateescape handler), mostly for Linux, allowing round-tripping of the byte stream even though it is not valid. BI-6 suggests a different scheme for that, without introducing lone surrogates (which might accidentally get combined with other lone surrogates).
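
To be concrete about both halves of that, here is the existing bytes mechanism I mean (PEP 383's surrogateescape error handler) and the accidental-combination hazard, in ordinary Python 3:

    # PEP 383: undecodable bytes become lone surrogates and round-trip.
    raw = b'abc\xff'                                    # not valid UTF-8
    s = raw.decode('utf-8', 'surrogateescape')
    assert s == 'abc\udcff'                             # 0xFF smuggled as U+DCFF
    assert s.encode('utf-8', 'surrogateescape') == raw  # exact round-trip

    # The hazard: two lone surrogates that slicing and dicing happen to
    # juxtapose read back as a single character after a trip through UTF-16.
    hi, lo = '\ud83d', '\ude00'
    fused = (hi + lo).encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
    assert fused == '\U0001F600'                        # the halves combined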

You say that a bit to cache a
property might be useful or efficient.  In what application?  Which
properties?  Are those properties a set fixed by the language, or
would some bits be available for application-specific property
caching?  How much time or space does that save?

The brainstorming ideas I presented were just that: ideas. They were independent of one another, and the use of many high-order bits for properties was one of the independent ones. When I wrote that one, I was assuming a UTF-32 representation (which wastes 11 bits of each 32).

One thing I did have in mind for the high-order bits in that representation was to flag whether each code is the start, middle, or end of the codes that make up a grapheme. That would be redundant with some of the Unicode code point property databases, if I understand them properly; whether it would make iterators enough faster to be worth the complexity would have to be benchmarked. After writing all those ideas down, I actually preferred some of the others, which achieved O(1) indexing of real graphemes rather than caching character properties.
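
As a toy sketch of that flagging idea (using unicodedata.combining() as a crude stand-in for real grapheme segmentation, which is considerably more involved):

    import unicodedata

    START = 1 << 31          # spare high bit: "this unit starts a grapheme"

    def encode_units(s):
        """32-bit units: code point in the low bits, boundary flag on top."""
        return [ord(ch) | (0 if unicodedata.combining(ch) else START)
                for ch in s]

    def iter_graphemes(units):
        """Iterate graphemes by testing one bit per unit, no table lookups."""
        cluster = []
        for u in units:
            if (u & START) and cluster:
                yield ''.join(cluster)
                cluster = []
            cluster.append(chr(u & 0x1FFFFF))   # code points fit in 21 bits
        if cluster:
            yield ''.join(cluster)

The flag is exactly the redundancy I mentioned: it duplicates what the property database would say, but the iterator touches only one bit per unit; whether that wins would have to be benchmarked.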

What are the costs to applications that don't want the cache?  How is
the bit-cache affected by PEP 393?

If it is a separate type from str, then it costs nothing except the extra code space to implement the cache for the applications that do want it... and most of that code wouldn't even be loaded by applications that don't, if it were done as a module or C extension.
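
For illustration only (GraphemeStr and its approximate segmentation are hypothetical, just to show the pay-only-if-you-import-it shape):

    import unicodedata

    class GraphemeStr:
        """Wraps a str and precomputes grapheme-start offsets once.
        Approximation: a code point starts a grapheme unless it is a
        combining mark; real segmentation (UAX #29) is more involved."""

        def __init__(self, s):
            self._s = s
            self._starts = [i for i, ch in enumerate(s)
                            if not unicodedata.combining(ch)]

        def __len__(self):            # grapheme count, O(1)
            return len(self._starts)

        def __getitem__(self, i):     # i-th grapheme, O(1)
            start = self._starts[i]
            stop = (self._starts[i + 1] if i + 1 < len(self._starts)
                    else len(self._s))
            return self._s[start:stop]

GraphemeStr('e\u0301galite\u0301')[6] comes back as the two-code-point 'é' in O(1); applications that never import the module pay nothing for it.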

I know of no answers (none!) to those questions that favor
introduction of a bit-cache representation now.  And those bits aren't
going anywhere; it will always be possible to use a "wide" build and
change the representation later, if the optimization is valuable
enough.  Now, I'm aware that my experience is limited to the
implementations of one general-purpose language (Emacs Lisp) of
restricted applicability.  But its primary use *is* in text processing,
so I'm moderately expert.

*Moderately*.  Always interested in learning more, though.  If you
know of relevant use cases, I'm listening!  Even if Guido doesn't find
them convincing for Python, we might find them interesting at XEmacs.

OK... ignore the bit-cache idea (BI-1), reread the others without having your mind clogged with that one, and see if any of them make sense to you then. But you may be too biased by the "minor" needs of keeping the internal representation similar to the stream representation to see any value in them. I rather like BI-2, since it allows O(1) indexing of graphemes.