Glenn Linderman writes:

 > Windows 7 64-bit on one of my computers happily crashes several
 > times a day when it detects inconsistent internal state... under
 > the theory, I guess, that losing work is better than saving bad
 > work.  You sound the opposite.
Definitely.  Windows apps habitually overwrite existing work; saving
when inconsistent would be a bad idea.  The apps I work on dump their
unsaved buffers to new files, and give you a chance to look at them
before instating them as the current version when you restart.

 > Except, I'm not sure how PEP 393 space optimization fits with the
 > other operations.  It may even be that an application-wide
 > complex-grapheme cache would save significant space, although if it
 > uses high bits in a string representation to reference the cache,
 > PEP 393 would jump immediately to something > 16 bits per
 > grapheme... but likely would anyway, if complex graphemes are in
 > the data stream.

The only language I know of that uses thousands of complex graphemes
is Korean ... and the precomposed forms are already in the BMP.  I
don't know how many accented forms you're likely to see in Vietnamese,
but I suspect it's fewer than 6400 (the number of characters in
private space in the BMP).  So for most applications, I believe that
mapping both non-BMP code points and grapheme clusters into that
private space should be feasible.

The only potential counterexample I can think of is display of Arabic,
which I have heard has thousands of glyphs in good fonts because of
the various ways ligatures form in that script.  However, AFAIK no
apps encode these as characters; I'm just admitting that it *might* be
useful.

This will require some care in registering such characters and
clusters, because input text may already use private space according
to some convention, which would need to be respected.  Still, 6400
characters is a lot, even for the Japanese (IIRC the combined
repertoire of "corporate characters" that for some reason never made
it into the JIS sets is about 600, but almost all of them are already
in the BMP).  I believe the total number of Japanese emoticons is
about 200, but I doubt that any given text is likely to use more than
a few.  So I think there's plenty of space there.
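The private-space mapping can be sketched in a few lines of Python.
Everything here is hypothetical (the `ClusterRegistry` name and its
API are mine, nothing in CPython); it only illustrates registering
grapheme clusters against BMP private-use code points
(U+E000..U+F8FF, 6400 of them) so each cluster occupies a single
code unit:

```python
# Hypothetical sketch: fold multi-code-point grapheme clusters into
# the BMP private use area so each cluster is one code unit.

PUA_START, PUA_END = 0xE000, 0xF8FF  # 6400 BMP private-use code points

class ClusterRegistry:
    def __init__(self):
        self._next = PUA_START
        self._to_pua = {}    # cluster string -> private code point
        self._from_pua = {}  # private code point -> cluster string

    def register(self, cluster):
        """Assign (or look up) a private code point for a cluster."""
        if cluster in self._to_pua:
            return self._to_pua[cluster]
        if self._next > PUA_END:
            raise OverflowError("BMP private use area exhausted")
        cp = self._next
        self._next += 1
        self._to_pua[cluster] = cp
        self._from_pua[cp] = cluster
        return cp

    def fold(self, text, clusters):
        """Replace each listed cluster with its private code point."""
        for c in clusters:
            text = text.replace(c, chr(self.register(c)))
        return text

    def unfold(self, text):
        """Recover the original clusters from a folded string."""
        return "".join(self._from_pua.get(ord(ch), ch) for ch in text)

reg = ClusterRegistry()
folded = reg.fold("e\u0301tude", ["e\u0301"])  # e + combining acute
print(len("e\u0301tude"), len(folded))  # 6 code points before, 5 after
```

Once a string has been folded this way, ordinary `str` indexing *is*
grapheme indexing, and `unfold()` recovers the original text.  The
care mentioned above about input that already uses private space
would have to be applied before `register()` hands out code points.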
This has a few advantages: (1) since these are real characters, all
Unicode algorithms will apply as long as the appropriate properties
are applied to the character in the database, and (2) it works with a
narrow code unit (specifically, UCS-2, but it could also be used with
UTF-8).  If you really need more than 6400 grapheme clusters, promote
to UTF-32, and get two more whole planes full (about 130,000 code
points).

 > I didn't attribute any efficiency to flagging lone surrogates
 > (BI-5).  Since Windows uses a non-validated UCS-2 or UTF-16
 > character type, any Python program that obtains data from Windows
 > APIs may be confronted with lone surrogates or inappropriate
 > combining characters at any time.

I don't think so.  AFAIK all that data must pass through a codec,
which will validate it unless you specifically tell it not to.

 > Round-tripping that data seems useful,

The standard doesn't forbid that.  (ISTR it did so in the past, but
what is required in 6.0 is a specific algorithm for identifying
well-formed portions of the text, basically "if you're currently in an
invalid region, read individual code units and attempt to assemble a
valid sequence -- as soon as you do, that is a valid code point, and
you switch into valid state and return to the normal algorithm".)
Specifically, since surrogates are not characters, leaving them in the
data does not constitute "interpreting them as characters."  I don't
recall if any of the error handlers allow this, though.

 > However, returning modified forms of it to Windows as UCS-2 or
 > UTF-16 data may still cause other applications to later
 > accidentally combine the characters, if the modifications
 > juxtaposed things to make them look reasonable, even if
 > accidentally.

In CPython, AFAIK (I don't do Windows), this can only happen if you
use a non-default error setting in the output codec.
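The codec claim is easy to check from Python 3; this snippet uses
only the standard `str.encode`/`bytes.decode` machinery (nothing
hypothetical) to show the default strict handler rejecting a lone
surrogate, and the opt-in `surrogatepass` handler round-tripping it:

```python
# A lone (unpaired) high surrogate -- not a valid character.
lone = "\ud800"

# The default ('strict') error handler validates: this raises.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict codec rejected the lone surrogate")

# Round-tripping must be requested explicitly.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)                                        # b'\xed\xa0\x80'
assert raw.decode("utf-8", "surrogatepass") == lone
```

(The related `surrogateescape` handler, from PEP 383, smuggles
undecodable *bytes* through as low surrogates; both handlers are
non-default, which is the point here.)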
 > After writing all those ideas down, I actually preferred some of
 > the others, that achieved O(1) real grapheme indexing, rather than
 > caching character properties.

If you need O(1) grapheme indexing, use of private space seems a
winner to me.  It's just defining private precombined characters, and
they won't bother any Unicode application, even if they leak out.

 > > What are the costs to applications that don't want the cache?
 > > How is the bit-cache affected by PEP 393?
 >
 > If it is a separate type from str, then it costs nothing except the
 > extra code space to implement the cache for those applications that
 > do want it... most of which wouldn't be loaded for applications
 > that don't, if done as a module or C extension.

I'm talking about the bit-cache (which all of your BI-N referred to,
at least indirectly).  Many applications will want to work with fully
composed characters, whether they're represented in a single code
point or not.  But they may not care about any of the bit-cache ideas.

 > OK... ignore the bit-cache idea (BI-1), and reread the others
 > without having your mind clogged with that one, and see if any of
 > them make sense to you then.  But you may be too biased by the
 > "minor" needs of keeping the internal representation similar to the
 > stream representation to see any value in them.

No, I'm biased by the fact that I already know good ways to do them
without leaving the set of representations provided by Unicode (often
ways which provide additional advantages), and by the fact that I
myself don't know any use cases for the bit-cache yet.

 > I rather like BI-2, since it allows O(1) indexing of graphemes.

I do too (without suggesting a non-standard representation, i.e.,
using private space), but I'm sure that wheel has been reinvented
quite frequently.
It's a very common trick in text processing, although I don't know of
other applications where it's specifically used to turn data that
"fails to be an array just a little bit" into a true array (although I
suppose you could view fixed-width EUC encodings that way).

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev