On 7/24/2013 2:15 PM, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 3:52 AM, Terry Reedy <tjre...@udel.edu> wrote:

>> For my purpose, the mock Text works the same in 2.7 and 3.3+.

> Thanks for that report! And yes, it's going to behave exactly the same
> way, because its underlying structure is an ordered list of ordered
> lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
> performance. But if you put your code onto a narrow build, you'll have
> issues as seen below.

I carefully said 'For my purpose', which is to replace the tk Text widget. Up to 8.5, Tk's text is something like Python's narrow-build unicode.

If astral chars were put into the toy editor, then yes, it would not work on narrow builds, but it would on 3.3+.

 ...

> If nobody had ever thought of doing a multi-format string
> representation, I could well imagine the Python core devs debating
> whether the cost of UTF-32 strings is worth the correctness and
> consistency improvements... and most likely concluding that narrow
> builds get abolished. And if any other language (eg ECMAScript)
> decides to move from UTF-16 to UTF-32, I would wholeheartedly support
> the move, even if it broke code to do so.

Making a UTF-16 implementation correct requires converting abstract 'character' array indexes to concrete double-byte array indexes. The simple O(n) method of scanning the string from the beginning for each index operation is too slow. When PEP 393 was being discussed, I devised a much faster way to do the conversion.

The key idea is to add an auxiliary array containing the abstract indexes of the astral chars. It is easily built when the string is created, and it can also be built afterward with one linear scan (which is how I experimented with Python code). The length of that array is the number of surrogate pairs in the concrete array of 16-bit code units. Subtracting that number from the length of the concrete array gives the length of the abstract array.
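
Something like the following builds the auxiliary array with one pass over the concrete code units (a sketch of the idea, not the code I actually experimented with; build_aux is just an illustrative name):

def build_aux(code_units):
    # Return the sorted list of abstract (character) indexes of the
    # astral characters, given the concrete array of 16-bit code units.
    aux = []
    abstract_i = 0
    unit_i = 0
    n = len(code_units)
    while unit_i < n:
        u = code_units[unit_i]
        if (0xD800 <= u <= 0xDBFF and unit_i + 1 < n
                and 0xDC00 <= code_units[unit_i + 1] <= 0xDFFF):
            aux.append(abstract_i)  # surrogate pair == one astral char
            unit_i += 2
        else:
            unit_i += 1             # BMP char (or a lone, defective surrogate)
        abstract_i += 1
    return aux

For 'A', U+1F600, 'B' the code units are [0x0041, 0xD83D, 0xDE00, 0x0042], build_aux returns [1], and the abstract length is 4 - 1 = 3.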

Given a target index of a character in the abstract array, use the auxiliary array to determine k, the number of astral characters that precede the target character. That can be done with either an O(k) linear scan or a binary search that is logarithmic in the length of the auxiliary array. Add k to the abstract index to get the corresponding index in the concrete array, since each preceding astral character occupies one extra code unit. When slicing a string with i0 and i1, slice the auxiliary array with the corresponding k0 and k1 and adjust the retained indexes downward by i0 to get the auxiliary array for the new string.
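
In Python terms, with the bisect module doing the binary search (again a sketch; concrete_index and slice_aux are names made up for illustration):

import bisect

def concrete_index(i, aux):
    # k = number of astral characters strictly before abstract index i,
    # found by binary search in the sorted auxiliary array.
    k = bisect.bisect_left(aux, i)
    return i + k  # each preceding astral char adds one extra code unit

def slice_aux(i0, i1, aux):
    # Auxiliary array for the abstract slice [i0:i1): keep the entries
    # in range and shift them down relative to the new string's start.
    k0 = bisect.bisect_left(aux, i0)
    k1 = bisect.bisect_left(aux, i1)
    return [i - i0 for i in aux[k0:k1]]

With aux == [1] as above, concrete_index(2, [1]) == 3, which is where 'B' starts in the code unit array.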

> To my mind, exposing UTF-16 surrogates to the application is a bug
> to be fixed, not a feature to be maintained.

It is definitely not a feature, but a proper UTF-16 implementation would not expose them except to codecs, just as with the PEP 393 implementation. (In both cases, I am excluding the sys size function as 'exposing to the application'.)

> But since we can get the best of both worlds with only
> a small amount of overhead, I really don't see why anyone should be
> objecting.

I presume you are referring to the PEP 393 1-2-4 byte implementation. Given how well it has been optimized, I think it was the right choice for Python. But a language that now uses UCS-2 or defective UTF-16 on all platforms might find the auxiliary array an easier fix.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
