On 7/28/2013 11:52 AM, Michael Torrie wrote:
> 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow,
Not necessarily so. See below.
> and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups.
Slicing is at least O(m), where m is the length of the slice. With UTF-8, UTF-16, or any other variable-width encoding it is always O(n), because finding the code unit where the slice starts requires scanning from the beginning of the string.
I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
This uses an auxiliary array of k ints. An auxiliary array of n ints would make UTF-16 lookup O(1), but then one is using more space than with UTF-32. Similar comments apply to UTF-8.
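Here is a minimal sketch of that idea in Python (my illustration here, not the exact code from the earlier post). The UTF-16 code units are simulated as a list of ints, the auxiliary array holds the character positions of the k non-BMP chars, and bisect does the O(log2 k) conversion from character index to code-unit offset.

from bisect import bisect_left

def build_utf16_index(s):
    """Encode s as a list of UTF-16 code units plus an auxiliary,
    sorted list of the character positions of the non-BMP chars."""
    units = []   # one int per UTF-16 code unit
    aux = []     # character indexes of chars needing surrogate pairs
    for pos, ch in enumerate(s):
        cp = ord(ch)
        if cp > 0xFFFF:                          # non-BMP character
            aux.append(pos)
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units, aux

def char_at(units, aux, i):
    """Code point at character index i, in O(log2 k) for k == len(aux).
    Each non-BMP char before position i adds one extra code unit."""
    offset = i + bisect_left(aux, i)
    unit = units[offset]
    if 0xD800 <= unit <= 0xDBFF:     # high surrogate: combine with low
        return 0x10000 + ((unit - 0xD800) << 10) + (units[offset + 1] - 0xDC00)
    return unit

units, aux = build_utf16_index('a\U0001F600b')
assert chr(char_at(units, aux, 1)) == '\U0001F600'
assert chr(char_at(units, aux, 2)) == 'b'

Slicing s[i:j] then costs two such index conversions plus the O(m) copy of the code units in between.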
The Unicode standard says that a single string should use exactly one coding scheme. It does *not* say that all strings in an application must use the same scheme. I just rechecked a few days ago. It also does not say that an application cannot associate additional data with a string to make processing of the string easier.
--
Terry Jan Reedy
--
http://mail.python.org/mailman/listinfo/python-list