Steven D'Aprano:

Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only ever want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, we
can ignore the factor of 1/2 and just say we scan the string, O(N).

In the majority of cases you can avoid excessive scanning by caching the most recent index->offset result. If the next index requested is nearer to the cached index than to the beginning of the string, iterate from the cached offset instead of from the start. This converts many operations from quadratic to linear. Locality of reference is common and can often be reasonably exploited.
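
A minimal sketch of that caching idea, in Python. The class name CachedUTF8, its method names, and the steps counter are made up for illustration; this is not how any particular implementation does it, just one way the index->offset cache could look over a UTF-8 byte buffer.

```python
class CachedUTF8:
    """UTF-8 bytes plus a one-entry cache of the last (index, offset) pair.

    Hypothetical illustration only; names are not from any real library.
    """

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.last_index = 0   # code point index of the cached position
        self.last_offset = 0  # byte offset of that code point
        self.steps = 0        # code points stepped over (for demonstration)

    def _forward(self, offset):
        # Skip the lead byte, then any continuation bytes (0b10xxxxxx).
        offset += 1
        while offset < len(self.data) and (self.data[offset] & 0xC0) == 0x80:
            offset += 1
        return offset

    def _backward(self, offset):
        # Step back to the lead byte of the previous code point.
        offset -= 1
        while (self.data[offset] & 0xC0) == 0x80:
            offset -= 1
        return offset

    def offset_of(self, index):
        if index < 0:
            raise IndexError(index)
        # Start from whichever known position is nearer: the cache or the start.
        if abs(index - self.last_index) < index:
            i, off = self.last_index, self.last_offset
        else:
            i, off = 0, 0
        while i < index:
            if off >= len(self.data):
                raise IndexError(index)
            off = self._forward(off)
            i += 1
            self.steps += 1
        while i > index:
            off = self._backward(off)
            i -= 1
            self.steps += 1
        if off >= len(self.data):
            raise IndexError(index)
        # Remember this position for the next request.
        self.last_index, self.last_offset = i, off
        return off

    def __getitem__(self, index):
        start = self.offset_of(index)
        return self.data[start:self._forward(start)].decode("utf-8")
```

With sequential access such as "for i in range(n): s[i]", each lookup after the first steps over a single code point, so the whole loop is linear rather than quadratic. Random access far from the cached position still costs a scan.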

However, exposing the variable-length nature of UTF-8 allows the application to choose efficient techniques for more cases.

   Neil
--
http://mail.python.org/mailman/listinfo/python-list
