Re: [Python-Dev] Internal representation of strings and Micropython

Steve Dower Wed, 04 Jun 2014 08:34:21 -0700

Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing the behaviour under a different (hypothetical) design decision 
than a 1:1 relationship between code points and indexes, so arguing from that 
stance doesn't make much sense.

> e.g.
> 
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
> 
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being 
incorrect for the goal, not the language semantics). What you'd probably find 
is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is 
also correct.
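
A rough sketch of that point, assuming the hypothetical internal representation is UTF-8. sizeof() above is shorthand rather than a real builtin, so the encoded length stands in for it here:

```python
s = "---ÿ---"
offset = s.index("ÿ")

# Code-point semantics: the index round-trips whatever the storage is.
assert s[offset] == "ÿ"

# Assumption: strings are stored as UTF-8. 'ÿ' (U+00FF) needs two bytes,
# so the stored size exceeds the code-point count by one.
encoded_size = len(s[offset].encode("utf-8"))
total_bytes = len(s.encode("utf-8"))

print(offset, encoded_size, len(s), total_bytes)
```

The assertion holds either way; only the storage size differs from the length in code points.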

But what are you trying to achieve (why are you writing this code)? All this 
example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use 
indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always)
> and the last (if possible) and as many as possible in between. How are you
> going to write that? With PEP 393 or UTF-32 strings, you can simply record
> the index of every whitespace you find, count off lengths, and decide what
> to keep and what to ellipsize.

"Recording the index" is where the optimization comes in. With a 
variable-length encoding - heck, even with a fixed-length one - I'd just use 
str.split(' ') (or re.split('\\s', string), depending on how much I care about 
the type of delimiter) and manipulate the list.
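
As a minimal sketch of the split-and-manipulate approach (the function name abbreviate, the max_len parameter and the "..." marker are illustrative assumptions, not anything from Chris's message, and this is only one reading of "as many as possible in between"):

```python
def abbreviate(title, max_len, ellipsis="..."):
    """Hypothetical helper: keep the first word always, the last word if
    it fits, and as many leading middle words as fit within max_len."""
    words = title.split()  # splits on any run of whitespace
    full = " ".join(words)
    if len(words) < 2 or len(full) <= max_len:
        return full

    first, last = words[0], words[-1]
    if len(" ".join([first, ellipsis, last])) > max_len:
        # Not even first + last fits: keep the first word regardless.
        return first + " " + ellipsis

    kept = [first]
    for word in words[1:-1]:
        candidate = " ".join(kept + [word, ellipsis, last])
        if len(candidate) > max_len:
            break
        kept.append(word)
    return " ".join(kept + [ellipsis, last])


print(abbreviate("The Quick Brown Fox Jumps Over The Lazy Dog", 25))
```

Nothing here records or uses a character index; lengths are only compared, so the code does not depend on indexing being O(1).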

If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', 
string) also provides the same behaviour and gives me the sliced string, so 
there's no need to index for anything.
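
A small sketch of the lazy variant, with iter_words as a hypothetical helper name. Each match already carries the sliced word, so the caller never indexes into the original string:

```python
import re


def iter_words(text):
    """Yield whitespace-delimited words one at a time, without building
    an intermediate list of all of them."""
    for match in re.finditer(r"\S+", text):
        yield match.group()


title = "  a long   title with   uneven spacing "
words = iter_words(title)
first_word = next(words)      # consumes only the first match
remaining = sum(1 for _ in words)  # walks the rest lazily

print(first_word, remaining)
```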

The downside is that it isn't as easy to teach as the 1:1 relationship, and 
currently it doesn't perform as well *in CPython*. But if MicroPython is 
focusing on size over speed, I don't see any reason why they shouldn't permit 
different performance characteristics and require a slightly different approach 
to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the 
Python interpreter ecosystem. Jython and IronPython already have different 
string implementations from CPython - having official (and hopefully flexible) 
guidance on deviations from the reference implementation would, I think, help 
other implementations provide even more value, which is only a good thing for 
Python.

Cheers,
Steve