Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?
We're discussing the behaviour under a different (hypothetical) design decision
than a 1:1 relationship between code points and indexes, so arguing from that
stance doesn't make much sense.
> e.g.
>
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
>
> That cannot fail with Python's semantics.
Agreed, and it shouldn't (I was actually referring to the optimization being
incorrect for the goal, not the language semantics). What you'd probably find
is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is
also correct.
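To make that concrete, here is a minimal sketch. It assumes the hypothetical internal representation is UTF-8, and it uses the encoded byte length as a stand-in for the "sizeof" above (neither is something an implementation has committed to). The code point index and the byte offset are separate quantities, and the assertion from the quoted example holds either way:

```python
s = "---ÿ---"
offset = s.index("ÿ")

# Language-level semantics: one index per code point.
assert s[offset] == "ÿ"

# Assumption: a UTF-8 internal buffer. 'ÿ' (U+00FF) takes two bytes there,
# so the buffer is one byte longer than the string is in code points.
char_size = len("ÿ".encode("utf-8"))
indexed_size = len(s[offset].encode("utf-8"))
byte_offset = len(s[:offset].encode("utf-8"))
total_bytes = len(s.encode("utf-8"))
```

Here char_size and indexed_size are both 2, while len(s) is 7 and total_bytes is 8.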
But what are you trying to achieve (why are you writing this code)? All this
example really shows is that you're only using indexing for trivial purposes.
Chris's example of an actual case where it may look like a good idea to use
indexing for optimization makes this more obvious IMHO:
Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always)
> and the last (if possible) and as many as possible in between. How are you
> going to write that? With PEP 393 or UTF-32 strings, you can simply record
> the index of every whitespace you find, count off lengths, and decide what to
> keep and what to ellipsize.
"Recording the index" is where the optimization comes in. With a
variable-length encoding - heck, even with a fixed-length one - I'd just use
str.split(' ') (or re.split('\\s', string), depending on how much I care about
the type of delimiter) and manipulate the list.
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+',
string) also provides the same behaviour and gives me the sliced string, so
there's no need to index for anything.
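As a rough sketch of what I mean (the function names, the max_len parameter and the "..." marker are all my own choices for illustration, not anything from Chris's description), the abbreviation can be written purely in terms of whole words, with no recorded offsets. I've used split() with no argument so that any run of whitespace counts as a delimiter:

```python
import re


def iter_words(title):
    """Yield each whitespace-delimited word without building a list."""
    for match in re.finditer(r"\S+", title):
        yield match.group()


def abbreviate(title, max_len, ellipsis="..."):
    """Keep the first word, the last if it fits, and as many between as fit."""
    words = title.split()
    full = " ".join(words)
    if len(words) < 2 or len(full) <= max_len:
        # Nothing to drop (a single word is always kept, even if too long).
        return full

    first, last = words[0], words[-1]
    if len(f"{first} {ellipsis} {last}") > max_len:
        # The last word does not fit; the first word is kept regardless.
        return f"{first} {ellipsis}"

    kept = [first]
    for word in words[1:-1]:
        candidate = " ".join(kept + [word, ellipsis, last])
        if len(candidate) > max_len:
            break
        kept.append(word)
    return " ".join(kept + [ellipsis, last])


example = abbreviate("The quick brown fox jumps over the lazy dog", 25)
```

With these inputs, example comes out as "The quick brown ... dog". Lengths here are counted in code points via len(), which is the same on any implementation regardless of how the string is stored.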
The downside is that it isn't as easy to teach as the 1:1 relationship, and
currently it doesn't perform as well *in CPython*. But if MicroPython is
focusing on size over speed, I don't see any reason why they shouldn't permit
different performance characteristics and require a slightly different approach
to highly-optimized coding.
In any case, this is an interesting discussion with a genuine effect on the
Python interpreter ecosystem. Jython and IronPython already have different
string implementations from CPython - having official (and hopefully flexible)
guidance on deviations from the reference implementation would, I think, help
other implementations provide even more value, which is only a good thing for
Python.
Cheers,
Steve
_______________________________________________
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev