Jim Jewett added the comment:
I think the new wording is an improvement, but keeping the changes minimal left
it in an awkward in-between state.
Proposal:
A string is a sequence of Unicode code points. Strings can include any
sequence of code points, including some which are semantically
Jim Jewett added the comment:
And even my rewrite showed path dependency; a slight further improvement is to
re-order encoding ahead of bytes. I also added a paragraph that I hope answers
the speed issue.
Proposal:
A string is a sequence of Unicode code points. Strings can include any
Roundup Robot added the comment:
New changeset 6ffb6909c439 by Nick Coghlan in branch '3.4':
Issue #21667: Clarify string data model description
http://hg.python.org/cpython/rev/6ffb6909c439
New changeset 7c120e77d6f7 by Nick Coghlan in branch 'default':
Merge issue #21667 from 3.4
Nick Coghlan added the comment:
I've merged the character-code point clarifications, without the
implementation detail section.
For the time being, that leaves "doesn't provide O(1) indexing of strings" as
the kind of discrepancy that often makes an appearance in differences from the
CPython
New submission from Nick Coghlan:
Based on the recent python-dev thread, I propose the following CPython
implementation detail note in the Strings entry of
https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
CPython currently guarantees O(1) access to arbitrary code
STINNER Victor added the comment:
str[a:b] returns a substring (characters), not an array of code points
(numbers).
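
A minimal sketch of that distinction, using only built-ins. The sample string and variable names are illustrative assumptions and do not come from the thread:

```python
# Illustrative sample string ("naïve" with a precomposed ï, U+00EF).
text = "na\u00efve"

# Slicing a str yields another str (a substring), not a list of numbers.
substring = text[1:4]

# The numeric code points have to be requested explicitly with ord().
code_points = [ord(ch) for ch in substring]

print(type(substring).__name__)
print(code_points)
```

Each element of the slice is itself a length 1 str, which is why ord() is needed to get at the underlying numbers.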
--
nosy: +haypo
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
Nick Coghlan added the comment:
Guido, I think we need your call on whether or not to add a note about string
indexing algorithmic complexity to the language reference, and to approve the
exact wording of such a note (my proposed wording is in my initial comment on
this issue).
Nick Coghlan added the comment:
No, Python doesn't expose Unicode characters in its data model at all, except
in those cases where a code point happens to correspond directly with a
character. A length 1 str instance represents a Unicode code point, not a
Unicode character.
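
A small sketch of what that means in practice, assuming a Python 3 interpreter. The variable names are made up for illustration:

```python
import unicodedata

# "é" written as a base letter plus COMBINING ACUTE ACCENT (U+0301):
# one user-perceived character, but two code points.
decomposed = "e\u0301"

# NFC normalisation composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1

# Indexing returns a length 1 str holding one code point: the bare "e".
first = decomposed[0]
print(first == "e")     # True
```

Both strings would usually render identically, yet len() and indexing see them differently because they operate on code points.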
Nick Coghlan added the comment:
Although, you're right, that section of the data model docs misuses the word
character to mean something other than what it means in the Unicode spec :(
STINNER Victor added the comment:
Python implementations are required to ...
By the way, Python 3.3 doesn't implement this requirement :-)
Nick Coghlan added the comment:
Saying that ord() and chr() switch between characters and code points is just
plain wrong, since characters may be represented as multiple code points.
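
To illustrate, here is a hedged sketch in which the flag example and the variable names are assumptions chosen for the demonstration:

```python
# A single code point round-trips through ord() and chr().
single = "\u00e9"  # precomposed "é"
round_trip = chr(ord(single))

# Two REGIONAL INDICATOR SYMBOL code points (F and R), which renderers
# typically display as one flag. ord() rejects any str of length != 1.
flag = "\U0001F1EB\U0001F1F7"
try:
    ord(flag)
    accepted = True
except TypeError:
    accepted = False

print(round_trip == single)
print(accepted)
```

ord() and chr() therefore map between a length 1 str and an integer code point. Anything a user perceives as one character but that spans several code points falls outside them.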
We may also want to explicitly note that the Unicode normalisation is
implementation dependent, and that
Nick Coghlan added the comment:
Right, narrow builds have long been broken - that's a large part of why this is
now the requirement :)
Nick Coghlan added the comment:
Patch attached that also addresses the characters vs code points confusion.
--
Added file:
http://bugs.python.org/file35489/issue21667_clarify_str_specification.rst
Nick Coghlan added the comment:
I ducked the Unicode normalisation question for now, since that's a *different*
can of worms :)
Antoine Pitrou added the comment:
Two things:
- I don't think it's very helpful to use the term "code point" without
explaining or introducing it ("character" at least can be understood
intuitively)
- The mention of slicing is ambiguous: is slicing supposed to be O(1)? how is
indexing related to
Nick Coghlan added the comment:
If someone doesn't understand what "Unicode code point" means, that's going to
be the least of their problems when it comes to implementing a conformant
Python implementation. We could link to
http://unicode.org/glossary/#code_point, but that doesn't really add
Antoine Pitrou added the comment:
Not sure what "implementing a conformant Python implementation" has to do with
this; the language specification should be readable by any interested
programmers, IMO.
If you try to dive into the formal Unicode spec instead, you end up
in a twisty maze of
Serhiy Storchaka added the comment:
Then perhaps we need notes about algorithmic complexity of bytes, bytearray,
list and tuple and dict indexing, set.add and set.discard, dict.__delitem__,
list.pop, len(), + and += for all basic sequences and containers, memoryview()
for bytes, bytearray and
STINNER Victor added the comment:
Then perhaps we need notes about algorithmic complexity of bytes, bytearray,
list and tuple and dict indexing, set.add and set.discard, dict.__delitem__,
list.pop, len(), + and += for all basic sequences and containers,
memoryview() for bytes, bytearray
Changes by Chris Angelico ros...@gmail.com:
--
nosy: +Rosuav
Guido van Rossum added the comment:
I don't want the O(1) property explicitly denounced in the reference manual.
It's fine if the manual is silent on this -- maybe someone can prove that it
isn't a problem based on benchmarks of an alternate implementation, but until
then, I'm skeptical --