[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-10 Thread Jim Jewett

Jim Jewett added the comment:

I think the new wording is an improvement, but keeping the changes minimal left 
it in an awkward in-between state.

Proposal:

A string is a sequence of Unicode code points.  Strings can include any 
sequence of code points, including some which are semantically meaningless, or 
explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented 
as a string of length ``1``.  The built-in function :func:`chr` translates an 
integer in the range ``U+ - U+10`` to the corresponding length ``1`` 
string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (as :class:`bytes` in the 
given text encoding) suitable for transport and communication with non-Python 
utilities.  :meth:`bytes.decode` decodes such byte sequences into text strings.

--
nosy: +Jim.Jewett

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-10 Thread Jim Jewett

Jim Jewett added the comment:

And even my rewrite showed path dependency; a slight further improvement is to 
re-order encoding ahead of bytes.  I also added a paragraph that I hope answers 
the speed issue.

Proposal:

A string is a sequence of Unicode code points.  Strings can include any 
sequence of code points, including some which are semantically meaningless, or 
explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented 
as a string of length ``1``.  The built-in function :func:`chr` translates an 
integer in the range ``U+ - U+10`` to the corresponding length ``1`` 
string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (in the given text 
encoding) as a :class:`bytes` object suitable for transport and communication 
with non-Python utilities.  :meth:`bytes.decode` decodes such byte sequences 
into text strings.

.. impl-detail::  There are no methods exposing the internal representation of 
code points within a string.  While the C-API provides some additional 
constraints on CPython, other implementations are free to use any 
representation that treats code points (as opposed to either code units or some 
normalized form of characters) as the unit of measure.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-07 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 6ffb6909c439 by Nick Coghlan in branch '3.4':
Issue #21667: Clarify string data model description
http://hg.python.org/cpython/rev/6ffb6909c439

New changeset 7c120e77d6f7 by Nick Coghlan in branch 'default':
Merge issue #21667 from 3.4
http://hg.python.org/cpython/rev/7c120e77d6f7

--
nosy: +python-dev

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-07 Thread Nick Coghlan

Nick Coghlan added the comment:

I've merged the character-code point clarifications, without the 
implementation detail section.

For the time being, that leaves doesn't provide O(1) indexing of strings as 
the kind of discrepancy that often makes an appearance in differences from the 
CPython reference implementation section that many alternative implementations 
include.

--
resolution:  - later
stage:  - resolved
status: open - closed
type:  - enhancement

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

New submission from Nick Coghlan:

Based on the recent python-dev thread, I propose the following CPython 
implementation detail note in the Strings entry of 
https://docs.python.org/3/reference/datamodel.html#objects-values-and-types

CPython currently guarantees O(1) access to arbitrary code points when 
indexing and slicing a string. Python implementations are required to index and 
slice strings as arrays of code points, but are not required to guarantee O(1) 
access to arbitrary locations within the string. This allows implementations to 
use variable width encodings for their internal string representation.

--
assignee: docs@python
components: Documentation
messages: 219793
nosy: docs@python, ncoghlan
priority: normal
severity: normal
status: open
title: Clarify status of O(1) indexing semantics of str objects
versions: Python 3.4, Python 3.5

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread STINNER Victor

STINNER Victor added the comment:

str[a:b] returns a substring (characters), not an array of code points 
(numbers).

--
nosy: +haypo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Guido, I think we need your call on whether or not to add a note about string 
indexing algorithmic complexity to the language reference, and to approve the 
exact wording of such a note (my proposed wording is in my initial comment on 
this issue).

--
nosy: +gvanrossum

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

No, Python doesn't expose Unicode characters in its data model at all, except 
in those cases where a code point happens to correspond directly with a 
character. A length 1 str instance represents a Unicode code point, not a 
Unicode character.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Although, you're right, that section of the data model docs misuses the word 
character to mean something other than what it means in the Unicode spec :(

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread STINNER Victor

STINNER Victor added the comment:

Python implementations are required to ...

By the way, Python  3.3 doesn't implement this requirement :-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Saying that ord() and chr() switch between characters and code points is just 
plain wrong, since characters may be represented as multiple code points.

We may also want to explicitly note that the Unicode normalisation is 
implementation dependendent, and that CPython doesn't normalise implicitly 
except where specifically noted (i.e. during lexical analysis).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Right, narrow builds have long been broken - that's a large part of why this is 
now the requirement :)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

Patch attached that also addresses the characters vs code points confusion.

--
Added file: 
http://bugs.python.org/file35489/issue21667_clarify_str_specification.rst

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

I ducked the Unicode normalisation question for now, since that's a *different* 
can of worms :)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Two things:
- I don't think it's very helpful to use the term code point without 
explaining or introducing it (character at least can be understood 
intuitively)
- The mention of slicing is ambiguous: is slicing suppoded to be O(1)? how is 
indexing related to slicing?

--
nosy: +pitrou

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Nick Coghlan

Nick Coghlan added the comment:

If someone doesn't understand what Unicode code point means, that's going to 
be the least of their problems when it comes to implementing a conformant 
Python implementation. We could link to 
http://unicode.org/glossary/#code_point, but that doesn't really add much 
beyond value from 0 to 0x10. If you try to dive into the formal Unicode 
spec instead, you end up in a twisty maze of definitions of things that are all 
closely related, but generally not the same thing (code positions, code units, 
code spaces, abstract characters, glyphs, graphemes, etc).

The main advantage of using the more formal code point over the informal 
character is that it discourages people from assuming they know what they are 
(with the usual mistaken assumption being that Unicode code points correspond 
directly to glyphs the way ASCII and Extended ASCII printable characters 
correspond to their glyphs). The rest of the paragraph then provides the 
mechanical details of the meaningful interpretations of them in Python (as 
length 1 strings and as numbers in a particular range) and the operations for 
translating between those two formats (chr and ord).

Fair point about the slicing - it may be better to just talk about indexing.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Not sure what implementing a conformant Python implementation has to do with 
this; the language specification should be readable by any interested 
programmers, IMO.

 If you try to dive into the formal Unicode spec instead, you end up
 in a twisty maze of definitions of things that are all closely 
 related, but generally not the same thing (code positions, code units, 
 code spaces, abstract characters, glyphs, graphemes, etc).

That makes all the less useful to use the proper term instead of the more 
intuitive alternative :-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Then perhaps we need notes about algorithmic complexity of bytes, bytearray, 
list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, 
list.pop, len(), + and += for all basic sequences and containers, memoryview() 
for bytes, bytearray and array.array, etc, etc.

--
nosy: +serhiy.storchaka

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread STINNER Victor

STINNER Victor added the comment:

 Then perhaps we need notes about algorithmic complexity of bytes, bytearray, 
 list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, 
 list.pop, len(), + and += for all basic sequences and containers, 
 memoryview() for bytes, bytearray and array.array, etc, etc.

That would be awesome :-) Please open a new issue for that. We can use
for example these data:
https://wiki.python.org/moin/TimeComplexity

Such data should be close to the definition of the type, or even close
to the method. Maybe directly in the documentation of each method?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Chris Angelico

Changes by Chris Angelico ros...@gmail.com:


--
nosy: +Rosuav

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue21667] Clarify status of O(1) indexing semantics of str objects

2014-06-05 Thread Guido van Rossum

Guido van Rossum added the comment:

I don't want the O(1) property explicitly denounced in the reference manual. 
It's fine if the manual is silent on this -- maybe someone can prove that it 
isn't a problem based on benchmarks of an alternate implementation, but until 
then, I'm skeptical -- after all we have a bunch of important APIs (indexing, 
slicing, find()/index(), the re module) that use integer indexes, and some 
important algorithms/patterns are based off this behavior.

E.g. searching a string for something, returning the position where it's found, 
and then continuing the search from that position. Even if the basic search 
uses find() or regex matching, there's still a position being returned and 
accepted, and if it took O(N) time to find that position in the representation, 
the whole algorithm could degenerate from O(N) to O(N**2).

I am fine with the changes related to code points.

For the pedants amongst us, surrogates are also code points. A surrogate pair 
is two code points that encode a single code point. Fortunately we don't have 
to deal with those any more outside codecs.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21667
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com