Re: [Python-ideas] discontinue iterable strings

2016-08-22 Thread Nick Coghlan
On 22 August 2016 at 19:47, Stephen J. Turnbull wrote:
> Nick Coghlan writes:
>
>  > However, the real problem with this proposal (and the reason why the
>  > switch from 8-bit str to "bytes are effectively a tuple of ints" in
>  > Python 3 was such a pain), is that there are a lot of bytes and text
>  > processing operations that *really do* operate code point by code
>  > point.
>
> Sure, but code points aren't strings in any language I use except
> Python.  And AFAIK strings are the only case in Python where a
> singleton *is* an element, and an element *is* a singleton.

Sure, but the main concern at hand ("list(strobj)" giving a broken out
list of individual code points rather than TypeError) isn't actually
related to the fact those individual items are themselves length-1
strings, it's related to the fact that Python normally considers
strings to be a sequence type rather than a scalar value type.
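To make that concern concrete, here is a minimal sketch of the
behaviour in question. The helper name "tag_all" is invented purely
for illustration; it stands in for any function that expects an
iterable of strings.

```python
def tag_all(names):
    """Hypothetical helper that expects an iterable of names."""
    return [f"<{name}>" for name in names]


# Intended use: a list of strings.
print(tag_all(["spam", "eggs"]))   # ['<spam>', '<eggs>']

# Accidental use: a bare str is silently treated as a sequence of
# length-1 strings instead of raising TypeError.
print(tag_all("spam"))             # ['<s>', '<p>', '<a>', '<m>']

# An int, by contrast, is a scalar and fails loudly.
try:
    tag_all(42)
except TypeError as exc:
    print("TypeError:", exc)
```

The second call is the surprising one: nothing about the length-1
results matters here, only that str took part in iteration at all.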

str is far from the only builtin container type that NumPy gives the
scalar treatment when sticking it into an array:

>>> np.array("abc")
array('abc', dtype='<U3')
>>> np.array(b"abc")
array(b'abc', dtype='|S3')
>>> np.array({1, 2, 3})
array({1, 2, 3}, dtype=object)
>>> np.array({1:1, 2:2, 3:3})
array({1: 1, 2: 2, 3: 3}, dtype=object)

(Interestingly, both bytearray and memoryview get interpreted as
"uint8" arrays, unlike the bytes literal. Presumably that
discrepancy is a requirement for compatibility with NumPy's
str/unicode handling in Python 2.)

That's why I suggested that a scalar proxy based on wrapt.ObjectProxy
that masked all container related protocols could be an interesting
future addition to the standard library (especially if it has been
battle-tested on PyPI first). "I want to take this container instance,
and make it behave like it wasn't a container, even if other code
tries to use it as a container" is usually what people are after when
they find str iteration inconvenient, but "treat this container as a
scalar value, but otherwise expose all of its methods" is an operation
with applications beyond strings.
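As a rough illustration of the idea, the sketch below is a
hand-rolled stand-in written in plain Python. It does not use
wrapt.ObjectProxy, and the class name "ScalarProxy" and its helper
"_refuse" are assumptions made up for this example, not an existing
API. It forwards ordinary attribute access to the wrapped object
while refusing the container protocols.

```python
class ScalarProxy:
    """Hypothetical sketch: expose an object's methods, hide its container-ness."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so regular methods such
        # as .upper() are forwarded to the wrapped object.
        if name == "_wrapped":
            # Avoid infinite recursion if __init__ has not run yet.
            raise AttributeError(name)
        return getattr(self._wrapped, name)

    def _refuse(self, *args, **kwargs):
        raise TypeError(
            f"scalar proxy for {type(self._wrapped).__name__!r} "
            "does not support container operations"
        )

    # Mask the container-related protocols.
    __iter__ = __len__ = __getitem__ = __contains__ = __reversed__ = _refuse

    # Special methods are looked up on the type, so forward them explicitly.
    def __str__(self):
        return str(self._wrapped)

    def __repr__(self):
        return f"ScalarProxy({self._wrapped!r})"

    def __eq__(self, other):
        if isinstance(other, ScalarProxy):
            other = other._wrapped
        return self._wrapped == other

    def __hash__(self):
        return hash(self._wrapped)


name = ScalarProxy("abc")
print(name.upper())        # ABC
try:
    list(name)
except TypeError as exc:
    print("TypeError:", exc)
```

A real implementation would need to forward many more special
methods (formatting, ordering, concatenation and so on), which is
the kind of detail a generic proxy library already takes care of.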

Not-so-coincidentally, that approach would also give us a de facto
"code point" type: it would be the result of applying the scalar proxy
to a length 1 str instance.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] discontinue iterable strings

2016-08-22 Thread Stephen J. Turnbull
Nick Coghlan writes:

 > However, the real problem with this proposal (and the reason why the
 > switch from 8-bit str to "bytes are effectively a tuple of ints" in
 > Python 3 was such a pain), is that there are a lot of bytes and text
 > processing operations that *really do* operate code point by code
 > point.

Sure, but code points aren't strings in any language I use except
Python.  And AFAIK strings are the only case in Python where a
singleton *is* an element, and an element *is* a singleton.  (Except
it isn't: "ord('ab')" is a TypeError, even though "type('a')" returns
"<class 'str'>".)

I thought this was cute when I first encountered it (it happens that I
was studying how you can embed a set of elements into the semigroup of
sequences of such elements in algebra at the time), but it has *never*
been of practical use to me that indexing or iterating a str returns
str (rather than a code point).  "''.join(list('abc'))" being an
identity is an interesting, and maybe useful, fact, but I've never
missed it in languages that distinguish characters from strings.
Perhaps that's because they generally have a split function defined so
that "''.join('abc'.split(''))" is also available for that identity.
(N.B. Python doesn't accept an empty separator, but Emacs Lisp does,
where "'abc'.split('')" returns "['', 'a', 'b', 'c', '']".  I guess
it's too late to make this change, though.)
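For reference, the str behaviour described above can be checked
with a short Python 3 session-style script. This is only a sketch of
the points made in the text; the variable names are arbitrary.

```python
text = "abc"

# Indexing (and iterating) a str yields length-1 str objects, so the
# "element" can be indexed again and again.
first = text[0]
print(type(first) is str)                    # True
print(first == first[0] == first[0][0])      # True

# ...but not every str is a code point: ord() rejects length 2.
try:
    ord("ab")
except TypeError as exc:
    print("TypeError:", exc)

# Splitting into a list and joining back is an identity.
print("".join(list(text)) == text)           # True

# Python does not accept an empty separator for split().
try:
    text.split("")
except ValueError as exc:
    print("ValueError:", exc)
```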

The reason that switching to bytes is a pain is that we changed the
return type of indexing bytes to something requiring conversion of
literals.  You can't write "bytething[i] == b'a'", you need to write
"bytething[i] == ord(b'a')", and "b''.join(list(b'abc'))" is an error,
not an identity.  Of course the world broke!
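The bytes side of the comparison looks like this in Python 3. Again
this is only an illustrative sketch; "data" is an arbitrary name,
and the slicing and "bytes(...)" round trips at the end are the
usual workarounds rather than anything proposed in this thread.

```python
data = b"abc"

# Indexing bytes yields an int, not a length-1 bytes object.
print(data[0])                     # 97
print(data[0] == b"a")             # False: an int never equals bytes
print(data[0] == ord(b"a"))        # True

# Joining the elements back together fails, because they are ints.
try:
    b"".join(list(data))
except TypeError as exc:
    print("TypeError:", exc)

# Workarounds: slice instead of indexing, or rebuild with bytes().
print(data[0:1] == b"a")           # True
print(bytes(list(data)) == data)   # True
```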

 > But we're not designing a language from scratch - we're iterating
 > on one with a 25 year history of design, development, and use.

+1 to that.
