Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Nick Coghlan
On 7 Jun 2014 00:53, "Paul Sokolovsky"  wrote:
>
> Yes. Except for one small detail - Python3 specifies these code points
> to be Unicode code points. And Unicode is a very bloated thing.

I rather suspect users of East Asian & African scripts might have a
different notion of what constitutes "bloated" vs "can actually represent
this language properly, unlike 8-bit code spaces".

> But if we drop that "Unicode" stipulation, then it's also exactly what
> MicroPython implements. Its "str" type consists of codepoints, we don't
> have pet names for them yet, like Unicode does, but their numeric
> values are 0-255. Note that it in no way limits encodings, characters,
> or scripts which can be used with MicroPython, because just like
> Unicode, it supports the concept of "surrogate pairs" (but we don't call it
> like that) - specifically, smaller code points may comprise bigger
> groupings. But unlike Unicode, we don't stipulate format, value or
> other constraints on how these "surrogate pairs"-alikes are formed,
> leaving that to users.

This is effectively what the Python 2 str type does, and it's a recipe for
data driven latent defects. You inevitably end up concatenating strings
using different code spaces, or else splitting strings between surrogate
pairs rather than on the proper boundaries, etc.
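A minimal sketch of the two failure modes described above, using Python 3
bytes to stand in for a Python 2 style str whose units are 0-255. The sample
words and encodings are illustrative assumptions, not taken from the thread.

```python
# Two byte strings that "look like text" but use different code spaces.
latin1 = "café".encode("latin-1")   # 4 bytes, 1 per character
utf8 = "日本".encode("utf-8")         # 6 bytes, 3 per character

# Concatenation succeeds silently; the defect only surfaces on decode.
mixed = latin1 + utf8
try:
    mixed.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Slicing by byte offset can cut a multi-byte character in half.
half = utf8[:4]                      # one full character plus one stray byte
try:
    half.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(decoded_ok, split_ok)
```

Both operations succeed at the point where the data is combined or cut, and
fail only later when something tries to interpret the result, which is what
makes the defect latent and data driven.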

The abstraction presented to users by the str type *must* be the full range
of Unicode code points as atomic units. Storing those internally as UTF-8
rather than as fixed width code points as CPython does is an experiment
worth trying, since you don't have the same C level backwards compatibility
constraints we do. But limiting the str type to a single code page per
process is not an acceptable constraint in a Python 3 implementation.

Regards,
Nick.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Tim Delaney
On 7 June 2014 00:52, Paul Sokolovsky  wrote:

> > At heart, this is exactly what the Python 3 "str" type is. The
> > universal convention is "code points".
>
> Yes. Except for one small detail - Python3 specifies these code points
> to be Unicode code points. And Unicode is a very bloated thing.
>
> But if we drop that "Unicode" stipulation, then it's also exactly what
> MicroPython implements. Its "str" type consists of codepoints, we don't
> have pet names for them yet, like Unicode does, but their numeric
> values are 0-255. Note that it in no way limits encodings, characters,
> or scripts which can be used with MicroPython, because just like
> Unicode, it supports the concept of "surrogate pairs" (but we don't call it
> like that) - specifically, smaller code points may comprise bigger
> groupings. But unlike Unicode, we don't stipulate format, value or
> other constraints on how these "surrogate pairs"-alikes are formed,
> leaving that to users.


I think you've missed my point.

There is absolutely nothing conceptually bloaty about what a Python 3
string is. It's just like a 7-bit ASCII string, except each entry can be
from a larger table. When you index into a Python 3 string, you get back
exactly *one valid entry* from the Unicode code point table. That plus the
length of the string, plus the guarantee of immutability gives everything
needed to layer the rest of the string functionality on top.

There are no surrogate pairs - each code point is standalone (unlike code
*units*). It is conceptually very simple. The implementation may be
difficult (if you're trying to do better than 4 bytes per code point) but
the concept is dead simple.
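The distinction between code points and code units can be sketched in
CPython 3 as follows. The sample string is an assumption chosen so that it
contains one code point outside the Basic Multilingual Plane.

```python
s = "a\U0001F600b"            # 'a', one astral code point (U+1F600), 'b'

# The str abstraction: three atomic code points, each indexable on its own.
length = len(s)
middle = ord(s[1])

# The same text as UTF-16 code units: the astral code point needs two units.
utf16 = s.encode("utf-16-le")
utf16_units = len(utf16) // 2

# And as UTF-8 code units (bytes): 1 + 4 + 1.
utf8_units = len(s.encode("utf-8"))

print(length, hex(middle), utf16_units, utf8_units)
```

Indexing the str never returns half of a pair; the pairs exist only in the
encoded forms.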

If the MicroPython string type requires people *using* it to deal with
surrogates (i.e. indexing could return a value that is not a valid Unicode
code point) then it will have broken the conceptual simplicity of the
Python 3 string type (and most certainly can't be considered in any way
compatible).

Tim Delaney


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Fri, 06 Jun 2014 11:59:31 -0400
Terry Reedy  wrote:

[]

> The other problem is that a small slice view of a large object keeps
> the large object alive, so a view user needs to think carefully about 
> whether to make a copy or create a view, and later to copy views to 
> delete the base object. This is not for beginners.

Yes, so it doesn't make sense to add such a feature to any of the existing
APIs. However, as I pointed out in another mail, it would make a lot of sense
to add an iterator-based string API (because if dict methods were
*switched* to iterators, why can't strings have them *as an alternative*),
and it would be natural for those iterators to return "string
views", especially if it's clearly and explicitly documented that a
user who wants to store them should copy them explicitly via
str(view).
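A rough sketch of what such an API could look like. StrView and iter_split
are hypothetical names invented for this illustration; no such type or
function exists in CPython or MicroPython. re.finditer is used only to find
the word boundaries without building a list of substrings.

```python
import re


class StrView:
    """Hypothetical view: (base string, start, stop). Not a real API."""
    __slots__ = ("base", "start", "stop")

    def __init__(self, base, start, stop):
        self.base, self.start, self.stop = base, start, stop

    def __len__(self):
        return self.stop - self.start

    def __str__(self):
        # Explicit copy requested by the user via str(view).
        return self.base[self.start:self.stop]


def iter_split(text):
    """Yield a StrView for each whitespace-separated word."""
    for match in re.finditer(r"\S+", text):
        yield StrView(text, match.start(), match.end())


views = list(iter_split("alpha beta  gamma"))
words = [str(v) for v in views]   # explicit copies, safe to store
print(words)
```

Each view holds a reference to the whole base string, so storing views rather
than str(view) keeps the base alive; that is the trade-off Terry describes.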

One reason against this would of course be API bloat. But API bloat
happens all the time; for example, compare this modest proposal
http://bugs.python.org/issue21180 with what's going to be actually
implemented:
http://legacy.python.org/dev/peps/pep-0467/#alternate-constructors .


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Hrvoje Niksic

On 06/06/2014 05:59 PM, Terry Reedy wrote:

> The other problem is that a small slice view of a large object keeps the
> large object alive, so a view user needs to think carefully about
> whether to make a copy or create a view, and later to copy views to
> delete the base object. This is not for beginners.


And this was important enough that Java 7 actually removed the 
long-standing feature of String.substring creating a string that shares 
the character array with the original.


http://java-performance.info/changes-to-string-java-1-7-0_06/



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Terry Reedy

On 6/6/2014 4:53 AM, Hrvoje Niksic wrote:

> On 06/04/2014 05:52 PM, Mark Lawrence wrote:
>
> > Out of idle curiosity is there anything that stops MicroPython, or any
> > other implementation for that matter, from providing views of a string
> > rather than copying every time?  IIRC memoryviews in CPython rely on the
> > buffer protocol at the C API level, so since strings don't support this
> > protocol you can't take a memoryview of them.  Could this actually be
> > implemented in the future, is the underlying C code just too
> > complicated, or what?
>
> Memory view of Unicode strings is controversial for two reasons:
>
> 1. It exposes the internal representation of the string. If memoryviews
> of strings were supported in Python 3, PEP 393 would not have been
> possible (without breaking that feature).
>
> 2. Even if it were OK to expose the internal representation, it might
> not be what the users expect. For example, memoryview("Hrvoje") would
> return a view of a 6-byte buffer, while memoryview("Nikšić") would
> return a view of a 12-byte UCS-2 buffer. The user of a memory view might
> expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.
>
> An implementation that decided to export strings as memory views might
> be forced to make a decision about internal representation of strings,
> and then stick to it.
>
> The byte objects don't have these issues, which is why in Python 2.7
> memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.


The other problem is that a small slice view of a large object keeps the 
large object alive, so a view user needs to think carefully about 
whether to make a copy or create a view, and later to copy views to 
delete the base object. This is not for beginners.
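The lifetime issue can be seen with the views that already exist for bytes.
A small sketch, assuming CPython 3.3 or later (for memoryview.obj); the
buffer size is an arbitrary choice.

```python
big = bytes(1_000_000)             # about 1 MB of zero bytes
view = memoryview(big)[:16]        # a 16-byte window; no data is copied

# The view references the whole base object, not just its 16 bytes.
keeps_base = view.obj is big

# To let the base go, copy the bytes out and release the view.
small = bytes(view)
view.release()
del big

print(keeps_base, len(small))
```

Until the copy is made and the view released, the full megabyte stays
reachable through the 16-byte view.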


--
Terry Jan Reedy




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Chris Angelico
On Fri, Jun 6, 2014 at 8:15 PM, Paul Sokolovsky  wrote:
> I'm sorry if I was somehow related to that, my
> bringing in the formal language spec was more a rhetorical figure, a
> response to people claiming O(1) requirement.

This was exactly why this whole discussion came up, though. We were
debating on the uPy bug tracker about how important O(1) indexing is;
I then came to python-list to try to get some solid data from which to
debate; and then the discussion jumped here to python-dev for more
solid explanations. The spec wasn't perfectly clear, and now it's
being made clearer: O(N) indexing does not violate Python's spec, ergo
uPy is allowed to use UTF-8 as its internal representation, as long as
script-visible behaviour is correct. It'll be interesting to see when
it's done (I'm currently working on that implementation, bit by bit)
and to run the CPython benchmarks on it.
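To make the O(N) trade-off concrete, here is a minimal sketch of indexing by
code point over UTF-8 storage. utf8_index is a hypothetical helper written
for this illustration; it is not MicroPython's implementation, and it assumes
the input is valid UTF-8.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 data by scanning from the start."""
    if n < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:        # not a continuation byte: a code point starts
            if start is not None:
                return data[start:pos].decode("utf-8")
            count += 1
            if count == n:
                start = pos
    if start is not None:
        return data[start:].decode("utf-8")
    raise IndexError("string index out of range")


text = "aš€𝄞"                          # 1-, 2-, 3- and 4-byte sequences
stored = text.encode("utf-8")
print([utf8_index(stored, i) for i in range(len(text))])
```

The script-visible result matches text[i]; only the cost differs, since each
lookup walks the bytes from the beginning.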

It's been a fruitful and interesting discussion, and the formal
language spec is key to it. No need to apologize!

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Fri, 6 Jun 2014 21:48:41 +1000
Tim Delaney  wrote:

> On 6 June 2014 21:34, Paul Sokolovsky  wrote:
> 
> >
> > On Fri, 06 Jun 2014 20:11:27 +0900
> > "Stephen J. Turnbull"  wrote:
> >
> > > Paul Sokolovsky writes:
> > >
> > >  > That kinda means "string is atomic", instead of your
> > >  > "characters are atomic".
> > >
> > > I would be very surprised if a language that behaved that way was
> > > called a "Python subset".  No indexing, no slicing, no regexps, no
> > > .split(), no .startswith(), no sorted() or .sort(), ...!?
> > >
> > > If that's not what you mean by "string is atomic", I think you're
> > > using very confusing terminology.
> >
> > I'm sorry if I didn't mention it, or didn't make it clear enough -
> > it's all about layering.
> >
> > On level 0, you treat strings verbatim, and can write some subset of
> > apps (my point is that even this level allows to write lot enough
> > apps). Let's call this set A0.
> >
> > On level 1, you accept that there's some universal enough
> > conventions for some chars, like space or newline. And you can
> > write set of apps A1 > A0.
> >
> 
> At heart, this is exactly what the Python 3 "str" type is. The
> universal convention is "code points". 

Yes. Except for one small detail - Python3 specifies these code points
to be Unicode code points. And Unicode is a very bloated thing.

But if we drop that "Unicode" stipulation, then it's also exactly what
MicroPython implements. Its "str" type consists of codepoints, we don't
have pet names for them yet, like Unicode does, but their numeric
values are 0-255. Note that it in no way limits encodings, characters,
or scripts which can be used with MicroPython, because just like
Unicode, it supports the concept of "surrogate pairs" (but we don't call it
like that) - specifically, smaller code points may comprise bigger
groupings. But unlike Unicode, we don't stipulate format, value or
other constraints on how these "surrogate pairs"-alikes are formed,
leaving that to users.


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Mark Lawrence

On 06/06/2014 09:53, Hrvoje Niksic wrote:

> On 06/04/2014 05:52 PM, Mark Lawrence wrote:
>
> > On 04/06/2014 16:32, Steve Dower wrote:
> >
> > > If copying into a separate list is a problem (memory-wise),
> > > re.finditer('\\S+', string) also provides the same behaviour and
> > > gives me the sliced string, so there's no need to index for anything.
> >
> > Out of idle curiosity is there anything that stops MicroPython, or any
> > other implementation for that matter, from providing views of a string
> > rather than copying every time?  IIRC memoryviews in CPython rely on the
> > buffer protocol at the C API level, so since strings don't support this
> > protocol you can't take a memoryview of them.  Could this actually be
> > implemented in the future, is the underlying C code just too
> > complicated, or what?
>
> Memory view of Unicode strings is controversial for two reasons:
>
> 1. It exposes the internal representation of the string. If memoryviews
> of strings were supported in Python 3, PEP 393 would not have been
> possible (without breaking that feature).
>
> 2. Even if it were OK to expose the internal representation, it might
> not be what the users expect. For example, memoryview("Hrvoje") would
> return a view of a 6-byte buffer, while memoryview("Nikšić") would
> return a view of a 12-byte UCS-2 buffer. The user of a memory view might
> expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.
>
> An implementation that decided to export strings as memory views might
> be forced to make a decision about internal representation of strings,
> and then stick to it.
>
> The byte objects don't have these issues, which is why in Python 2.7
> memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.



Thanks for the explanation :)
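The behaviour described in the explanation can be checked from Python 3. In
this sketch the explicit encode() calls stand in for the internal layouts
mentioned above; they are not a view of CPython's actual buffers.

```python
# bytes expose a buffer, so a memoryview works directly.
view = memoryview(b"foo")
byte_count = view.nbytes

# str does not support the buffer protocol in Python 3.
try:
    memoryview("foo")
    str_view_ok = True
except TypeError:
    str_view_ok = False

# An explicit encoding makes the byte layout a deliberate choice.
latin1_len = len("Hrvoje".encode("latin-1"))     # 1 byte per character
ucs2_len = len("Nikšić".encode("utf-16-le"))     # 2 bytes per character
utf8_len = len("Nikšić".encode("utf-8"))         # 1 or 2 bytes per character

print(byte_count, str_view_ok, latin1_len, ucs2_len, utf8_len)
```

The three lengths differ for text of the same six characters, which is why a
memoryview of str would have to commit to one representation.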

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Fri, 06 Jun 2014 09:32:25 +0100
Mark Lawrence  wrote:

> On 04/06/2014 16:52, Mark Lawrence wrote:
> > On 04/06/2014 16:32, Steve Dower wrote:
> >>
> >> If copying into a separate list is a problem (memory-wise),
> >> re.finditer('\\S+', string) also provides the same behaviour and
> >> gives me the sliced string, so there's no need to index for
> >> anything.
> >>
> >
> > Out of idle curiosity is there anything that stops MicroPython, or
> > any other implementation for that matter, from providing views of a
> > string rather than copying every time?  IIRC memoryviews in CPython
> > rely on the buffer protocol at the C API level, so since strings
> > don't support this protocol you can't take a memoryview of them.
> > Could this actually be implemented in the future, is the underlying
> > C code just too complicated, or what?
> >
> 
> Anybody?

I'd like to address this, and other buffer manipulation
optimization ideas I have for MicroPython, at some later time. But as
you suggest, it would be possible to transparently have
"strings-by-reference". The reason MicroPython doesn't have them so
far (and why I, as a uPy contributor, am not ready to discuss them) is
that they're an optimization, and everyone knows what premature
optimization is.

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Tim Delaney
On 6 June 2014 21:34, Paul Sokolovsky  wrote:

>
> On Fri, 06 Jun 2014 20:11:27 +0900
> "Stephen J. Turnbull"  wrote:
>
> > Paul Sokolovsky writes:
> >
> >  > That kinda means "string is atomic", instead of your "characters
> >  > are atomic".
> >
> > I would be very surprised if a language that behaved that way was
> > called a "Python subset".  No indexing, no slicing, no regexps, no
> > .split(), no .startswith(), no sorted() or .sort(), ...!?
> >
> > If that's not what you mean by "string is atomic", I think you're
> > using very confusing terminology.
>
> I'm sorry if I didn't mention it, or didn't make it clear enough - it's
> all about layering.
>
> On level 0, you treat strings verbatim, and can write some subset of
> apps (my point is that even this level allows to write lot enough
> apps). Let's call this set A0.
>
> On level 1, you accept that there's some universal enough conventions
> for some chars, like space or newline. And you can write set of
> apps A1 > A0.
>

At heart, this is exactly what the Python 3 "str" type is. The universal
convention is "code points". It's got nothing to do with encodings, or
bytes. A Python string is simply a finite sequence of atomic code points -
it is indexable, and it has a length. Once you have that, everything is
layered on top of it. How the code points themselves are implemented is
opaque and irrelevant other than the memory and performance consequences of
the implementation decisions (for example, a string could be indexable by
iterating from the start until you find the nth code point).

Similarly the "bytes" type is a sequence of 8-bit bytes.

Encodings are simply a way to transport code points via a byte-oriented
transport.
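A short sketch of that separation: one str, three byte transports. The sample
word and the choice of codecs are assumptions made for the illustration.

```python
text = "Nikšić"

# The same six code points, carried over three different byte transports.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
as_latin2 = text.encode("iso-8859-2")

# Decoding with the matching codec always recovers the same str.
round_trips = [
    as_utf8.decode("utf-8"),
    as_utf16.decode("utf-16-le"),
    as_latin2.decode("iso-8859-2"),
]

print(len(text), len(as_utf8), len(as_utf16), len(as_latin2))
```

The byte lengths vary with the transport; len(text) and text[i] do not.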

Tim Delaney


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Nick Coghlan
On 6 June 2014 21:15, Paul Sokolovsky  wrote:
> Hello,
>
> On Thu, 5 Jun 2014 23:15:54 +1000
> Nick Coghlan  wrote:
>
>> On 5 June 2014 22:37, Paul Sokolovsky  wrote:
>> > On Thu, 5 Jun 2014 22:20:04 +1000
>> > Nick Coghlan  wrote:
>> >> problems caused by trusting the locale encoding to be correct, but
>> >> the startup code will need non-trivial changes for that to happen
>> >> - the C.UTF-8 locale may even become widespread before we get
>> >> there).
>> >
>> > ... And until those golden times come, it would be nice if Python
>> > did not force its perfect world model, which unfortunately is not
>> > based on surrounding reality, and let users solve their encoding
>> > problems themselves - when they need, because again, one can go
>> > quite a long way without dealing with encodings at all. Whereas now
>> > Python3 forces users to deal with encoding almost universally, but
>> > forcing a particular for all strings (which is again, doesn't
>> > correspond to the state of surrounding reality). I already hear
>> > response that it's good that users taught to deal with encoding,
>> > that will make them write correct programs, but that's a bit far
>> > away from the original aim of making it write "correct" programs
>> > easy and pleasant. (And definition of "correct" vary.)
>>
>> As I've said before in other contexts, find me Windows, Mac OS X and
>> JVM developers, or educators and scientists that are as concerned by
>> the text model changes as folks that are primarily focused on Linux
>> system (including network) programming, and I'll be more willing to
>> concede the point.
>
> Well, but this question reduces to finding out (or specifying) who are
> target audiences of Python. It always has been (with a bow to Guido)
> forpost of scientific users (and probably even if there was mass exodus
> of other categories of users will remain prominent in that role). But
> Python has always had its share as system scripting language among
> Perl-haters, and with Perl going flatline, I guess it's fair to say
> that Python is major system scripting and service implementation
> language.

Correct - and the efforts of a number of core developers are focused
on getting the Linux distros and major projects like OpenStack
migrated. If other Linux users say "I'm not switching to Python 3
until after my distro has switched their own Python applications
over", that's a perfectly reasonable course of action for them to
take. After all, that approach to the adoption of new Python versions
is a large part of why Python 2.6 is still so widely supported by
library and framework developers: enterprise Linux distros haven't
even finished migrating to Python 2.7 yet, let alone Python 3. (The
other reason is that the language moratorium that was applied to
Python 2.7 and 3.2 means that supporting back to Python 2.6 isn't that
much harder than supporting 2.7 at this point in time).

That said, the feedback from the early adopters of Python 3 on Linux
is proving invaluable, and Linux users in general will benefit from
their work as the distros move their infrastructure applications over.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Fri, 06 Jun 2014 20:11:27 +0900
"Stephen J. Turnbull"  wrote:

> Paul Sokolovsky writes:
> 
>  > That kinda means "string is atomic", instead of your "characters
>  > are atomic".
> 
> I would be very surprised if a language that behaved that way was
> called a "Python subset".  No indexing, no slicing, no regexps, no
> .split(), no .startswith(), no sorted() or .sort(), ...!?
> 
> If that's not what you mean by "string is atomic", I think you're
> using very confusing terminology.

I'm sorry if I didn't mention it, or didn't make it clear enough - it's
all about layering.

On level 0, you treat strings verbatim, and can write some subset of
apps (my point is that even this level allows writing quite a lot of
apps). Let's call this set A0.

On level 1, you accept that there are some universal enough conventions
for some chars, like space or newline. And you can write a set of
apps A1 > A0.

On level 2, you add len(), and - oh magic - you can now center a string
within a fixed-size field, something you probably do as often as once a
month, so hopefully that will keep you busy for a few.

On level 3, it indeed starts to smell of Unicode: we get isdigit() and
isalpha(), which require long boring tables, which hopefully can be
compressed enough to fit in your pocket.

On level 4, it's pumping up, with tolower() and friends, the tables for
which you carry around in a suitcase.

On level 5, everything is Unicode, what bliss! You can even start
pretending that no other levels exist (God created Unicode on the second
day).

On level 6, there are mind-boggling, ugly manual-use utilities to deal
with the internals of the "magic", "working on its own for everyone"
encoding: stuff like code point vs character vs surrogate pair
vs grapheme separation, etc.



So, once again, for me and some other people, it's not that bright an idea
to shoot for level 5 if levels 0-4 exist and are a well-proven pragmatic
model. And level 6 is still there anyway.
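What levels 2 and 3 depend on can be sketched by running the same operations
on a str and on its UTF-8 bytes, the latter standing in for a string type
whose units are 0-255. The sample word and field width are assumptions.

```python
name = "Nikšić"
raw = name.encode("utf-8")           # the 0-255 view of the same text

# Level 2: len() counts code points for str, but bytes for the raw form.
str_len, raw_len = len(name), len(raw)

# Centering in a 10-column field: the padding is computed from len().
centered_str = name.center(10, ".")
centered_raw = raw.center(10, b".").decode("utf-8")

# Level 3: classification needs character tables.
is_alpha_str = name.isalpha()
is_alpha_raw = raw.isalpha()         # bytes.isalpha() only knows ASCII letters

print(centered_str, centered_raw, is_alpha_str, is_alpha_raw)
```

The byte-based field comes out two columns short once decoded, because two of
the six characters occupy two bytes each.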


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 23:15:54 +1000
Nick Coghlan  wrote:

> On 5 June 2014 22:37, Paul Sokolovsky  wrote:
> > On Thu, 5 Jun 2014 22:20:04 +1000
> > Nick Coghlan  wrote:
> >> problems caused by trusting the locale encoding to be correct, but
> >> the startup code will need non-trivial changes for that to happen
> >> - the C.UTF-8 locale may even become widespread before we get
> >> there).
> >
> > ... And until those golden times come, it would be nice if Python
> > did not force its perfect world model, which unfortunately is not
> > based on surrounding reality, and let users solve their encoding
> > problems themselves - when they need, because again, one can go
> > quite a long way without dealing with encodings at all. Whereas now
> > Python3 forces users to deal with encoding almost universally, but
> > forcing a particular for all strings (which is again, doesn't
> > correspond to the state of surrounding reality). I already hear
> > response that it's good that users taught to deal with encoding,
> > that will make them write correct programs, but that's a bit far
> > away from the original aim of making it write "correct" programs
> > easy and pleasant. (And definition of "correct" vary.)
> 
> As I've said before in other contexts, find me Windows, Mac OS X and
> JVM developers, or educators and scientists that are as concerned by
> the text model changes as folks that are primarily focused on Linux
> system (including network) programming, and I'll be more willing to
> concede the point.

Well, but this question reduces to finding out (or specifying) who the
target audiences of Python are. It has always been (with a bow to Guido)
an outpost of scientific users (and probably, even if there were a mass
exodus of other categories of users, it will remain prominent in that role).
But Python has always had its share as a system scripting language among
Perl-haters, and with Perl going flatline, I guess it's fair to say
that Python is a major system scripting and service implementation
language.

Whom do features like memoryview, array.array, in-place
input operations, etc. cater to? Scientists? I'm sure most of them are
just happy with sticking "@jit" on their kernel functions. And
scientists who bother with memoryviews for their data structures are
system-level-ish programmers too.

So, no wonder that the Linux crowd cries at Python3 - it makes doing simple
things unnecessarily complicated.

> Windows, Mac OS X, and the JVM are all opinionated about the text
> encodings to be used at platform boundaries (using UTF-16, UTF-8 and
> UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX)
> says "well, it's configurable, but we won't provide a reliable
> mechanism for finding out what the encoding is. So either guess as

[]

Yes, I understand the complexity of developing a cross-platform language
with advanced features. But I may offer another look at all this activity:
Python3 was brave enough to make a revolution in its own world (catching a
lot of its users by surprise), but surely not brave enough to make a
revolution around itself, by saying something like "We choose ONE, the
most right, and even the most used (per bytes transferred) encoding as
our standard I/O encoding. Grow up or explicitly specify the encoding which
you personally need.".

Surely, it didn't do that - it makes no sense to fight the world. But
then Python3 is sympathetic to Java's desire to use "UTF-16" instead
of the "right" encoding, and not so to the Unix desire to treat encodings
as a separate level from content (and to treat Unicode as nothing other
than yet another arbitrary encoding, which it is formally, and will be
de facto for a long time, however sad that is). So, maybe "cross-platform"
should have meant "don't do implicit conversions". Because see, Python2
had a problem with implicit encoding conversion when str and unicode
objects were mixed, and Python3 has a problem with implicit conversions
whenever str is used at all.
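For reference, Python 3 does allow the conversion at the I/O boundary to be
made explicit rather than guessed. A minimal sketch using an in-memory
stream; the sample text and the choice of UTF-8 are assumptions.

```python
import io
import locale

# What Python would guess for text I/O when no encoding is given.
default_encoding = locale.getpreferredencoding(False)

# Explicit encoding at the boundary: no guess involved.
raw_stream = io.BytesIO()
writer = io.TextIOWrapper(raw_stream, encoding="utf-8", newline="\n")
writer.write("Nikšić\n")
writer.flush()
wire_bytes = raw_stream.getvalue()

# Reading back: bytes stay bytes until decoded with a named codec.
text = wire_bytes.decode("utf-8")

print(default_encoding, wire_bytes, text)
```

The same encoding= argument is accepted by open(), so a program that names
its encodings does not depend on the locale setting at all.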


Anyway, I appreciate the detailed responses, and understand what you
(Python3 developers) are trying to achieve, and appreciate your work,
and hope it all works out. Each user has their own concerns about Unicode.
Mine are efficiency and layering. But once MicroPython has UTF-8 support
I will be much more relaxed about it. Layering is harder to accept, but
hopefully can be tackled too, both in my own mind and on the technical
side. I hope other users will find their peace with Unicode too!


[]


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Stephen J. Turnbull
Paul Sokolovsky writes:

 > That kinda means "string is atomic", instead of your "characters are
 > atomic".

I would be very surprised if a language that behaved that way was
called a "Python subset".  No indexing, no slicing, no regexps, no
.split(), no .startswith(), no sorted() or .sort(), ...!?

If that's not what you mean by "string is atomic", I think you're
using very confusing terminology.





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Greg Ewing

Steven D'Aprano wrote:
> I don't know about car engine controllers, but presumably they have
> diagnostic ports, and they may sometimes output text. If they output
> text, then at least hypothetically car mechanics in Russia might prefer
> their car to output "правда" and "ложный" rather than "true" and
> "false".


From a bit of googling, it seems that engine controller
diagnostic ports typically speak some kind of binary
protocol. So it would be up to the software running on
whatever was plugged into the port to display the
information in the user's native language.

E.g. this document lists a big pile of hex byte values
and little or no text that I can see:

https://law.resource.org/pub/us/cfr/ibr/005/sae.j1979.2002.pdf

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 22:38:13 +1000
Nick Coghlan  wrote:

> On 5 June 2014 22:10, Stefan Krah  wrote:
> > Paul Sokolovsky  wrote:
> >> In this regard, I'm glad to participate in mind-resetting
> >> discussion. So, let's reiterate - there's nothing like "the best",
> >> "the only right", "the only correct", "righter than", "more
> >> correct than" in CPython's implementation of Unicode storage. It
> >> is *arbitrary*. Well, sure, it's not arbitrary, but based on
> >> requirements, and these requirements match CPython's (implied)
> >> usage model well enough. But among all possible sets of
> >> requirements, CPython's requirements are no more valid that other
> >> possible. And other set of requirement fairly clearly lead to
> >> situation where CPython implementation is rejected as not correct
> >> for those requirements at all.
> >
> > Several core-devs have said that using UTF-8 for MicroPython is
> > perfectly okay. I also think it's the right choice and I hope that
> > you guys come up with a very efficient implementation.
> 
> Based on this discussion, I've also posted a draft patch aimed at
> clarifying the relevant aspects of the data model section of the
> language reference (http://bugs.python.org/issue21667).

Thanks, it's very much appreciated. Though the discussion there opened
another can of worms. I'm sorry if I was somehow responsible for that; my
bringing in the formal language spec was more of a rhetorical figure, a
response to people claiming an O(1) requirement. So either it should be
in the spec, or the spec should be treated as such: something not specified
is underspecified and implementation-dependent. I'm glad that
the last point is now explicitly pronounced by the BDFL in the last comment
on that ticket (http://bugs.python.org/issue21667#msg219824).

> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 22:21:30 +1000
Tim Delaney  wrote:

> On 5 June 2014 22:01, Paul Sokolovsky  wrote:
> 
> >
> > All these changes are what let me dream on and speculate on
> > possibility that Python4 could offer an encoding-neutral string type
> > (which means based on bytes)
> >
> 
> To me, an "encoding neutral string type" means roughly "characters are
> atomic", and the best representation we have for a "character" is a

And for me it means exactly what the "encoding neutral string type" moniker
promises - that you should not make any assumption about its encoding.
That kinda means "string is atomic", instead of your "characters are
atomic". That's the most basic level, and you can write a big enough
set of applications using it - for example, get some information from
the user, store it in a database, then show it back to the user at a
later time.

[]

> 
> Cheers,
> 
> Tim Delaney



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Hrvoje Niksic

On 06/04/2014 05:52 PM, Mark Lawrence wrote:
> On 04/06/2014 16:32, Steve Dower wrote:
>> If copying into a separate list is a problem (memory-wise),
>> re.finditer('\\S+', string) also provides the same behaviour and gives
>> me the sliced string, so there's no need to index for anything.
>
> Out of idle curiosity is there anything that stops MicroPython, or any
> other implementation for that matter, from providing views of a string
> rather than copying every time?  IIRC memoryviews in CPython rely on the
> buffer protocol at the C API level, so since strings don't support this
> protocol you can't take a memoryview of them.  Could this actually be
> implemented in the future, is the underlying C code just too
> complicated, or what?



Memory view of Unicode strings is controversial for two reasons:

1. It exposes the internal representation of the string. If memoryviews 
of strings were supported in Python 3, PEP 393 would not have been 
possible (without breaking that feature).


2. Even if it were OK to expose the internal representation, it might 
not be what the users expect. For example, memoryview("Hrvoje") would 
return a view of a 6-byte buffer, while memoryview("Nikšić") would 
return a view of a 12-byte UCS-2 buffer. The user of a memory view might 
expect to get UCS-2 (or UCS-4, or even UTF-8) in all cases.


An implementation that decided to export strings as memory views might 
be forced to make a decision about internal representation of strings, 
and then stick to it.


The byte objects don't have these issues, which is why in Python 2.7 
memoryview("foo") works just fine, as does memoryview(b"foo") in Python 3.
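To make the contrast concrete, here is a minimal Python 3 sketch (not from the original post; the variable names are only illustrative). It shows that bytes export a buffer while str does not, and that an explicit encode step lets the caller pick the representation instead of depending on the interpreter's internal one:

```python
# bytes objects support the buffer protocol, so a zero-copy view works.
view = memoryview(b"foo")
first_two = bytes(view[:2])  # slicing the view copies nothing until bytes() is called

# str objects do not export a buffer in Python 3, so this raises TypeError.
try:
    memoryview("foo")
    str_view_allowed = True
except TypeError:
    str_view_allowed = False

# Encoding explicitly makes the byte layout the caller's choice.
utf8_len = len(memoryview("Nikšić".encode("utf-8")))
utf16_len = len(memoryview("Nikšić".encode("utf-16-le")))
```

With the encoding chosen explicitly, the size of the buffer no longer depends on which code points happen to be in the string.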




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-06 Thread Mark Lawrence

On 04/06/2014 16:52, Mark Lawrence wrote:
> On 04/06/2014 16:32, Steve Dower wrote:
>> If copying into a separate list is a problem (memory-wise),
>> re.finditer('\\S+', string) also provides the same behaviour and gives
>> me the sliced string, so there's no need to index for anything.
>
> Out of idle curiosity is there anything that stops MicroPython, or any
> other implementation for that matter, from providing views of a string
> rather than copying every time?  IIRC memoryviews in CPython rely on the
> buffer protocol at the C API level, so since strings don't support this
> protocol you can't take a memoryview of them.  Could this actually be
> implemented in the future, is the underlying C code just too
> complicated, or what?



Anybody?

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Steven D'Aprano
On Fri, Jun 06, 2014 at 12:51:11PM +1200, Greg Ewing wrote:
> Steven D'Aprano wrote:
> >(1) I asked if it would be okay for MicroPython to *optionally* use 
> >nominally Unicode strings limited to ASCII. Pretty much the only 
> >response to this has been Guido saying "That would be a pretty lousy 
> >option",
> 
> It would be limiting to have this as the *only* way of
> dealing with unicode, but I don't see anything wrong with
> having this available as an option for applications that
> truly don't need anything more than ascii. There must be
> plenty of those; the controller that runs my car engine,
> for example, doesn't exchange text with the outside world
> at all.

I don't know about car engine controllers, but presumably they have 
diagnostic ports, and they may sometimes output text. If they output 
text, then at least hypothetically car mechanics in Russia might prefer 
their car to output "правда" and "ложный" rather than "true" and 
"false". I think that opportunities for ASCII-only optimizations are 
shrinking, not getting bigger, as more people come to expect that their 
computing devices speak their language rather than Foreign.


> >The 
> >rationale of internal UTF-8 is that the use of any other encoding 
> >internally will be inefficient since those strings will need to be 
> >transcoded to UTF-8 before they can be written or printed,
> 
> No, I think the rationale is that UTF-8 is likely to use
> less memory than UTF-16 or UTF-32.

Right. I was talking about memory efficiency. Instead of this, which 
requires two copies of the string at one time:

1) accept UTF-8 bytes
2) transcode to internal representation
3) discard UTF-8 bytes

you could have:

1) accept UTF-8 bytes

and be done.
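The memory argument can be checked for a given string with a few lines of Python. This is only a sketch measuring encoded sizes, not a benchmark of any real implementation; the sample strings are taken from the example earlier in this message:

```python
# Encoded size in bytes of the same text under three encodings.
samples = {
    "ascii": "true",
    "cyrillic": "правда",
}
sizes = {
    name: {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    for name, text in samples.items()
}
for name, by_encoding in sizes.items():
    print(name, by_encoding)
```

For ASCII text UTF-8 is the smallest of the three; for Cyrillic it ties with UTF-16 and is still half the size of UTF-32.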


-- 
Steve


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Greg Ewing

Paul Sokolovsky wrote:

> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes)


Can you elaborate on exactly what you have in mind?
You seem to want something different from Python 3 str,
Python 3 bytes and Python 2 str, but it's far from
clear what you want this type to be like.

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Greg Ewing

Steven D'Aprano wrote:
> (1) I asked if it would be okay for MicroPython to *optionally* use
> nominally Unicode strings limited to ASCII. Pretty much the only
> response to this has been Guido saying "That would be a pretty lousy
> option",


It would be limiting to have this as the *only* way of
dealing with unicode, but I don't see anything wrong with
having this available as an option for applications that
truly don't need anything more than ascii. There must be
plenty of those; the controller that runs my car engine,
for example, doesn't exchange text with the outside world
at all.

> The rationale of internal UTF-8 is that the use of any other encoding
> internally will be inefficient since those strings will need to be
> transcoded to UTF-8 before they can be written or printed,


No, I think the rationale is that UTF-8 is likely to use
less memory than UTF-16 or UTF-32.

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 6 Jun 2014 05:13, "Glenn Linderman"  wrote:
>
> On 6/5/2014 11:41 AM, Daniel Holth wrote:
>>
>> discover new things
>> like dance-encoded strings, bytes decoded using an incorrect encoding
>> intended to be transcoded into the correct encoding later, surrogates
>> that work perfectly until .encode(), str(bytes), APIs that disagree
>> with you about whether the result should be str or bytes, APIs that
>> return either string or bytes depending on their initializers and so
>> on. Unicode can still be complicated in Python 3 independent of any
>> judgement about whether it is worse, better, or different than Python
>> 2.
>
> Yes, people can find ways to write bad code in any language.

Note that several of the issues Daniel mentions here are due to the lack of
reliable encoding settings on Linux and the challenges of the Py2->3
migration, rather than users writing bad code. Several of them represent
bugs to be fixed or serve as indicators of missing features that would make
it easier to work around an imperfect world.

Cheers,
Nick.



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Antoine Pitrou

On 04/06/2014 02:51, Chris Angelico wrote:
> On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan  wrote:
>
> It would. The downsides of a UTF-8 representation would be slower
> iteration and much slower (O(N)) indexing/slicing.


There's no reason for iteration to be slower. Slicing would get
O(slice offset + slice size) instead of O(slice size).
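To illustrate where that cost comes from, here is a rough pure-Python sketch of slicing by code point position over a UTF-8 buffer. The helper names `utf8_offset` and `utf8_slice` are hypothetical, the input is assumed to be valid UTF-8 with 0 <= start <= stop, and a real implementation would be written in C:

```python
def utf8_offset(buf: bytes, index: int, start: int = 0) -> int:
    """Byte offset of the index-th code point, counting from byte `start`."""
    seen = 0
    for pos in range(start, len(buf)):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if buf[pos] & 0xC0 != 0x80:
            if seen == index:
                return pos
            seen += 1
    if seen == index:
        return len(buf)  # position one past the last code point
    raise IndexError("code point index out of range")


def utf8_slice(buf: bytes, start: int, stop: int) -> bytes:
    """Slice by code point positions; scans only up to the end of the slice."""
    begin = utf8_offset(buf, start)                # O(slice offset)
    end = utf8_offset(buf, stop - start, begin)    # O(slice size)
    return buf[begin:end]


data = "Nikšić".encode("utf-8")
middle = utf8_slice(data, 3, 5).decode("utf-8")
tail = utf8_slice(data, 4, 6).decode("utf-8")
```

The scan stops at the end of the requested slice, so the rest of the buffer is never touched.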

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Glenn Linderman

On 6/5/2014 11:41 AM, Daniel Holth wrote:

> discover new things
> like dance-encoded strings, bytes decoded using an incorrect encoding
> intended to be transcoded into the correct encoding later, surrogates
> that work perfectly until .encode(), str(bytes), APIs that disagree
> with you about whether the result should be str or bytes, APIs that
> return either string or bytes depending on their initializers and so
> on. Unicode can still be complicated in Python 3 independent of any
> judgement about whether it is worse, better, or different than Python
> 2.

Yes, people can find ways to write bad code in any language.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Glenn Linderman

On 6/5/2014 3:10 AM, Paul Sokolovsky wrote:
> Hello,
>
> On Wed, 04 Jun 2014 22:15:30 -0400
> Terry Reedy  wrote:
>
>> think you are again batting at a strawman. If you mean 'read from a
>> file', and all you want to do is read bytes from and write bytes to
>> external 'files', then there is obviously no need to transcode and
>> neither Python 2 or 3 make you do so.
>
> But most files, network protocols are text-based, and I (and many other
> people) don't want to artificially use "binary data" type for them,
> with all attached funny things, like "b" prefix. And then Python2
> indeed doesn't transcode anything, and Python3 does, without being
> asked, and for no good purpose, because in most cases, Input data will
> be Output as-is (maybe in byte-boundary-split chunks).
>
> So, it all goes in rounds - ignoring the forced-Unicode problem (after a
> week of subscription to python-list, half of traffic there appear to be
> dedicated to Unicode-related flames) on python-dev behalf is not
> going to help (Python community).


If all your program is doing is reading and writing data (input data 
will be output as-is), then use of binary doesn't require "b" prefix, 
because you aren't manipulating the data. Then you have no unnecessary 
transcoding.


If you actually wish to examine or manipulate the content as it flows 
by, then there are choices.


1) If you need to examine/manipulate only a small fraction of text data 
with the file, you can pay the small price of a few "b" prefixes to get 
high performance, and explicitly transcode only the portions that need 
to be manipulated.


2) If you are examining the bulk of the data as it flows by, but not 
manipulating it, just examining/extracting, then a full transcoding may 
be useful for that purpose... but you can perhaps do it explicitly, so 
that you keep the binary form for I/O. Careful of the block boundaries, 
in this case, however.


3) If you are actually manipulating the bulk of the data, then the 
double transcoding (once on input, and once on output) allows you to 
work in units of codepoints, rather than bytes, which generally makes 
the manipulation algorithms easier.


4) If you truly cannot afford the processor cost of the double 
transcoding, and need to do all your manipulations at the byte level, 
then you could avoid the need for "b" prefix by use of a preprocessor 
for those sections of code that are doing all and only bytes 
processing... and you'll have lots of arcane, error-prone code to write 
to manipulate the bytes rather than the codepoints.


On the other hand, if you can convince your data sources and sinks to 
deal in UTF-8, and implement a UTF-8 str in μPy, then you can both avoid 
transcoding, and make the arcane algorithms part of the implementation 
of μPy rather than of the application code, and support full Unicode. 
And it seems to me that the world is moving that way... towards UTF-8 as 
the standard interchange format. Encourage it.
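As a sketch of option 1 above (pass most of the data through as bytes and transcode only the part that needs manipulating), something along these lines would do. The function name and the "first line is a header" convention are assumptions made up for the illustration, as is the 4096-byte chunk size:

```python
import io


def copy_with_upper_header(src, dst, encoding="utf-8"):
    """Copy a byte stream, transcoding only the first line (a hypothetical header)."""
    header = src.readline()  # bytes, up to and including the newline
    dst.write(header.decode(encoding).upper().encode(encoding))
    # The body passes through untouched, so it never needs to be valid text.
    for chunk in iter(lambda: src.read(4096), b""):
        dst.write(chunk)


src = io.BytesIO("name: žaba\n".encode("utf-8") + b"\xff\xfe raw payload")
dst = io.BytesIO()
copy_with_upper_header(src, dst)
result = dst.getvalue()
```

Only the header pays for decoding and re-encoding; the payload here is deliberately not valid UTF-8 and is still copied intact.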


Glenn


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Daniel Holth
On Thu, Jun 5, 2014 at 11:59 AM, Paul Moore  wrote:
> On 5 June 2014 14:15, Nick Coghlan  wrote:
>> As I've said before in other contexts, find me Windows, Mac OS X and
>> JVM developers, or educators and scientists that are as concerned by
>> the text model changes as folks that are primarily focused on Linux
>> system (including network) programming, and I'll be more willing to
>> concede the point.
>
> There is once again a strong selection bias in this discussion, by its
> very nature. People who like the new model don't have anything to
> complain about, and so are not heard.
>
> Just to support Nick's point, I for one find the Python 3 text model a
> huge benefit, both in practical terms of making my programs more
> robust, and educationally, as I have a far better understanding of
> encodings and their issues than I ever did under Python 2. Whenever a
> discussion like this occurs, I find it hard not to resent the people
> arguing that the new model should be taken away from me and replaced
> with a form of the old error-prone (for me) approach - as if it was in
> my best interests.
>
> Internal details don't bother me - using UTF8 and having indexing be
> potentially O(N) is of little relevance. But make me work with a
> string type that *doesn't* abstract a string as a sequence of Unicode
> code points and I'll get very upset.

Once you get past whether str + bytes throws an exception which seems
to be the divide most people focus on, you can discover new things
like dance-encoded strings, bytes decoded using an incorrect encoding
intended to be transcoded into the correct encoding later, surrogates
that work perfectly until .encode(), str(bytes), APIs that disagree
with you about whether the result should be str or bytes, APIs that
return either string or bytes depending on their initializers and so
on. Unicode can still be complicated in Python 3 independent of any
judgement about whether it is worse, better, or different than Python
2.
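Two of the surprises listed above are easy to reproduce. The following is a small Python 3 sketch; the byte string is an arbitrary example and nothing here comes from the original post:

```python
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# surrogateescape smuggles the undecodable byte through as a lone surrogate (U+DCE9).
text = raw.decode("utf-8", errors="surrogateescape")
looks_fine = text.startswith("caf") and len(text) == 4

# The string behaves normally until a strict encode is attempted.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

# str() on bytes without an encoding returns the repr, not the decoded text.
as_repr = str(b"abc")
```

Running Python with the -b option turns the last case into a BytesWarning, which helps in finding such calls.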


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Moore
On 5 June 2014 14:15, Nick Coghlan  wrote:
> As I've said before in other contexts, find me Windows, Mac OS X and
> JVM developers, or educators and scientists that are as concerned by
> the text model changes as folks that are primarily focused on Linux
> system (including network) programming, and I'll be more willing to
> concede the point.

There is once again a strong selection bias in this discussion, by its
very nature. People who like the new model don't have anything to
complain about, and so are not heard.

Just to support Nick's point, I for one find the Python 3 text model a
huge benefit, both in practical terms of making my programs more
robust, and educationally, as I have a far better understanding of
encodings and their issues than I ever did under Python 2. Whenever a
discussion like this occurs, I find it hard not to resent the people
arguing that the new model should be taken away from me and replaced
with a form of the old error-prone (for me) approach - as if it was in
my best interests.

Internal details don't bother me - using UTF8 and having indexing be
potentially O(N) is of little relevance. But make me work with a
string type that *doesn't* abstract a string as a sequence of Unicode
code points and I'll get very upset.

Paul


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Steven D'Aprano
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal 
> representation of Unicode strings. Micropython is aimed at embedded 
> devices, and so minimizing memory use is important, possibly even 
> more important than performance.
[...]

Wow! I'm amazed at the response here, since I expected it would have 
received a fairly brief "Yes" or "No" response, not this long thread. 
Here is a summary (as best as I am able) of a few points which I think 
are important:

(1) I asked if it would be okay for MicroPython to *optionally* use 
nominally Unicode strings limited to ASCII. Pretty much the only 
response to this has been Guido saying "That would be a pretty lousy 
option", and since nobody has really defended the suggestion, I think we 
can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation 
even though it would lead to O(N) indexing operations instead of O(1). 
There's been some opposition to this, including Guido's:

    Then again the UTF-8 option would be pretty devastating 
    too for anything manipulating strings (especially since 
    many Python APIs are defined using indexes, e.g. the re 
    module).

but unless Guido wants to say different, I think the consensus is that 
a UTF-8 implementation is allowed, even at the cost of O(N) indexing 
operations. Trading time for memory -- assuming that it does save memory, 
which I think is an assumption and not proven -- is allowed.

(3) It seems to me that there's been a lot of theorizing about what 
implementation will be obviously more efficient. Folks, how about some 
benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited in my 
opinion to python-ideas, or even python-list, for ways to implement O(1) 
indexing on top of UTF-8. Some of them involve per-string mutable state 
(e.g. the last index seen), or complicated int sub-classes that need to 
know what string they come from. Remember your Zen please:

    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more 
efficient, but look forward to seeing the result of benchmarks. The 
rationale of internal UTF-8 is that the use of any other encoding 
internally will be inefficient since those strings will need to be 
transcoded to UTF-8 before they can be written or printed, so keeping 
them as UTF-8 in the first place saves the transcoding step. Well, yes, 
but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if 
the internal encoding of strings is more efficient than UTF-8, and most 
of them never need transcoding to UTF-8, a non-UTF-8 internal format 
might be a nett win. So I'm looking forward to seeing the results of 
µPy's experiments with it.

Thanks to all who have commented.



-- 
Steven



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:37, Paul Sokolovsky  wrote:
> On Thu, 5 Jun 2014 22:20:04 +1000
> Nick Coghlan  wrote:
>> problems caused by trusting the locale encoding to be correct, but the
>> startup code will need non-trivial changes for that to happen - the
>> C.UTF-8 locale may even become widespread before we get there).
>
> ... And until those golden times come, it would be nice if Python did
> not force its perfect world model, which unfortunately is not based on
> surrounding reality, and let users solve their encoding problems
> themselves - when they need, because again, one can go quite a long way
> without dealing with encodings at all. Whereas now Python3 forces users
> to deal with encoding almost universally, by forcing a particular one on
> all strings (which, again, doesn't correspond to the state of
> surrounding reality). I already hear the response that it's good that users
> are taught to deal with encoding, that it will make them write correct
> programs, but that's a bit far from the original aim of making it
> easy and pleasant to write "correct" programs. (And definitions of
> "correct" vary.)

As I've said before in other contexts, find me Windows, Mac OS X and
JVM developers, or educators and scientists that are as concerned by
the text model changes as folks that are primarily focused on Linux
system (including network) programming, and I'll be more willing to
concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text
encodings to be used at platform boundaries (using UTF-16, UTF-8 and
UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX)
says "well, it's configurable, but we won't provide a reliable
mechanism for finding out what the encoding is. So either guess as
best you can based on the info the OS *does* provide, assume UTF-8,
assume 'some ASCII compatible encoding', or don't do anything that
requires knowing the encoding of the data being exchanged with the OS,
like, say, displaying file names to users or accepting arbitrary text
as input, transforming it in a content aware fashion, and echoing it
back in a console application".
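For anyone wanting to see what their own system reports, a short sketch using only the standard library follows. The values printed vary from system to system, and on POSIX they are derived from the locale settings described above, which is exactly why they can be wrong:

```python
import locale
import sys

# What the interpreter believes about text encodings at the OS boundary.
reported = {
    "locale preferred encoding": locale.getpreferredencoding(False),
    "filesystem encoding": sys.getfilesystemencoding(),
    "stdout encoding": getattr(sys.stdout, "encoding", None),
}
for name, value in reported.items():
    print(f"{name}: {value}")
```

Running this once in a normal login shell and once with LANG=C set in the environment should show the kind of fallback behaviour discussed here on a typical Linux system.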

None of those options are perfectly good choices. 6(ish) years ago, we
chose the first option, because it has the best chance of working
properly on Linux systems that use ASCII incompatible encodings like
ShiftJIS, ISO-2022, and various other East Asian codecs. For normal
user space programming, Linux is pretty reliable when it comes to
ensuring the locale encoding is set to something sensible, but the
price we currently pay for that decision is interoperability issues
with things like daemons not receiving any configuration settings and
hence falling back to the POSIX locale, and ssh environment forwarding
moving a client's encoding settings to a session on a server with
different settings. I still consider it preferable to impose
inconveniences like that based on use case (situations where Linux
systems don't provide sensible encoding settings) than geographic
region (locales where ASCII incompatible encodings are likely to still
be in common use).

If I (or someone else) ever find the time to implement PEP 432 (or
something like it) to address some of the limitations of the
interpreter startup sequence that currently make it difficult to avoid
relying on the POSIX locale encoding on Linux, then we'll be in a
position to reassess that decision based on the increased adoption of
UTF-8 by Linux distributions in recent years. As the major community
Linux distributions complete the migration of their system utilities
to Python 3, we'll get to see if they decide it's better to make their
locale settings more reliable, or help make it easier for Python 3 to
ignore them when they're wrong.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:10, Stefan Krah  wrote:
> Paul Sokolovsky  wrote:
>> In this regard, I'm glad to participate in mind-resetting discussion.
>> So, let's reiterate - there's nothing like "the best", "the only right",
>> "the only correct", "righter than", "more correct than" in CPython's
>> implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
>> not arbitrary, but based on requirements, and these requirements match
>> CPython's (implied) usage model well enough. But among all possible
>> sets of requirements, CPython's requirements are no more valid than
>> other possible ones. And other sets of requirements fairly clearly lead to
>> a situation where the CPython implementation is rejected as not correct for
>> those requirements at all.
>
> Several core-devs have said that using UTF-8 for MicroPython is perfectly
> okay. I also think it's the right choice and I hope that you guys come up
> with a very efficient implementation.

Based on this discussion, I've also posted a draft patch aimed at
clarifying the relevant aspects of the data model section of the
language reference (http://bugs.python.org/issue21667).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 22:20:04 +1000
Nick Coghlan  wrote:

[]
> problems caused by trusting the locale encoding to be correct, but the
> startup code will need non-trivial changes for that to happen - the
> C.UTF-8 locale may even become widespread before we get there).

... And until those golden times come, it would be nice if Python did
not force its perfect world model, which unfortunately is not based on
surrounding reality, and let users solve their encoding problems
themselves - when they need, because again, one can go quite a long way
without dealing with encodings at all. Whereas now Python3 forces users
to deal with encoding almost universally, by forcing a particular one on
all strings (which, again, doesn't correspond to the state of
surrounding reality). I already hear the response that it's good that users
are taught to deal with encoding, that it will make them write correct
programs, but that's a bit far from the original aim of making it
easy and pleasant to write "correct" programs. (And definitions of
"correct" vary.)

But all that is just an opinion.

> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Tim Delaney
On 5 June 2014 22:01, Paul Sokolovsky  wrote:

>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes)
>

To me, an "encoding neutral string type" means roughly "characters are
atomic", and the best representation we have for a "character" is a Unicode
code point. Through any interface that provides "characters" each
individual "character" (code point) is indivisible.

To me, Python 3 has exactly an "encoding-neutral string type". It also has
a bytes type that is just that - bytes which can represent anything at
all. It might be the UTF-8 representation of a string, but you have the
freedom to manipulate it however you like - including making it no longer
valid UTF-8.
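A short Python 3 sketch of that difference, using an arbitrary example string (this is an illustration, not something from the thread):

```python
text = "Nikšić"
data = text.encode("utf-8")

# Slicing str works in code points, so every slice is still valid text.
str_prefix = text[:4]

# Slicing bytes works in bytes and may cut a multi-byte sequence in half.
byte_prefix = data[:4]  # ends with only the first byte of the two-byte "š"
try:
    byte_prefix.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```

The str slice can never split a character, whereas the bytes slice is free to, which is the freedom (and the risk) being described.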

Whilst I think O(1) indexing of strings is important, I don't think it's as
important as the property that "characters" are indivisible and would be
quite happy for MicroPython to use UTF-8 as the underlying string
representation (or some more clever thing, several ideas in this thread) so
long as:

1. It maintains a string type that presents code points as indivisible
elements;

2. The performance consequences of using UTF-8 are documented, as well as
any optimisations, tricks, etc that are used to overcome those consequences
(and what impact if any they would have if code written for MicroPython was
run in CPython).

Cheers,

Tim Delaney


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 22:01, Paul Sokolovsky  wrote:
>> Aside from
>> some of the POSIX locale handling issues on Linux, many of the
>> concerns are with the usability of bytes and bytearray, not with str -
>> that's why binary interpolation is coming back in 3.5, and there will
>> likely be other usability tweaks for those types as well.
>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes), while move unicode back to an explicit
> type to be used explicitly only when needed (bloated frameworks like
> Django can force users to it anyway, but that will be forcing on
> framework level, not on language level, against which people rebel.)
> People can dream, right?

If you don't model strings as arrays of code points, or at least
assume a particular universal encoding (like UTF-8), you have to give
up string concatenation in order to tolerate arbitrary encodings -
otherwise you end up with unintelligible data that nobody can decode
because it switches encodings without notice. That's a viable model if
your OS guarantees it (Mac OS X does, for example, so Python 3 assumes
UTF-8 for all OS interfaces there), but Linux currently has no such
guarantee - many runtimes just decide they don't care, and assume
UTF-8 anyway (Python 3 may even join them some day, due to the
problems caused by trusting the locale encoding to be correct, but the
startup code will need non-trivial changes for that to happen - the
C.UTF-8 locale may even become widespread before we get there).
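
A minimal sketch of the concatenation problem described above, assuming two byte strings that each hold the same character in a different encoding (the helper name try_decode is hypothetical):

```python
latin1_part = "\u00e9".encode("latin-1")   # b'\xe9'
utf8_part = "\u00e9".encode("utf-8")       # b'\xc3\xa9'
mixed = latin1_part + utf8_part            # switches encoding without notice


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(mixed, "utf-8")       # fails: 0xE9 is not followed by continuation bytes
as_latin1 = try_decode(mixed, "latin-1")   # "succeeds", but yields mojibake
```

Neither single encoding recovers the original two characters, which is the "unintelligible data" case.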

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stefan Krah
Paul Sokolovsky  wrote:
> In this regard, I'm glad to participate in mind-resetting discussion.
> So, let's reiterate - there's nothing like "the best", "the only right",
> "the only correct", "righter than", "more correct than" in CPython's
> implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
> not arbitrary, but based on requirements, and these requirements match
> CPython's (implied) usage model well enough. But among all possible
> sets of requirements, CPython's requirements are no more valid than
> other possible ones. And another set of requirements fairly clearly leads
> to a situation where the CPython implementation is rejected as not correct
> for those requirements at all.

Several core-devs have said that using UTF-8 for MicroPython is perfectly okay.
I also think it's the right choice and I hope that you guys come up with a very
efficient implementation.


Stefan Krah




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 21:43:16 +1000
Nick Coghlan  wrote:

> On 5 June 2014 21:25, Paul Sokolovsky  wrote:
> > Well, I understand the plan - hoping that people will "get over
> > this". And I'm personally happy to stay away from this "trolling",
> > but any discussion related to Unicode goes in circles and returns
> > to feeling that Unicode at the central role as put there by Python3
> > is misplaced.
> 
> Many of the challenges network programmers face in Python 3 are around
> binary data being more inconvenient to work with than it needs to be,
> not the fact we decentralised boundary code by offering a strict
> binary/text separation as the default mode of operation. 

Just to clarify - (many) other gentlemen and I (in that order, I'm not
taking a lead), don't call to go back to Python2 behavior with implicit
conversion between byte-oriented strings and Unicode, etc. They just
point out that perhaps Python3 went too far with Unicode cause by making
it the default string type. Strict separation is surely mostly good
thing (I can sigh that it leads to Java-like dichotomical bloat for all
I/O classes, but well, I was able to put up with that in MicroPython
already).

> Aside from
> some of the POSIX locale handling issues on Linux, many of the
> concerns are with the usability of bytes and bytearray, not with str -
> that's why binary interpolation is coming back in 3.5, and there will
> likely be other usability tweaks for those types as well.

All these changes are what let me dream on and speculate on
possibility that Python4 could offer an encoding-neutral string type
(which means based on bytes), while move unicode back to an explicit
type to be used explicitly only when needed (bloated frameworks like
Django can force users to it anyway, but that will be forcing on
framework level, not on language level, against which people rebel.)
People can dream, right?


Thanks,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 21:25, Paul Sokolovsky  wrote:
> Well, I understand the plan - hoping that people will "get over this".
> And I'm personally happy to stay away from this "trolling", but any
> discussion related to Unicode goes in circles and returns to feeling
> that Unicode at the central role as put there by Python3 is misplaced.

Many of the challenges network programmers face in Python 3 are around
binary data being more inconvenient to work with than it needs to be,
not the fact we decentralised boundary code by offering a strict
binary/text separation as the default mode of operation. Aside from
some of the POSIX locale handling issues on Linux, many of the
concerns are with the usability of bytes and bytearray, not with str -
that's why binary interpolation is coming back in 3.5, and there will
likely be other usability tweaks for those types as well.
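
For reference, the binary interpolation mentioned above is %-formatting on bytes objects (PEP 461, available from Python 3.5). A small sketch, with a made-up header as the example payload:

```python
# %-formatting applied directly to bytes; no str round trip is involved.
header = b"%s: %d\r\n" % (b"Content-Length", 42)
```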

More on that at
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Nick Coghlan
On 5 June 2014 17:54, Stephen J. Turnbull  wrote:
> What matters to you is that str (unicode) is an opaque type -- there
> is no specification of the internal representation in the language
> reference, and in fact several different ones coexist happily across
> existing Python implementations -- and you're free to use a UTF-8
> implementation if that suits the applications you expect for
> MicroPython.

However, as others have noted in the thread, the critical thing is to
*not* let that internal implementation detail leak into the Python
level string behaviour. That's what happened with narrow builds of
Python 2 and pre-PEP-393 releases of Python 3 (effectively using
UTF-16 internally), and it was the cause of a sufficiently large
number of bugs that the Linux distributions tend to instead accept the
memory cost of using wide builds (4 bytes for all code points) for
affected versions.

Preserving the "the Python 3 str type is an immutable array of code
points" semantics matters significantly more than whether or not
indexing by code point is O(1). The various caching tricks suggested
in this thread (especially "leading ASCII characters", "trailing ASCII
characters" and "position & index of last lookup") could keep the
typical lookup performance well below O(N).
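
The leak described above can be shown with a short sketch. It assumes Python 3.3 or later (PEP 393), where str is indexed by code point, and uses the UTF-16 encoding of the same text to stand in for what a narrow build exposed:

```python
import struct

s = "a\U0001F600b"                  # 'a', one astral code point, 'b'

# PEP 393 semantics: three code points, each one atomic.
code_point_count = len(s)           # 3
middle = s[1]                       # the whole astral character

# A UTF-16 representation needs four 16-bit units for the same text;
# a narrow build exposed these units, so index 1 and 2 were surrogate halves.
raw = s.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(raw) // 2), raw)
```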

> PEP 393 exists, of course, and specifies the current internal
> representation for CPython 3.  But I don't see anything in it that
> suggests it's mandated for any other implementation.

CPython is constrained by C API compatibility requirements, as well as
implementation constraints due to the amount of internal code that
would need to be rewritten to handle a variable width encoding as the
canonical internal representation (since the problems with Python 2
narrow builds mean we already know variable width encodings aren't
handled correctly by the current code).

Implementations that share code with CPython, or try to mimic the C
API especially closely, may face similar restrictions. Outside that, I
think we're better off if alternative implementations are free to
experiment with different internal string representations.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Thu, 05 Jun 2014 16:54:11 +0900
"Stephen J. Turnbull"  wrote:

> Paul Sokolovsky writes:
> 
>  > Please put that in perspective when alarming over O(1) indexing of
>  > inherently problematic niche datatype. (Again, it's not my or
>  > MicroPython's fault that it was forced as standard string type.
>  > Maybe if CPython seriously considered now-standard UTF-8 encoding,
>  > results of what is "str" type might be different. But CPython has
>  > gigabytes of heap to spare, and for MicroPython, every half-bit is
>  > precious).
> 
> Would you please stop trolling?  The reasons for adopting Unicode as a
> separate data type were good and sufficient in 2000, and they remain

If it was kept at "separate data type" bay, there wouldn't be any
problem. But it was made "one and only string type", and all strife
started then.

And there is going to be "trolling" as long as Python developers and
decision-makers ignore (troll?) the outcry from the community (again, I
was surprised and not surprised to see that ~50% of the traffic on
python-list touches Unicode issues).

Well, I understand the plan - hoping that people will "get over this".
And I'm personally happy to stay away from this "trolling", but any
discussion related to Unicode goes in circles and returns to feeling
that Unicode at the central role as put there by Python3 is misplaced.

Then for me, it's just a matter of job security and personal future - I
don't want to spend rest of my days as a javascript (or other idiotic
language) monkey. And the message is clear in the air
(http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ and
elsewhere): if Python strings are now in Go, and in Python itself are
now Java strings, all causing strife, why not go cruising around and see
what's up, instead of staying strong, and growing bigger, community.

> so today, even if you have been fortunate enough not to burn yourself
> on character-byte conflation yet.
> 
> What matters to you is that str (unicode) is an opaque type -- there
> is no specification of the internal representation in the language
> reference, and in fact several different ones coexist happily across
> existing Python implementations -- and you're free to use a UTF-8
> implementation if that suits the applications you expect for
> MicroPython.
> 
> PEP 393 exists, of course, and specifies the current internal
> representation for CPython 3.  But I don't see anything in it that
> suggests it's mandated for any other implementation.

I knew all this before very well. What's strange is that other
developers don't know, or treat seriously, all of the above. That's why
gentleman who kindly was interested in adding Unicode support to
MicroPython started with the idea of dragging in CPython implementation.
And the only effect persuasion that it's not necessarily the best
solution had, was that he started to feel that he's being manipulated
into writing something ugly, instead of the bright idea he had.

That's why another gentleman reduces it to: "O(1) on string indexing or
not a Python!". 

And that's why another gentleman, who agrees to UTF-8 arguments, still
gives an excuse
(https://mail.python.org/pipermail/python-dev/2014-June/134727.html):
"In this context, while a fixed-width encoding may be the correct
choice it would also likely be the wrong choice."


In this regard, I'm glad to participate in mind-resetting discussion.
So, let's reiterate - there's nothing like "the best", "the only right",
"the only correct", "righter than", "more correct than" in CPython's
implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
not arbitrary, but based on requirements, and these requirements match
CPython's (implied) usage model well enough. But among all possible
sets of requirements, CPython's requirements are no more valid than
other possible ones. And another set of requirements fairly clearly leads
to a situation where the CPython implementation is rejected as not correct
for those requirements at all.



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy  wrote:

> On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
> 
> > "Well" is subjective (or should be defined formally based on the
> > requirements). With my MicroPython hat on, an implementation which
> > receives a string, transcodes it, leading to bigger size, just to
> > immediately transcode back and send out - is awful, environment
> > unfriendly implementation ;-).
> 
> I am not sure what you concretely mean by 'receive a string', but I 

I (surely) mean an abstract input (as an Input/Output aka I/O)
operation.

> think you are again batting at a strawman. If you mean 'read from a 
> file', and all you want to do is read bytes from and write bytes to 
> external 'files', then there is obviously no need to transcode and 
> neither Python 2 or 3 make you do so.

But most files, network protocols are text-based, and I (and many other
people) don't want to artificially use "binary data" type for them,
with all attached funny things, like "b" prefix. And then Python2
indeed doesn't transcode anything, and Python3 does, without being
asked, and for no good purpose, because in most cases, Input data will
be Output as-is (maybe in byte-boundary-split chunks).

So, it all goes in rounds - ignoring the forced-Unicode problem (after a
week of subscription to python-list, half of traffic there appear to be
dedicated to Unicode-related flames) on python-dev behalf is not
going to help (Python community).

[]



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Serhiy Storchaka writes:

 > Yes, I remember. I think that a hybrid FSR-UTF16 (like FSR, but with
 > UTF-16 used instead of UCS4) is the better choice for CPython. I suppose
 > that with emoticons and other icon characters spreading over the next 5
 > or 10 years, even English text will often contain astral characters. And
 > spending 4 bytes per character when a long text contains one astral
 > character looks too wasteful.

Why use something that complex if you don't have to?  For the use case
you have in mind, just map them into private space.  If you really
want to be aggressive, use surrogate space, too (anything that cares
what a scalar represents should be trapping on non-scalars, catch that
exception and look up the char -- dangerous, though, because such
exceptions are probably all over the place).





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

05.06.14 05:25, Terry Reedy wrote:
> I mentioned it as an alternative during the '393 discussion. I more than
> half agree that the FSR is the better choice for CPython, which had no
> particular attachment to UTF-16 in the way that I think Jython, for
> instance, does.


Yes, I remember. I think that a hybrid FSR-UTF16 (like FSR, but with
UTF-16 used instead of UCS4) is the better choice for CPython. I suppose
that with emoticons and other icon characters spreading over the next 5
or 10 years, even English text will often contain astral characters. And
spending 4 bytes per character when a long text contains one astral
character looks too wasteful.
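
The cost being discussed can be observed with a rough, CPython-specific sketch (sys.getsizeof reports implementation details and the exact numbers vary between versions; the 1000-character length is an arbitrary choice):

```python
import sys

ascii_text = "a" * 1000
one_astral = "a" * 999 + "\U0001F600"   # same length, one astral character

# Under PEP 393 the whole string is stored at the width of its widest
# code point: 1 byte per character here, 4 bytes per character below.
narrow = sys.getsizeof(ascii_text)
wide = sys.getsizeof(one_astral)
```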





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Paul Sokolovsky writes:

 > Please put that in perspective when alarming over O(1) indexing of
 > inherently problematic niche datatype. (Again, it's not my or
 > MicroPython's fault that it was forced as standard string type. Maybe
 > if CPython seriously considered now-standard UTF-8 encoding, results
 > of what is "str" type might be different. But CPython has gigabytes of
 > heap to spare, and for MicroPython, every half-bit is precious).

Would you please stop trolling?  The reasons for adopting Unicode as a
separate data type were good and sufficient in 2000, and they remain
so today, even if you have been fortunate enough not to burn yourself
on character-byte conflation yet.

What matters to you is that str (unicode) is an opaque type -- there
is no specification of the internal representation in the language
reference, and in fact several different ones coexist happily across
existing Python implementations -- and you're free to use a UTF-8
implementation if that suits the applications you expect for
MicroPython.

PEP 393 exists, of course, and specifies the current internal
representation for CPython 3.  But I don't see anything in it that
suggests it's mandated for any other implementation.



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

04.06.14 23:50, Glenn Linderman wrote:
> 3) (Most space efficient) One cached entry, that caches the last
> codepoint/byte position referenced. UTF-8 is able to be traversed in
> either direction, so "next/previous" codepoint access would be
> relatively fast (and such are very common operations, even when indexing
> notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)


Great idea! It should cover most real-world cases. Note that we can scan 
UTF-8 string left-to-right and right-to-left.
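
A minimal sketch of the single-entry cache idea, written in Python for illustration only. The class name CachedUtf8 and its attributes are hypothetical and say nothing about how MicroPython implements strings; the point is that sequential access in either direction only walks from the last position used.

```python
class CachedUtf8:
    """Hypothetical UTF-8 backed string with a one-entry position cache."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._len = len(text)
        self._cp = 0     # code point index of the cached position
        self._off = 0    # byte offset where that code point starts

    def __len__(self):
        return self._len

    def _seek(self, index):
        data, cp, off = self._data, self._cp, self._off
        while cp < index:                       # walk forward from the cache
            off += 1
            while off < len(data) and (data[off] & 0xC0) == 0x80:
                off += 1                        # skip continuation bytes
            cp += 1
        while cp > index:                       # walk backward from the cache
            off -= 1
            while (data[off] & 0xC0) == 0x80:
                off -= 1
            cp -= 1
        self._cp, self._off = cp, off           # remember for the next lookup
        return off

    def __getitem__(self, index):
        if not 0 <= index < self._len:
            raise IndexError("string index out of range")
        start = self._seek(index)
        end = start + 1
        while end < len(self._data) and (self._data[end] & 0xC0) == 0x80:
            end += 1
        return self._data[start:end].decode("utf-8")
```

With this, a loop such as "for ix in range(len(s)): func(s[ix])" advances one code point per lookup instead of rescanning from the start.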





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

05.06.14 03:03, Greg Ewing wrote:
> Serhiy Storchaka wrote:
>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't
>> use iterators. They use indices, str.find and/or regular expressions.
>> Common use case is quickly find substring starting from current
>> position using str.find or re.search, process found token, advance
>> position and repeat.
>
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.


Of course. But _existing_ Python interfaces all work with indices. And 
it is too late to change this; that train left 20 years ago.


There is no need for yet another way to do string operations. One obvious 
way is enough.





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Serhiy Storchaka

05.06.14 03:08, Greg Ewing wrote:
> Serhiy Storchaka wrote:
>> A language which doesn't support O(1) indexing is not Python, it is
>> only Python-like language.
>
> That's debatable, but even if it's true, I don't think
> there's anything wrong with MicroPython being only a
> "Python-like language". As has been pointed out, fitting
> Python onto a small device is always going to necessitate
> some compromises.


Agreed, there's nothing wrong. I think that even limiting integers to 32 
or 64 bits is an acceptable compromise for a Python-like language targeted 
at small devices. But programming in such a language requires different 
techniques and habits.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-05 Thread Stephen J. Turnbull
Glenn Linderman writes:

 > 3) (Most space efficient) One cached entry, that caches the last 
 > codepoint/byte position referenced. UTF-8 is able to be traversed in 
 > either direction, so "next/previous" codepoint access would be 
 > relatively fast (and such are very common operations, even when indexing 
 > notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

Been there, tried that (Emacsen).  Either it's a YAGNI (moving forward
or backward over UTF-8 by characters short distances is plenty fast,
especially if you've got a lot of ASCII you can move by words for
somewhat longer distances), or it's not good enough.  There *may* be a
sweet spot, but it's definitely smaller than the one on Sharapova's
racket.

 > 4) (Fixed size caches)  N entries, one for the last codepoint, and 
 > others at Codepoint_Length/N intervals.  N could be tunable.

To achieve space saving, cache has to be quite small, and the bigger
your integers, the smaller it gets.  A naive implementation on 64-bit
machine would give you 16 bytes/cache entry.  Using a non-native size
will be a space win, but needs care in implementation.  Initializing
the cache is very expensive for small strings, so you need conditional
and maybe lazy initialization (for large strings).
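
A rough sketch of a fixed-interval cache over UTF-8 data. It stores one byte offset every `stride` code points rather than exactly N entries, which is an assumption made here for simplicity; the function names are hypothetical.

```python
def build_checkpoints(data, stride):
    """Byte offsets of code points 0, stride, 2*stride, ... in UTF-8 data."""
    checkpoints = []
    cp = 0
    for off, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # not a continuation byte: a code point starts here
            if cp % stride == 0:
                checkpoints.append(off)
            cp += 1
    return checkpoints


def byte_offset(data, checkpoints, stride, index):
    """Byte offset of code point `index`, scanning at most stride - 1 code points."""
    off = checkpoints[index // stride]
    for _ in range(index % stride):
        off += 1
        while off < len(data) and (data[off] & 0xC0) == 0x80:
            off += 1
    return off
```

The space trade-off raised above shows up directly: a smaller stride means faster lookups but more stored offsets per string.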

By the way, there's also

10) Keep counts of the leading and trailing number of ASCII
(one-octet) characters.  This is often a *huge* win; it's quite
common to encounter documents where size - lc - tc = 2 (ie,
there's only one two-octet character in the document).

11) Keep a list (or tree) of most-recently-accessed positions.
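
Option 10 can be sketched as follows, assuming the string is held as UTF-8 bytes together with its code point count. The names ascii_margins and byte_offset are hypothetical; lc and tc follow the notation used above.

```python
def ascii_margins(data):
    """Return (lc, tc): counts of leading and trailing one-octet characters."""
    lc = 0
    while lc < len(data) and data[lc] < 0x80:
        lc += 1
    tc = 0
    while tc < len(data) - lc and data[len(data) - 1 - tc] < 0x80:
        tc += 1
    return lc, tc


def byte_offset(data, nchars, lc, tc, index):
    """Byte offset of code point `index` in UTF-8 `data` holding `nchars` code points."""
    if index < lc:                        # inside the leading ASCII run: O(1)
        return index
    if index >= nchars - tc:              # inside the trailing ASCII run: O(1)
        return len(data) - (nchars - index)
    off, cp = lc, lc                      # otherwise scan only the middle part
    while cp < index:
        off += 1
        while (data[off] & 0xC0) == 0x80:
            off += 1
        cp += 1
    return off
```

For a line of source code with a single accented letter, only that one character falls outside the two O(1) cases.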

Despite my negative experience with multibyte encodings in Emacsen,
I'm persuaded by the arguments that there probably aren't all that
many places in core Python where indexing is used in an essential way,
so MicroPython itself can probably optimize those "behind the
scenes".  Application programmers in the embedded context may be
expected to deal with the need to avoid random access algorithms
and use iterators and generators to accomplish most tasks.






Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy

On 6/4/2014 6:54 PM, Serhiy Storchaka wrote:
> 05.06.14 00:21, Terry Reedy wrote:
>> On 6/4/2014 3:41 AM, Jeff Allen wrote:
>>> Jython uses UTF-16 internally -- probably the only sensible choice in a
>>> Python that can call Java. Indexing is O(N), fundamentally. By
>>> "fundamentally", I mean for those strings that have not yet noticed that
>>> they contain no supplementary (>0xFFFF) characters.
>>
>> Indexing can be made O(log(k)) where k is the number of astral chars,
>> and is usually small.
>
> I like your idea and think it would be great if Jython will implement
> it.


A proof of concept implementation in Python that handles both indexing 
and slicing is on the tracker. It is simpler than I initially expected.
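
The following is not the tracker implementation, only a minimal sketch of the general idea under stated assumptions: the text is stored as UTF-16 code units, contains no lone surrogates, and a sorted list of the code point indices of astral characters is kept alongside it. The class name Utf16Text is hypothetical.

```python
from bisect import bisect_left


class Utf16Text:
    """Hypothetical UTF-16 backed string with O(log k) code point indexing."""

    def __init__(self, text):
        self._units = text.encode("utf-16-le")
        self._len = len(text)
        # Sorted code point indices of characters that need a surrogate pair.
        self._astral = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

    def __len__(self):
        return self._len

    def __getitem__(self, index):
        if not 0 <= index < self._len:
            raise IndexError("string index out of range")
        # Every astral character before `index` shifts the unit offset by one.
        k = bisect_left(self._astral, index)
        unit = index + k
        is_astral = k < len(self._astral) and self._astral[k] == index
        width = 2 if is_astral else 1
        return self._units[2 * unit:2 * (unit + width)].decode("utf-16-le")
```

When the list is empty, which is the common case, the lookup reduces to plain array indexing.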


> Unfortunately it is too late to do this in CPython.

I mentioned it as an alternative during the '393 discussion. I more than 
half agree that the FSR is the better choice for CPython, which had no 
particular attachment to UTF-16 in the way that I think Jython, for 
instance, does.


--
Terry Jan Reedy





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy

On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
> "Well" is subjective (or should be defined formally based on the
> requirements). With my MicroPython hat on, an implementation which
> receives a string, transcodes it, leading to bigger size, just to
> immediately transcode back and send out - is awful, environment
> unfriendly implementation ;-).


I am not sure what you concretely mean by 'receive a string', but I 
think you are again batting at a strawman. If you mean 'read from a 
file', and all you want to do is read bytes from and write bytes to 
external 'files', then there is obviously no need to transcode and 
neither Python 2 or 3 make you do so.
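
A small sketch of that point, using in-memory streams from the io module in place of real files (the payload is a made-up example): binary streams pass bytes through untouched, and decoding only happens when a text layer is explicitly requested.

```python
import io

payload = "GET /caf\u00e9 HTTP/1.1\r\n".encode("utf-8")

# Binary mode: bytes in, bytes out, nothing is decoded or re-encoded.
source = io.BytesIO(payload)
sink = io.BytesIO()
sink.write(source.read())
copied = sink.getvalue()

# Text mode: decoding happens because a text wrapper was asked for.
# newline="" disables newline translation so the round trip is exact.
reader = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8", newline="")
line = reader.read()
```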


--
Terry Jan Reedy



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Thu, 05 Jun 2014 12:08:21 +1200
Greg Ewing  wrote:

> Serhiy Storchaka wrote:
> > A language which doesn't support O(1) indexing is not Python, it is
> > only Python-like language.
> 
> That's debatable, but even if it's true, I don't think
> there's anything wrong with MicroPython being only a
> "Python-like language". As has been pointed out, fitting
> Python onto a small device is always going to necessitate
> some compromises.

Thanks. I mentioned in another mail that we are trying to develop exactly
a minimalistic, but still Python, implementation, not a Python-like language.

What is "Python-like" for me? The other most well-known and mature (as
in "started quite some time ago") "small Python" implementation is
PyMite aka Python-on-a-chip
https://code.google.com/p/python-on-a-chip/ . It implements good deal
of Python2 language. It doesn't implement exception handling
(try/except). Can a Python be without exception handling? For me,
the clear answer is "no".

Please put that in perspective when alarming over O(1) indexing of
inherently problematic niche datatype. (Again, it's not my or
MicroPython's fault that it was forced as standard string type. Maybe
if CPython seriously considered now-standard UTF-8 encoding, results
of what is "str" type might be different. But CPython has gigabytes of
heap to spare, and for MicroPython, every half-bit is precious).


> 
> -- 
> Greg



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico
On Thu, Jun 5, 2014 at 10:03 AM, Greg Ewing  wrote:
> StringPositions could support the following operations:
>
>StringPosition + int --> StringPosition
>StringPosition - int --> StringPosition
>StringPosition - StringPosition --> int
>
> These would be computed by counting characters forwards
> or backwards in the string, which would be slower than
> int arithmetic but still faster than counting from the
> beginning of the string every time.

The SP would have to keep track of which string it's associated with,
which might make for some surprising retentions of large strings.
(Imagine returning what you think is an integer, but actually turns
out to be a SP, and you're trying to work out why your program is
eating up so much more memory than it should. This int-like object is
so much more than an int.)
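
A rough sketch of the StringPosition idea over a UTF-8 buffer, to make the retention concern concrete. Everything here is hypothetical (the proposal above does not specify an implementation): each position holds a strong reference to the whole buffer, so a stored position keeps the entire string alive.

```python
class StringPosition:
    """Hypothetical opaque position inside one UTF-8 buffer."""

    __slots__ = ("data", "offset")

    def __init__(self, data, offset=0):
        self.data = data        # strong reference: keeps the whole buffer alive
        self.offset = offset    # byte offset of a code point boundary

    def __add__(self, n):
        if n < 0:
            return self - (-n)
        off = self.offset
        for _ in range(n):
            if off >= len(self.data):
                break           # clamp at the end of the buffer
            off += 1
            while off < len(self.data) and (self.data[off] & 0xC0) == 0x80:
                off += 1
        return StringPosition(self.data, off)

    def __sub__(self, other):
        if isinstance(other, StringPosition):
            lo, hi = sorted((self.offset, other.offset))
            # Count the code points that start between the two offsets.
            count = sum((b & 0xC0) != 0x80 for b in self.data[lo:hi])
            return count if self.offset >= other.offset else -count
        if other < 0:
            return self + (-other)
        off = self.offset
        for _ in range(other):
            if off <= 0:
                break           # clamp at the start of the buffer
            off -= 1
            while off > 0 and (self.data[off] & 0xC0) == 0x80:
                off -= 1
        return StringPosition(self.data, off)
```

All three operations count characters from an existing position rather than from the start of the string, at the price of the `data` reference carried by every position object.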

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Thu, 05 Jun 2014 12:03:17 +1200
Greg Ewing  wrote:

> Serhiy Storchaka wrote:
> > html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> > don't use iterators. They use indices, str.find and/or regular
> > expressions. Common use case is quickly find substring starting
> > from current position using str.find or re.search, process found
> > token, advance position and repeat.
> 
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.
> 
> Instead of an integer, str.find() etc. could return a
> StringPosition, 

That's braver than what I had in mind, but it definitely shows what
alternative implementations have in store to fight back if some
performance problems are actually detected. My own thoughts were, for
example, that a response to people who (quoting) "slice strings for a
living" could be some form of "extended slicing" like str[(0, 4, 6, 8, 15)].

But I really think that providing iterator interface for common string
operations would cover most of real-world cases, and will be actually
beneficial for Python language in general.

> 
> -- 
> Greg


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing

Glenn Linderman wrote:
> so algorithms that walk two strings at a time cannot use the same
> StringPosition to do so... yep, this is quite divergent from CPython and
> Python.


They can, it's just that at most one of the indexing
operations would be fast; the StringPosition would
devolve into an int for the other one.

Such an algorithm would be of dubious correctness
anyway, since as you pointed out, codepoints and
characters are not quite the same thing. A codepoint
index in one string doesn't necessarily count off
the same number of characters in another string.
So to be safe, you should really walk each string
individually.
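
The codepoint/character distinction can be illustrated with the standard unicodedata module; the sample word is only an example. The same visible text has different code point counts depending on normalization, so one integer index refers to different characters in each form.

```python
import unicodedata

composed = "caf\u00e9"                                 # precomposed e-acute
decomposed = unicodedata.normalize("NFD", composed)    # 'e' + combining acute accent

composed_len = len(composed)        # 4 code points
decomposed_len = len(decomposed)    # 5 code points

# Index 3 is the accented letter in one string and a bare 'e' in the other.
at_three = (composed[3], decomposed[3])
```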

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing

Glenn Linderman wrote:



For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.


Probably, but what I said is true either way.


This starts to diverge from Python codepoint indexing via integers.


That's true, although most programs would have to go
out of their way to tell the difference, especially if
StringPosition were a subclass of int.

I agree that caching indexes would be more transparent,
though.

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman

On 6/4/2014 5:08 PM, Glenn Linderman wrote:

On 6/4/2014 5:03 PM, Greg Ewing wrote:

Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize 
don't use iterators. They use indices, str.find and/or regular 
expressions. Common use case is quickly find substring starting from 
current position using str.find or re.search, process found token, 
advance position and repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.



Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

This starts to diverge from Python codepoint indexing via integers. 
Calculating or caching the codepoint index to byte offset as part of 
the str implementation stays compatible with Python. Introducing 
StringPosition makes a Python-like language. Or so it seems to me.


Another thought is that StringPosition only works (quickly, at least), 
as you point out, for the string that they were derived from... so 
algorithms that walk two strings at a time cannot use the same 
StringPosition to do so... yep, this is quite divergent from CPython and 
Python.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman

On 6/4/2014 5:03 PM, Greg Ewing wrote:

Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize 
don't use iterators. They use indices, str.find and/or regular 
expressions. Common use case is quickly find substring starting from 
current position using str.find or re.search, process found token, 
advance position and repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.


I think you meant codepoint index, rather than character index.



Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

This starts to diverge from Python codepoint indexing via integers. 
Calculating or caching the codepoint index to byte offset as part of the 
str implementation stays compatible with Python. Introducing 
StringPosition makes a Python-like language. Or so it seems to me.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing

Serhiy Storchaka wrote:
A language which doesn't support O(1) indexing is not Python, it is only 
Python-like language.


That's debatable, but even if it's true, I don't think
there's anything wrong with MicroPython being only a
"Python-like language". As has been pointed out, fitting
Python onto a small device is always going to necessitate
some compromises.

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Greg Ewing

Serhiy Storchaka wrote:
html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't 
use iterators. They use indices, str.find and/or regular expressions. 
Common use case is quickly find substring starting from current position 
using str.find or re.search, process found token, advance position and 
repeat.


For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.

Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

   StringPosition + int --> StringPosition
   StringPosition - int --> StringPosition
   StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.
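
A minimal sketch of how such a StringPosition might look, assuming the
string is stored as a UTF-8 bytes object. Everything here is hypothetical
and only illustrates the idea: the class layout, the module-level find()
helper, and the byte_offset attribute are not part of any existing
implementation.

```python
def _is_continuation(byte):
    """True for UTF-8 continuation bytes (0b10xxxxxx)."""
    return byte & 0xC0 == 0x80


class StringPosition(int):
    """A codepoint index that also remembers its byte offset in one buffer."""

    def __new__(cls, index, data, byte_offset):
        self = super().__new__(cls, index)
        self.data = data                # the UTF-8 buffer this position belongs to
        self.byte_offset = byte_offset  # offset of the codepoint's first byte
        return self

    def _moved(self, n):
        # Walk n codepoints forwards or backwards from the known offset
        # instead of counting from the beginning of the buffer.
        data, off = self.data, self.byte_offset
        step = 1 if n >= 0 else -1
        for _ in range(abs(n)):
            off += step
            while 0 < off < len(data) and _is_continuation(data[off]):
                off += step
            if not 0 <= off <= len(data):
                raise IndexError("position out of range")
        return StringPosition(int(self) + n, data, off)

    def __add__(self, n):
        if isinstance(n, StringPosition):
            return int(self) + int(n)   # not part of the proposal: plain int
        return self._moved(n)

    def __sub__(self, other):
        if isinstance(other, StringPosition):
            return int(self) - int(other)   # StringPosition - StringPosition --> int
        return self._moved(-other)


def find(data, needle, start=None):
    """Like str.find, but over UTF-8 bytes and returning a StringPosition."""
    start_off = start.byte_offset if start is not None else 0
    start_idx = int(start) if start is not None else 0
    off = data.find(needle.encode("utf-8"), start_off)
    if off < 0:
        return -1
    # Count codepoints only between the starting position and the hit.
    skipped = sum(1 for b in data[start_off:off] if not _is_continuation(b))
    return StringPosition(start_idx + skipped, data, off)
```

Used in the find/process/advance loop described above, each search resumes
from the previous byte offset, while the value still compares and slices
like the int that str.find would have returned.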

--
Greg


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Eric Snow
On Wed, Jun 4, 2014 at 5:11 PM, Paul Sokolovsky  wrote:
> On Wed, 4 Jun 2014 16:12:23 -0600
> Eric Snow  wrote:
>> Actually, there is a "formal, implementation-independent language
>> spec":
>>
>> https://docs.python.org/3/reference/
>
> Opening that link in browser, pressing Ctrl+F and pasting your quote
> gives zero hits, so it's not exactly what you claim it to be. It's also
> pretty far from being formal (unambiguous, covering all choices, etc.)
> and comprehensive. Also, please point me at "conformance" section.
>
> That said, all of us Pythoneers treat it as the best formal reference
> available, no news here.

It's not just the best formal reference.  It's the official
specification.  I agree it is not so "formal" as other language
specifications and it does not enumerate every facet of the language.
However, underspecified parts are worth improving (as we've done with
the import system portion in the last few years).  Incidentally, the
efforts of other Python implementors have often resulted in such
improvements to the language reference.  Those improvements typically
come as a result of questions to this very list. :)  That's
essentially what this email thread is!

-eric


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

05.06.14 00:21, Terry Reedy написав(ла):

On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0xFFFF) characters.


Indexing can be made O(log(k)) where k is the number of astral chars,
and is usually small.


I like your idea and think it would be great if Jython implemented
it. Unfortunately it is too late to do this in CPython.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 4 Jun 2014 16:12:23 -0600
Eric Snow  wrote:

> On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky 
> wrote:
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
> 
> Actually, there is a "formal, implementation-independent language
> spec":
> 
> https://docs.python.org/3/reference/

Opening that link in browser, pressing Ctrl+F and pasting your quote
gives zero hits, so it's not exactly what you claim it to be. It's also
pretty far from being formal (unambiguous, covering all choices, etc.)
and comprehensive. Also, please point me at "conformance" section.

That said, all of us Pythoneers treat it as the best formal reference
available, no news here.

> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
> 
> To a large extent the test suite in
> http://hg.python.org/cpython/file/default/Lib/test effectively
> validates (full) compliance with the corresponding release (change
> "default" to the release branch of your choice).  With that goal, no
> small effort has been made to mark implementation-specific tests as
> such.  So uPy could consider using the test suite (and explicitly skip
> the tests for features that uPy doesn't support).

That's exactly what we do, per the previous paragraph. And we face a
lot of questionable tests, just like you say. Shameless plug: if anyone
is interested in running existing code on MicroPython, please help us
with the CPython testsuite! ;-)

> 
> -eric



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico
On Thu, Jun 5, 2014 at 8:52 AM, Paul Sokolovsky  wrote:
> "Well" is subjective (or should be defined formally based on the
> requirements). With my MicroPython hat on, an implementation which
> receives a string, transcodes it, leading to bigger size, just to
> immediately transcode back and send out - is awful, environment
> unfriendly implementation ;-).

Be careful of confusing correctness and performance, though. The
transcoding you describe is inefficient, but (presumably) correct;
something that's fast but wrong is straight-up buggy. You can always
fix inefficiency in a later release, but buggy behaviour sometimes is
relied on (which is why ECMAScript still exposes UTF-16 to scripts,
and why Windows window messages have a WPARAM and an LPARAM, and why
Python's threading module has duplicate names for a lot of functions,
because it's just not worth changing). I'd be much more comfortable
releasing something where "everything works fine, but if you use
astral characters in your strings, memory usage blows out by a factor
of four" (or "... the len() function takes O(N) time") than one where
"everything works fine as long as you use BMP only, but SMP characters
result in tests failing".

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

05.06.14 01:04, Terry Reedy написав(ла):

PS. You do not seem to be aware of how well the current PEP393
implementation works. If you are going to write any more about it, I
suggest you run Tools/Stringbench/stringbench.py for timings.


AFAIK stringbench is ASCII-only, so it is likely compatible with current
and any future MicroPython implementations, but is unlikely to expose
non-ASCII limitations or performance.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 18:04:52 -0400
Terry Reedy  wrote:

> On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:
> 
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
> >
> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
> 
> I would call what you are doing a 'Python 3.n subset, with

Thanks, that's what we call it ourselves in the docs linked in the
original message, and use n=4. Note that being a subset is not a design
requirement, but there's a higher-priority requirement of staying lean,
so realistically uPy will always stay a subset.

> limitations', where n should be a specific number, which I would urge
> should be at least 3, if not 4 ('yield from'). To me, that would mean
> that every Micropython program (that does not use a clearly
> non-Python addon like inline assembly) would run the same* on CPython
> 3.n. Conversely, a Python 3.n program should either run the same* on
> MicroPython as CPython, or raise. What most to avoid is giving
> different* answers.

That's a nice aim, but we don't have enough resources to implement it,
so we would appreciate any help from interested parties.

> *'same' does not include timing differences or normal float
> variations or bug fixes in MicroPython not in CPython.
> 
> As for unicode: I would see ascii-only (very limited codepoints) or
> bare utf-8 (limited speed == expanded time) as possibly fitting the 
> definition above. Just be clear what the limitations are. And accept 
> that there will be people who do not bother to read the limitations
> and then complain when they bang into them.
> 
> PS. You do not seem to be aware of how well the current PEP393 
> implementation works. If you are going to write any more about it, I 
> suggest you run Tools/Stringbench/stringbench.py for timings.

"Well" is subjective (or should be defined formally based on the
requirements). With my MicroPython hat on, an implementation which
receives a string, transcodes it, leading to bigger size, just to
immediately transcode back and send out - is awful, environment
unfriendly implementation ;-).


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Eric Snow
On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky  wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

Actually, there is a "formal, implementation-independent language spec":

https://docs.python.org/3/reference/

>
>> Realistically, most Python code that works on Python 3.4 won't work
>> on Micropython (for various reasons, not just the string behavior)
>> and neither does it need to.
>
> That's true. However, as was said, we're striving to provide a
> compatible implementation, and compatibility claims must be validated.
> While we have simple "in-house" testsuite, more serious compatibility
> validation requires running a testsuite for reference implementation
> (CPython), and that's gradually being approached.

To a large extent the test suite in
http://hg.python.org/cpython/file/default/Lib/test effectively
validates (full) compliance with the corresponding release (change
"default" to the release branch of your choice).  With that goal, no
small effort has been made to mark implementation-specific tests as
such.  So uPy could consider using the test suite (and explicitly skip
the tests for features that uPy doesn't support).
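
A rough sketch of what marking and skipping implementation-specific tests
can look like with plain unittest. The decorator and test names below are
made up for illustration; CPython's own suite has a similar helper,
test.support.cpython_only, and sys.implementation.name is the standard way
to ask which implementation is running.

```python
import sys
import unittest


def cpython_only(test):
    """Skip the test unless running on CPython (hypothetical local helper)."""
    return unittest.skipUnless(
        sys.implementation.name == "cpython",
        "relies on a CPython implementation detail",
    )(test)


class StrTests(unittest.TestCase):
    def test_index_returns_code_point(self):
        # Language-level behaviour: every implementation should pass this.
        self.assertEqual("a\U0001F600b"[1], "\U0001F600")

    @cpython_only
    def test_wide_chars_use_more_memory(self):
        # Storage size is an implementation detail, so other
        # implementations skip this test instead of failing it.
        wide = sys.getsizeof("\U0001F600" * 10)
        narrow = sys.getsizeof("a" * 10)
        self.assertGreater(wide, narrow)
```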

-eric


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy

On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:


That said, and unlike previous attempts to develop a small Python
implementations (which of course existed), we're striving to be exactly
a Python language implementation, not a Python-like language
implementation. As there's no formal, implementation-independent
language spec, what constitutes a compatible language implementation is
subject to opinions, and we welcome and appreciate independent review,
like this thread did.


Realistically, most Python code that works on Python 3.4 won't work
on Micropython (for various reasons, not just the string behavior)
and neither does it need to.


That's true. However, as was said, we're striving to provide a
compatible implementation, and compatibility claims must be validated.
While we have simple "in-house" testsuite, more serious compatibility
validation requires running a testsuite for reference implementation
(CPython), and that's gradually being approached.


I would call what you are doing a 'Python 3.n subset, with limitations', 
where n should be a specific number, which I would urge should be at 
least 3, if not 4 ('yield from'). To me, that would mean that every 
Micropython program (that does not use a clearly non-Python addon like 
inline assembly) would run the same* on CPython 3.n. Conversely, a 
Python 3.n program should either run the same* on MicroPython as 
CPython, or raise. What most to avoid is giving different* answers.


*'same' does not include timing differences or normal float variations 
or bug fixes in MicroPython not in CPython.


As for unicode: I would see ascii-only (very limited codepoints) or bare 
utf-8 (limited speed == expanded time) as possibly fitting the 
definition above. Just be clear what the limitations are. And accept 
that there will be people who do not bother to read the limitations and 
then complain when they bang into them.


PS. You do not seem to be aware of how well the current PEP393 
implementation works. If you are going to write any more about it, I 
suggest you run Tools/Stringbench/stringbench.py for timings.


--
Terry Jan Reedy




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman

On 6/4/2014 2:28 PM, Chris Angelico wrote:

On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman  wrote:

8) (Content specific variable size caches)  Index each codepoint that is a
different byte size than the previous codepoint, allowing indexing to be
used in the intervals. Worst case size is like 2, best case size is a single
entry for the end, when all code points are represented by the same number
of bytes.

Conceptually interesting, and I'd love to know how well that'd perform
in real-world usage.


So would I :)


Would do very nicely on blocks of text that are
all from the same range of codepoints, but if you intersperse high and
low codepoints it'll be like 2 but with significantly more complicated
lookups (imagine a "name=value\nname=value\n" stream where the names
and values are all in the same language - you'll have a lot of
transitions).


Lookup is binary search on code point index or a search for same in some 
tree structure, I would think.


"like 2 but ..." well, the data structure would be bigger than for 2, 
but your example shows 4-5 high codepoints per low codepoint (for some 
languages).


I did just think of another refinement to this technique (my list was 
not intended to be all-inclusive... just a bunch of variations I thought 
of then).


10) (Content specific variable size caches) Like 8, but the last 
character in a run is allowed (but not required) to be a different 
number of bytes than prior characters, because the offset calculation 
will still work for the first character of a different size.


So #10 would halve the size of your imagined stream that intersperses 
one low-byte character with each sequence of high-byte characters.
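
Scheme 8 with a binary-search lookup could be sketched roughly as follows.
This is one possible reading, not a tested design: it records one entry at
the start of each run of equal-width codepoints (rather than a single end
entry in the best case), assumes valid UTF-8 input, and all function names
are invented for the example.

```python
import bisect


def _width(lead):
    """Byte length of a UTF-8 sequence, from its lead byte (valid input assumed)."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def build_runs(data):
    """Record (codepoint index, byte offset, width) at each change of width."""
    starts, offsets, widths = [], [], []
    index = off = 0
    while off < len(data):
        w = _width(data[off])
        if not widths or widths[-1] != w:
            starts.append(index)
            offsets.append(off)
            widths.append(w)
        index += 1
        off += w
    return starts, offsets, widths, index   # index is the codepoint count


def byte_offset(runs, i):
    """Map codepoint index i to a byte offset via binary search over the runs."""
    starts, offsets, widths, length = runs
    if not 0 <= i < length:
        raise IndexError(i)
    r = bisect.bisect_right(starts, i) - 1   # run that contains codepoint i
    return offsets[r] + (i - starts[r]) * widths[r]


def char_at(data, runs, i):
    off = byte_offset(runs, i)
    return data[off:off + _width(data[off])].decode("utf-8")
```

An all-ASCII string gets a single run entry, while text that alternates
widths at every codepoint degenerates to one entry per codepoint, which is
the "like 2" worst case mentioned above.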


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread R. David Murray
On Thu, 05 Jun 2014 00:14:32 +0300, Paul Sokolovsky  wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

The language reference is also the language specification.  I don't
know what you mean by 'formal', so presumably it doesn't qualify
:)  That said, if there are places that are not correctly marked as
implementation specific, those are bugs in the reference and should
be fixed.  There almost certainly are still such bugs, and I suspect
MicroPython can help us fix them, just as PyPy did/does.

--David


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico
On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman  wrote:
> 8) (Content specific variable size caches)  Index each codepoint that is a
> different byte size than the previous codepoint, allowing indexing to be
> used in the intervals. Worst case size is like 2, best case size is a single
> entry for the end, when all code points are represented by the same number
> of bytes.

Conceptually interesting, and I'd love to know how well that'd perform
in real-world usage. Would do very nicely on blocks of text that are
all from the same range of codepoints, but if you intersperse high and
low codepoints it'll be like 2 but with significantly more complicated
lookups (imagine a "name=value\nname=value\n" stream where the names
and values are all in the same language - you'll have a lot of
transitions).

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy

On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0xFFFF) characters.


Indexing can be made O(log(k)) where k is the number of astral chars, 
and is usually small.
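
One way to get that bound, sketched here under assumptions: keep a sorted
list of the codepoint indexes of the astral characters next to the UTF-16
code units, and binary-search it on each lookup. The function names are
invented for the example, and the input is assumed to contain no lone
surrogates.

```python
import bisect


def encode_utf16_units(s):
    """Return (UTF-16 code units, sorted codepoint indexes of astral chars)."""
    units, astral = [], []
    for i, ch in enumerate(s):
        cp = ord(ch)
        if cp > 0xFFFF:
            astral.append(i)
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units, astral


def code_point_at(units, astral, i):
    """Codepoint i in O(log k), where k is the number of astral characters."""
    if i < 0:
        raise IndexError(i)
    # Every astral character before position i occupies one extra code unit.
    k = bisect.bisect_left(astral, i)
    u = units[i + k]
    if 0xD800 <= u < 0xDC00:
        low = units[i + k + 1]
        return chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
    return chr(u)
```

For a string with no astral characters the side list is empty and the
lookup reduces to a direct index into the code units.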


--
Terry Jan Reedy



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Terry Reedy

On 6/4/2014 3:41 AM, Jeff Allen wrote:

Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0xFFFF) characters.

I've toyed with making this O(1) universally. Like Steven, I understand
this to be a freedom afforded to implementers, rather than an issue of
conformity.

Jeff Allen

On 04/06/2014 02:17, Steven D'Aprano wrote:

There is a discussion over at MicroPython about the internal
representation of Unicode strings.

...

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.






--
Terry Jan Reedy



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 4 Jun 2014 11:25:51 -0700
Guido van Rossum  wrote:

> This thread has devolved into a flame war. I think we should trust the
> Micropython implementers (whoever they are -- are they participating
> here?) 

I'm a regular contributor. I'm not sure if the author, Damien George,
is on the list. In either case, he's a nice guy who prefers to do
development rather than participate in flame wars ;-). And for the
record, all opinions expressed are solely mine, and not the official
position of the MicroPython project.

> to know their users and let them do what feels right to them.
> We should just ask them not to claim full compatibility with any
> particular Python version -- that seems the most contentious point.

"Full" compatibility is never claimed, and understanding it as such is
optimistic, "between the lines" reading of some users. All of:
announcement posted on python-list (which prompted current inflow of
MicroPython-related discussions), README at
https://github.com/micropython/micropython , and detailed differences
doc https://github.com/micropython/micropython/wiki/Differences make it
clear there's no talk about "full" compatibility, and only specific
compatibility (and incompatibility) points are claimed.

That said, and unlike previous attempts to develop a small Python
implementations (which of course existed), we're striving to be exactly
a Python language implementation, not a Python-like language
implementation. As there's no formal, implementation-independent
language spec, what constitutes a compatible language implementation is
subject to opinions, and we welcome and appreciate independent review,
like this thread did.

> Realistically, most Python code that works on Python 3.4 won't work
> on Micropython (for various reasons, not just the string behavior)
> and neither does it need to.

That's true. However, as was said, we're striving to provide a
compatible implementation, and compatibility claims must be validated.
While we have simple "in-house" testsuite, more serious compatibility
validation requires running a testsuite for reference implementation
(CPython), and that's gradually being approached.

> 
> -- 
> --Guido van Rossum (python.org/~guido)



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Glenn Linderman

On 6/4/2014 6:14 AM, Steve Dower wrote:
I agree with Daniel. Directly indexing into text suggests an 
attempted optimization that is likely to be incorrect for a set of 
strings. Splitting, regex, concatenation and formatting are really the 
main operations that matter, and MicroPython can optimize their 
implementation of these easily enough for O(N) indexing.


Cheers,
Steve

Top-posted from my Windows Phone

From: Daniel Holth <mailto:dho...@gmail.com>
Sent: ‎6/‎4/‎2014 5:17
To: Paul Sokolovsky <mailto:pmis...@gmail.com>
Cc: python-dev <mailto:python-dev@python.org>
Subject: Re: [Python-Dev] Internal representation of strings and 
Micropython


If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.


I think native UTF-8 support is the most promising route for 
MicroPython Unicode support.


It would be an interesting proof-of-concept to implement an alternative 
CPython with PEP-393 replaced by UTF-8 internally... doing conversions 
for APIs that require a different encoding, but always maintaining and 
computing with the UTF-8 representation.


1) The first proof-of-concept implementation should implement codepoint 
indexing as an O(N) operation, searching from the beginning of the string 
for the Nth codepoint.


Other proof-of-concept implementations could implement a codepoint 
boundary cache; there could be a variety of caching algorithms.


2) (Least space efficient) An array that could be indexed by codepoint 
position and result in byte position. (This would use more space than a 
UTF-32 representation!)


3) (Most space efficient) One cached entry, that caches the last 
codepoint/byte position referenced. UTF-8 is able to be traversed in 
either direction, so "next/previous" codepoint access would be 
relatively fast (and such are very common operations, even when indexing 
notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)


4) (Fixed size caches)  N entries, one for the last codepoint, and 
others at Codepoint_Length/N intervals.  N could be tunable.


5) (Fixed size caches)  Like 4, plus an extra entry like 3.

6) (Variable size caches)  Like 2, but only indexing every  Nth code 
point.  N could be tunable.


7) (Variable size caches)  Like 6, plus an extra entry like 3.

8) (Content specific variable size caches)  Index each codepoint that is 
a different byte size than the previous codepoint, allowing indexing to 
be used in the intervals. Worst case size is like 2, best case size is a 
single entry for the end, when all code points are represented by the 
same number of bytes.


9) (Content specific variable size caches)  Like 8, only cache entries 
could indicate fixed or variable size characters in the next interval, 
with a scheme like 4 or 6 used to prevent one interval from covering the 
whole string.


Other hybrid schemes may present themselves as useful once experience is 
gained with some of these. It might be surprising how few algorithms 
need more than algorithm 3 to get reasonable performance.
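
A minimal sketch of scheme 6 from the list above (an index entry for every Nth code point), assuming the text is held as UTF-8 bytes. The class name SparseIndexedUtf8 and the stride parameter are hypothetical, negative indexes are left out for brevity, and this is not taken from any real implementation:

```python
class SparseIndexedUtf8:
    """Hypothetical UTF-8 backed string with a sparse code point index."""

    def __init__(self, text, stride=4):
        self._buf = text.encode("utf-8")
        self._stride = stride
        self._offsets = []   # byte offsets of code points 0, stride, 2*stride, ...
        self._length = 0     # length in code points
        for pos, byte in enumerate(self._buf):
            if byte & 0xC0 != 0x80:              # not a continuation byte
                if self._length % stride == 0:
                    self._offsets.append(pos)
                self._length += 1

    def __len__(self):
        return self._length

    def _next(self, pos):
        # Skip the lead byte, then any continuation bytes (0b10xxxxxx).
        pos += 1
        while pos < len(self._buf) and self._buf[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        pos = self._offsets[index // self._stride]
        for _ in range(index % self._stride):    # at most stride - 1 steps
            pos = self._next(pos)
        return self._buf[pos:self._next(pos)].decode("utf-8")
```

Indexing then costs one table lookup plus at most stride - 1 forward steps, so the stride trades memory for lookup time.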


Glenn


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano
On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
> 
> We're discussing the behaviour under a different (hypothetical) design 
> decision than a 1:1 relationship between code points and indexes, so 
> arguing from that stance doesn't make much sense.

I'm open to different implementations. I earlier even suggested that the 
choice of O(1) indexing versus O(N) indexing was a quality of 
implementation issue, not a make-or-break issue for whether something 
can call itself Python (or even "99% compatible with Python").

But I don't believe that exposing that implementation at the Python 
level is valid: regardless of whether it is efficient or not, I should 
be able to write code like this:

a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

That is not the case if you expose the underlying byte-level 
implementation at the Python level, and treat strings as an array of 
*bytes*. Paul seems to want to do this, or at least he wants Python 4 
to do this. I think it is *completely* inappropriate to do so.
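
A short sketch of why the byte-level view breaks that promise, assuming UTF-8 as the underlying encoding (the helper decodes_alone is made up for illustration):

```python
mystring = "---\u00ff---"

# Code point view: indexing and iteration agree, as the language promises.
a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

# Byte view of the same text (UTF-8 is an assumption of this sketch).
raw = mystring.encode("utf-8")
byte_items = [raw[i:i + 1] for i in range(len(raw))]


def decodes_alone(item):
    """Return True if a single byte is a complete UTF-8 sequence."""
    try:
        item.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two bytes that encode U+00FF are not characters on their own.
broken = [item for item in byte_items if not decodes_alone(item)]
```

Indexing the byte view yields eight items, two of which are fragments of one character rather than characters.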

I *think* you may agree with me, (correct me if I'm wrong) because you 
go on to agree with me:

> > e.g.
> > 
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> > 
> > That cannot fail with Python's semantics.
> 
> Agreed, and it shouldn't 

but I'm not actually sure.


> (I was actually referring to the optimization 
> being incorrect for the goal, not the language semantics). What you'd 
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may 
> be surprising, but is also correct.

You don't seem to be talking about sys.getsizeof, so I guess you're 
talking about something at the C level (or other underlying 
implementation), ignoring the object overhead. I don't know why you 
think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code 
points in a single byte, so whether you use UTF-32, UTF-16, UTF-8, 
Python 3.3's FSR or some other implementation, at least some code points 
are going to use more than one byte.
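
To make the sizes concrete, here is a small sketch that counts the bytes one code point occupies in each encoding (the helper name encoded_sizes is made up; the little-endian codecs are used only to leave out the byte order mark):

```python
def encoded_sizes(ch):
    """Return the number of bytes `ch` occupies in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in ("y", "\u00ff", "\u20ac", "\U0010ffff"):
    print("U+%06X" % ord(ch), encoded_sizes(ch))
```

For 'ÿ' (U+00FF) this reports 2 bytes in both UTF-8 and UTF-16, which matches the size of 2 mentioned above.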


> But what are you trying to achieve (why are you writing this code)? 
> All this example really shows is that you're only using indexing for 
> trivial purposes.

I'm trying to understand what point you are trying to make, because I'm 
afraid I don't quite get it.


[...]
> If copying into a separate list is a problem (memory-wise), 
> re.finditer('\\S+', string) also provides the same behaviour and gives 
> me the sliced string, so there's no need to index for anything.

finditer returns a bunch of MatchObjects, which give you the indexes 
of the found substring. Whether you do it yourself, or get the re 
module to do it, you're indexing somewhere.
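
A small illustration of that point, using only the standard re module (the sample text is made up): each match carries the code point indexes of the substring, and the group is the same as slicing with them.

```python
import re

text = "spam  \u00ffggs ham"

for m in re.finditer(r"\S+", text):
    # m.start() and m.end() are code point indexes into text;
    # m.group() is the same substring that slicing would give.
    assert text[m.start():m.end()] == m.group()

spans = [m.span() for m in re.finditer(r"\S+", text)]
words = [m.group() for m in re.finditer(r"\S+", text)]
```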


> The downside is that it isn't as easy to teach as the 1:1 
> relationship, and currently it doesn't perform as well *in CPython*. 
> But if MicroPython is focusing on size over speed, I don't see any 
> reason why they shouldn't permit different performance characteristics 
> and require a slightly different approach to highly-optimized coding.

I don't have a problem with different implementations, so long as that 
implementation isn't exposed at the Python level with changes of 
semantics such as breaking the promise that a string is an array of code 
points, not of bytes.

> In any case, this is an interesting discussion with a genuine effect 
> on the Python interpreter ecosystem. Jython and IronPython already 
> have different string implementations from CPython - having official 
> (and hopefully flexible) guidance on deviations from the reference 
> implementation would I think help other implementations provide even 
> more value, which is only a good thing for Python.

Yes, agreed.



-- 
Steven


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 20:52:14 +0300
Serhiy Storchaka  wrote:

[]
> > That's sad, I agree.
> 
> Other languages (Go, Rust) can be happy without O(1) indexing of 
> strings. All string and regex operations work with iterators or
> cursors, and I believe this approach is not significantly worse than
> implementing strings as O(1)-indexable arrays of characters (for some
> applications it can be worse, for others it can be better). But Python
> is a different language, it has different operations for strings and
> different idioms. A language which doesn't support O(1) indexing is
> not Python, it is only a Python-like language.

Sorry, but that's just your personal opinion, not shared by other
developers, as this thread showed. And let's not pretend we live in the
happily-ever-after world of Python 1.5.2, which doesn't need anything more
because it's perfect as it is. Somebody added all those iterators and
iterator-returning functions to Python. The problem Python has now is a
typical "last mile" problem: iterators were not applied completely
everywhere. There's little choice but to move in that direction, though.

What you call "idioms", other people call "sloppy programming
practices". There's a common suggestion for how to be at peace with Python's
indentation for those who find it a problem - "get over it". Well,
somehow it itches to say the same to people who think that Python 3 should
be used the same way as Python 1: get over the fact that Python is no
longer a funny little language being laughed at by the Perl crowd for being
an order of magnitude slower at processing text files. While you can still
do the funny little tricks we all love Python for, it now also offers a
framework to do it right, and it makes little sense to say that doing it
the funny little way is the definitive trait of Python.

(And for me it's easy to be so categorical - the only way I could
subscribe to the idea of running Python on an MCU and not be laughable is by
trusting Python to provide a framework for being efficient. I quit
working on another language because I trusted that the iterator,
generator, and buffer protocols are not funny little things but thoroughly
engineered, efficient concepts, and I don't feel betrayed.)


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Guido van Rossum
This thread has devolved into a flame war. I think we should trust the
Micropython implementers (whoever they are -- are they participating here?)
to know their users and let them do what feels right to them. We should
just ask them not to claim full compatibility with any particular Python
version -- that seems the most contentious point. Realistically, most
Python code that works on Python 3.4 won't work on Micropython (for various
reasons, not just the string behavior) and neither does it need to.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Stephen J. Turnbull
Serhiy Storchaka writes:

 > It would be interesting to collect statistics on how many indexing 
 > operations happen during the life of a string in a typical (Micro)Python 
 > program.

Probably irrelevant (I doubt anybody is going to be writing
programmers' editors in MicroPython), but by far the most frequently
called functions in XEmacs are byte_to_char_index and its inverse.



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 20:05, Paul Sokolovsky написав(ла):

On Wed, 04 Jun 2014 19:49:18 +0300
Serhiy Storchaka  wrote:

html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
don't use iterators. They use indices, str.find and/or regular
expressions. A common use case is to quickly find a substring starting
from the current position using str.find or re.search, process the found
token, advance the position and repeat.


That's sad, I agree.


Other languages (Go, Rust) can be happy without O(1) indexing of 
strings. All string and regex operations work with iterators or cursors, 
and I believe this approach is not significantly worse than implementing 
strings as O(1)-indexable arrays of characters (for some applications it 
can be worse, for others it can be better). But Python is a different 
language, it has different operations for strings and different idioms. 
A language which doesn't support O(1) indexing is not Python, it is only 
a Python-like language.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 17:49, Paul Sokolovsky написав(ла):

On Thu, 5 Jun 2014 00:26:10 +1000
Chris Angelico  wrote:

On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
 wrote:

04.06.14 10:03, Chris Angelico написав(ла):

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to
be copied and doubled in size (lots of heap space used).

But you need non-ASCII characters to display a title of MP3 track.


Yes, but to display a title, you don't need to do codepoint access at
random - you need to either take a block of memory (length in bytes) and
do something with it (pass to a C function, transfer over some bus,
etc.), or *iterate in order* over codepoints in a string. All these
operations are as efficient (O-notation) for UTF-8 as for UTF-32.


Several previous comments discuss the first option, ASCII-only strings.



Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 19:52, MRAB написав(ла):

In order to avoid indexing, you could use some kind of 'cursor' class to
step forwards and backwards along strings. The cursor could include
both the codepoint index and the byte index.


So you need a different string library and a different regular expression 
library.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 19:49:18 +0300
Serhiy Storchaka  wrote:

[]
> > But show me a real-world case for that. The common use case is
> > scanning a string left-to-right, which should be done using an
> > iterator and is thus O(N). Right-to-left scanning would be order(s)
> > of magnitude less frequent, and is also handled by an iterator.
> 
> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> don't use iterators. They use indices, str.find and/or regular
> expressions. A common use case is to quickly find a substring starting
> from the current position using str.find or re.search, process the
> found token, advance the position and repeat.

That's sad, I agree.


-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread MRAB

On 2014-06-04 14:33, Nick Coghlan wrote:

On 4 June 2014 15:39,   wrote:

On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:


There's a general expectation that indexing will be O(1) because
all the builtin containers that support that syntax use it for
O(1) lookup operations.


Depending on your definition of built in, there is at least one
standard library container that does not - collections.deque.

Given the specialized kinds of application this Python
implementation is targeted at, it seems UTF-8 is ideal considering
the huge memory savings resulting from the compressed
representation, and the reduced likelihood of there being any real
need for serious text processing on the device.


Right - I wasn't clear that I think storing text internally as UTF-8
sounds fine for MicroPython. Anything where the O(N) nature of
indexing by code point matters probably won't be run in that
environment anyway.


In order to avoid indexing, you could use some kind of 'cursor' class to
step forwards and backwards along strings. The cursor could include
both the codepoint index and the byte index.
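
A minimal sketch of such a cursor, assuming the string is held as valid UTF-8 bytes. The class name Utf8Cursor and its method names are hypothetical and not part of any existing library:

```python
class Utf8Cursor:
    """Hypothetical cursor over UTF-8 bytes that tracks two positions."""

    def __init__(self, data):
        self.data = data       # bytes assumed to hold valid UTF-8
        self.cp_index = 0      # position counted in code points
        self.byte_index = 0    # position counted in bytes

    def _end_of_current(self):
        # Skip the lead byte, then any continuation bytes (0b10xxxxxx).
        end = self.byte_index + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return end

    def current(self):
        """Return the code point under the cursor as a one-character str."""
        if self.byte_index >= len(self.data):
            raise IndexError("cursor is at the end of the string")
        return self.data[self.byte_index:self._end_of_current()].decode("utf-8")

    def forward(self):
        """Step one code point to the right."""
        if self.byte_index >= len(self.data):
            raise IndexError("cursor is at the end of the string")
        self.byte_index = self._end_of_current()
        self.cp_index += 1

    def backward(self):
        """Step one code point to the left."""
        if self.byte_index == 0:
            raise IndexError("cursor is at the start of the string")
        self.byte_index -= 1
        while self.data[self.byte_index] & 0xC0 == 0x80:
            self.byte_index -= 1
        self.cp_index -= 1
```

Each step touches at most four bytes, so stepping in either direction is O(1) even though random indexing is not.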


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 18:38, Paul Sokolovsky написав(ла):

Any non-trivial text parsing uses indices or regular expressions (and
regular expressions themselves use indices internally).


I keep hearing this, and unfortunately so far I haven't had enough
time to collect it all and provide a detailed response. So,
here's a spur-of-the-moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to a string and taking substrings of a string.

Character-by-character random access implies that you would need to scan
through (possibly almost) all chars in a string. That's O(N) (N being the
length of the string). With a variable-length encoding (taking O(N) to
index an arbitrary char), there's thus a concern that this would be an
O(N^2) operation.

But show me a real-world case for that. The common use case is scanning
a string left-to-right, which should be done using an iterator and is
thus O(N). Right-to-left scanning would be order(s) of magnitude less
frequent, and is also handled by an iterator.


html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't 
use iterators. They use indices, str.find and/or regular expressions. 
A common use case is to quickly find a substring starting from the 
current position using str.find or re.search, process the found token, 
advance the position and repeat.
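
The find/process/advance pattern described here can be sketched as follows (split_fields is a made-up helper written only to show the shape of the loop; it is not how any of the named modules is actually implemented):

```python
def split_fields(text, sep=","):
    """Yield the fields of `text` separated by `sep`, tracking a position."""
    pos = 0
    while True:
        end = text.find(sep, pos)      # search from the current position
        if end == -1:
            yield text[pos:]           # last field
            return
        yield text[pos:end]            # process the found token
        pos = end + len(sep)           # advance past the separator, repeat
```

Every call to text.find(sep, pos) and every slice takes a code point index, which is why the cost of turning an index into a storage position matters for this style of parser.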





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread INADA Naoki
For Jython and IronPython, UTF-16 may be the best internal encoding.

Recent languages (Swift, Golang, Rust) chose UTF-8 as their internal
encoding. Using UTF-8 is simple and efficient. For example, there is no
need for a UTF-8 copy of the string when writing to a file
or serializing to JSON.

When implementing Python in these languages, UTF-8 will be the best
internal encoding.

To allow Python implementations other than CPython to use UTF-8 or
UTF-16 as their internal encoding efficiently,
I think adding an internal-position-based API is the best solution.

>>> s = "\U00100000x"
>>> len(s)
2
>>> s[1:]
'x'
>>> s.find('x')
1
>>> # s.isize() # Internal length. 5 for utf-8, 3 for utf-16
>>> # s.ifind('x') # Internal position, 4 for utf-8, 2 for utf-16
>>> # s.islice(s.ifind('x')) => 'x'
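
A rough emulation of the proposed methods as plain functions, assuming UTF-8 as the internal encoding. The names isize, ifind and islice come from the comments above; the function form, the signatures, and the test string (one supplementary code point followed by 'x') are assumptions made for this sketch:

```python
def isize(s):
    """Internal length: number of UTF-8 bytes."""
    return len(s.encode("utf-8"))


def ifind(s, sub):
    """Internal position of `sub`: a byte offset, or -1 if not found."""
    # A match on valid UTF-8 always starts on a code point boundary,
    # because lead bytes and continuation bytes never look alike.
    return s.encode("utf-8").find(sub.encode("utf-8"))


def islice(s, start):
    """Slice from an internal (byte) position to the end."""
    return s.encode("utf-8")[start:].decode("utf-8")


s = "\U00100000x"
print(len(s), s.find("x"))          # code point view
print(isize(s), ifind(s, "x"))      # internal (byte) view
print(islice(s, ifind(s, "x")))
```

A real implementation would work on its internal buffer directly instead of re-encoding on every call; the re-encoding here only stands in for that buffer.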


(I like the design of Golang and Rust. I hope CPython uses UTF-8 as
its internal encoding in the future.
But this is off-topic.)


On Wed, Jun 4, 2014 at 4:41 PM, Jeff Allen  wrote:
> Jython uses UTF-16 internally -- probably the only sensible choice in a
> Python that can call Java. Indexing is O(N), fundamentally. By
> "fundamentally", I mean for those strings that have not yet noticed that
> they contain no supplementary (>0xFFFF) characters.
>
> I've toyed with making this O(1) universally. Like Steven, I understand this
> to be a freedom afforded to implementers, rather than an issue of
> conformity.
>
> Jeff Allen
>
>
> On 04/06/2014 02:17, Steven D'Aprano wrote:
>>
>> There is a discussion over at MicroPython about the internal
>> representation of Unicode strings.
>
> ...
>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>>
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com



-- 
INADA Naoki  


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 01:00:52 +1000
Chris Angelico  wrote:

> On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky 
> wrote:
> >> > But you need non-ASCII characters to display a title of MP3
> >> > track.
> >
> > Yes, but to display a title, you don't need to do codepoint access
> > at random - you need to either take a block of memory (length in
> > bytes) and do something with it (pass to a C function, transfer
> > over some bus, etc.), or *iterate in order* over codepoints in a
> > string. All these operations are as efficient (O-notation) for
> > UTF-8 as for UTF-32.
> 
> Suppose you have a long title, and you need to abbreviate it by
> dropping out words (delimited by whitespace), such that you keep the
> first word (always) and the last (if possible) and as many as possible
> in between. How are you going to write that? With PEP 393 or UTF-32
> strings, you can simply record the index of every whitespace you find,
> count off lengths, and decide what to keep and what to ellipsize.

I'll submit an angry bug report along the lines of "WWWHAT, it's 3.5 and
there's still no str.isplit()??!!11", then do it with re.finditer()
(while submitting another report on the inconsistent naming scheme).

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Mark Lawrence

On 04/06/2014 16:32, Steve Dower wrote:


If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', 
string) also provides the same behaviour and gives me the sliced string, so 
there's no need to index for anything.



Out of idle curiosity is there anything that stops MicroPython, or any 
other implementation for that matter, from providing views of a string 
rather than copying every time?  IIRC memoryviews in CPython rely on the 
buffer protocol at the C API level, so since strings don't support this 
protocol you can't take a memoryview of them.  Could this actually be 
implemented in the future, is the underlying C code just too 
complicated, or what?
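
For reference, a small sketch of the current CPython 3 behaviour described above: str does not expose the buffer protocol, while the encoded bytes do, so a zero-copy view is only available at the byte level (the sample text is made up):

```python
text = "title \u00ff"

# str does not support the buffer protocol, so this raises TypeError.
try:
    memoryview(text)
    str_supports_buffer = True
except TypeError:
    str_supports_buffer = False

raw = text.encode("utf-8")
view = memoryview(raw)[6:]            # slicing a memoryview copies nothing
tail = bytes(view).decode("utf-8")    # the copy happens only here
```

Note that the slice positions of the view are byte offsets, not code point indexes.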


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence





Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steve Dower
Paul Sokolovsky wrote:
> You just shouldn't write inefficient programs, voila. But if you want, you 
> can keep writing inefficient programs, they just will be inefficient. Peace.

Can I nominate this for QOTD? :)

Cheers,
Steve


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Wed, 04 Jun 2014 17:40:14 +0300
Serhiy Storchaka  wrote:

> 04.06.14 17:02, Paul Moore написав(ла):
> > On 4 June 2014 14:39, Serhiy Storchaka  wrote:
> >> I think that breaking the O(1) expectation for indexing makes the
> >> implementation significantly incompatible with Python. Virtually all
> >> string operations in Python operate with indices.
> >
> > I don't use indexing on strings except in rare situations. Sure I
> > use lots of operations that may well use indexing *internally* but
> > that's the point. MicroPython can optimise those operations without
> > needing to guarantee O(1) indexing, and I'd be fine with that.
> 
> Any non-trivial text parsing uses indices or regular expressions (and 
> regular expressions themselves use indices internally).

I keep hearing this, and unfortunately so far I haven't had enough
time to collect it all and provide a detailed response. So,
here's a spur-of-the-moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to a string and taking substrings of a string.

Character-by-character random access implies that you would need to scan
through (possibly almost) all chars in a string. That's O(N) (N being the
length of the string). With a variable-length encoding (taking O(N) to
index an arbitrary char), there's thus a concern that this would be an
O(N^2) operation.

But show me a real-world case for that. The common use case is scanning
a string left-to-right, which should be done using an iterator and is
thus O(N). Right-to-left scanning would be order(s) of magnitude less
frequent, and is also handled by an iterator.

What's next? You're doing some funky anagrams and need to swap every 2
adjacent chars? Sorry, a naive implementation will be slow. If you're in
the serious anagram business, you'll need to code a C extension. No, wait!
Instead you should learn Python better. You should run a string
windowing iterator which returns adjacent pairs, and swap those
constant-length strings.
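
One way to read the "windowing iterator" suggestion, as a sketch (swap_adjacent is a made-up name; the point is that it walks the string once with an iterator and never indexes it):

```python
def swap_adjacent(text):
    """Swap each pair of adjacent characters using only iteration."""
    it = iter(text)
    out = []
    for first in it:
        second = next(it, "")      # empty when the length is odd
        out.append(second + first)
    return "".join(out)


print(swap_adjacent("abcde"))
```

This stays O(N) regardless of whether the underlying storage offers O(1) indexing.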

More cases anyone? Implementing DES and doing arbitrary permutations?
Kindly drop doing that on strings, do it on bytes or lists.

Hopefully, the idea is clear - if you *scan* through a string using indexes
in *random* order, you're doing a weird thing and *want* weird
performance. As for doing stuff like s[0] or s[-1] - there's a finite (and
small) number of such operations per string.



Now about taking substrings of strings (which in Python is often expressed
by slice indexing). Well, this is quite different from scanning each
character of a string. Just like s[0]/s[-1], this usually happens a
finite number of times for a particular string, independent of its
length, i.e. O(1) times (e.g., you take a string and split it in 3
parts), or maybe the number of substrings is not fixed, but has a
different growth order, O(M) (for example, you split a string into tokens;
tokens can be long, but there are usually external limits on how many
it's sensible to have on one line).

So, again, you're not going to get quadratic time unless you're unlucky
or sloppy. And once again, you should brush up your Python skills and
use regex functions which return iterators to get your parsed tokens,
etc.

(To clarify the obvious - "you" here is abstract pronoun, not referring
to respectable Python developers who actually made it possible to write
efficient Python programs).


So, hopefully the point is conveyed - you can write inefficient Python
programs. CPython goes out of its way to hide many inefficiencies (at the
cost of unbelievably bloated heap usage - from the point of view of uPy,
which starts up in a 2K heap). You just shouldn't write inefficient programs,
voila. But if you want, you can keep writing inefficient programs, they
just will be inefficient. Peace.

> It would be interesting to collect statistics on how many
> indexing operations happen during the life of a string in a typical
> (Micro)Python program.

Yup.

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steve Dower
Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing the behaviour under a different (hypothetical) design decision 
than a 1:1 relationship between code points and indexes, so arguing from that 
stance doesn't make much sense.

> e.g.
> 
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
> 
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being 
incorrect for the goal, not the language semantics). What you'd probably find 
is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is 
also correct.

But what are you trying to achieve (why are you writing this code)? All this 
example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use 
indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always) 
> and
> the last (if possible) and as many as possible in between. How are you going 
> to
> write that? With PEP 393 or UTF-32 strings, you can simply record the index of
> every whitespace you find, count off lengths, and decide what to keep and what
> to ellipsize.

"Recording the index" is where the optimization comes in. With a 
variable-length encoding - heck, even with a fixed-length one - I'd just use 
str.split(' ') (or re.split('\\s', string), depending on how much I care about 
the type of delimiter) and manipulate the list.

If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', 
string) also provides the same behaviour and gives me the sliced string, so 
there's no need to index for anything.
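
A sketch of the title-abbreviation task done with str.split and list manipulation only. The function name abbreviate, the "..." marker, and the greedy keep-words-from-the-front policy are all assumptions made here; the original problem statement leaves the exact policy open:

```python
def abbreviate(title, limit, ellipsis="..."):
    """Shorten `title` to about `limit` characters by dropping middle words."""
    words = title.split()
    full = " ".join(words)
    if len(words) < 2 or len(full) <= limit:
        return full

    last = words[-1]

    def render(head):
        # Kept leading words, then the marker, then the last word.
        return " ".join(head + [ellipsis, last])

    head = [words[0]]                      # the first word is always kept
    if len(render(head)) > limit:
        return words[0] + " " + ellipsis   # no room for the last word
    for word in words[1:-1]:
        if len(render(head + [word])) > limit:
            break
        head.append(word)
    return render(head)


print(abbreviate("The Quick Brown Fox Jumps Over The Lazy Dog", 25))
```

No string position is recorded or indexed anywhere; the only lengths involved are those of whole words.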

The downside is that it isn't as easy to teach as the 1:1 relationship, and 
currently it doesn't perform as well *in CPython*. But if MicroPython is 
focusing on size over speed, I don't see any reason why they shouldn't permit 
different performance characteristics and require a slightly different approach 
to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the 
Python interpreter ecosystem. Jython and IronPython already have different 
string implementations from CPython - having official (and hopefully flexible) 
guidance on deviations from the reference implementation would I think help 
other implementations provide even more value, which is only a good thing for 
Python.

Cheers,
Steve


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth
On Wed, Jun 4, 2014 at 10:12 AM, Steven D'Aprano  wrote:
> On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
>> I agree with Daniel. Directly indexing into text suggests an
>> attempted optimization that is likely to be incorrect for a set of
>> strings.
>
> I'm afraid I don't understand this argument. The language semantics says
> that a string is an array of code points. Every index relates to a
> single code point, no code point extends over two or more indexes.
> There's a 1:1 relationship between code points and indexes. How is
> direct indexing "likely to be incorrect"?

"Useful" is probably a better word. When you get into the complicated
languages and you want to know how wide something is, and you might
have y with two dots on it as one code point or as two, plus left-to-right
and right-to-left indicators and who knows what else... then looking
at individual code points only works sometimes. I get the slicing
idea.
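
To make the "one code point or two" case concrete, here is a small sketch using the standard unicodedata module. The variable and function names are illustrative only:

```python
import unicodedata

composed = "\u00ff"      # y with diaeresis as one precomposed code point
decomposed = "y\u0308"   # plain y followed by COMBINING DIAERESIS

def code_point_names(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# Both strings render the same, but they differ in length and compare
# unequal until they are normalized to the same form.
print(len(composed), code_point_names(composed))
print(len(decomposed), code_point_names(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Indexing by code point is still well defined for both strings; it just does not answer the question of what a reader would call one character.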

I like the idea that encoding to utf-8 would be the fastest thing you
can do with a string. You could consider doing regexps in that domain,
and other implementation specific optimizations in exactly the same
way that any Python implementation has them.

None of this would make it harder to move a servo.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 17:02, Paul Moore wrote:
> On 4 June 2014 14:39, Serhiy Storchaka wrote:
>> I think that breaking the O(1) expectation for indexing makes the
>> implementation significantly incompatible with Python. Virtually all
>> string operations in Python operate with indices.
>
> I don't use indexing on strings except in rare situations. Sure I use
> lots of operations that may well use indexing *internally* but that's
> the point. MicroPython can optimise those operations without needing
> to guarantee O(1) indexing, and I'd be fine with that.


Any non-trivial text parsing uses indices or regular expressions (and 
regular expressions themselves use indices internally).


It would be interesting to collect statistics on how many indexing 
operations happen during the lifetime of a string in a typical 
(Micro)Python program.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico
On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky  wrote:
>> > But you need non-ASCII characters to display the title of an MP3 track.
>
> Yes, but to display a title, you don't need to do codepoint access at
> random - you need to either take a block of memory (length in bytes) and
> do something with it (pass to a C function, transfer over some bus,
> etc.), or *iterate in order* over codepoints in a string. All these
> operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Suppose you have a long title, and you need to abbreviate it by
dropping out words (delimited by whitespace), such that you keep the
first word (always) and the last (if possible) and as many as possible
in between. How are you going to write that? With PEP 393 or UTF-32
strings, you can simply record the index of every whitespace you find,
count off lengths, and decide what to keep and what to ellipsize.
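
A rough sketch of that approach, assuming a string type with cheap indexing and slicing. The function names, the "..." marker, and the exact fitting rule are my own choices for illustration, not something specified in this thread:

```python
def word_spans(text):
    """Record (start, end) code point indices of every word in text."""
    spans = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans

def abbreviate(title, max_len, ellipsis="..."):
    """Drop whole words so the result fits in max_len where possible.

    Keeps the first word always, the last word if it fits, and as many
    of the words after the first as still fit.
    """
    if len(title) <= max_len:
        return title
    spans = word_spans(title)
    if len(spans) < 2:
        return title  # nothing that can be dropped
    first_start, first_end = spans[0]
    last_start, last_end = spans[-1]
    joiner = " " + ellipsis + " "
    fixed = len(joiner) + (last_end - last_start)
    if (first_end - first_start) + fixed > max_len:
        # Even first + last does not fit: keep only the first word.
        return title[first_start:first_end] + " " + ellipsis
    keep_end = first_end
    for _, end in spans[1:-1]:
        # Pure index arithmetic: no intermediate strings are built.
        if (end - first_start) + fixed > max_len:
            break
        keep_end = end
    return title[first_start:keep_end] + joiner + title[last_start:last_end]

print(abbreviate("Dvořák Symphony No. 9 From the New World", 30))
```

All the length decisions are made on recorded offsets, and slicing happens only once at the end, which is where O(1) indexing pays off.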

> Some operations are not going to be as fast, so - oops - avoid doing
> them without good reason. And kindly drop expectations that doing
> arbitrary operations on *Unicode* is as efficient as you imagined.
> (Note the *Unicode* in general, not the particular flavor you got
> used to, up to thinking it's the one and only "right" flavor.)

Not sure what you mean by flavors of Unicode. Unicode is a mapping of
codepoints to characters, not an in-memory representation. And I've
been working with Python 3.3 since before it came out, and with Pike
(which has a very similar model) for longer, and in both of them, I
casually perform operations on Unicode strings in the same way that I
used to perform operations on REXX strings (which were eight-bit in
the current system codepage - 437 for us). I do expect those
operations to be efficient, and I get what I expect.

Maybe they won't be in uPy, but that would be a limitation of uPy, not
a fundamental problem with Unicode.

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Sokolovsky
Hello,

On Thu, 5 Jun 2014 00:26:10 +1000
Chris Angelico  wrote:

> On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
>  wrote:
> > 04.06.14 10:03, Chris Angelico wrote:
> >
> >> Right, which is why I don't like the idea. But you don't need
> >> non-ASCII characters to blink an LED or turn a servo, and there is
> >> significant resistance to the notion that appending a non-ASCII
> >> character to a long ASCII-only string requires the whole string to
> >> be copied and doubled in size (lots of heap space used).
> >
> >
> > > But you need non-ASCII characters to display the title of an MP3 track.

Yes, but to display a title, you don't need to do codepoint access at
random - you need to either take a block of memory (length in bytes) and
do something with it (pass to a C function, transfer over some bus,
etc.), or *iterate in order* over codepoints in a string. All these
operations are as efficient (O-notation) for UTF-8 as for UTF-32.
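
As an illustration of in-order iteration, here is a sketch of a single forward pass over UTF-8 bytes. It assumes well-formed input and does not validate continuation bytes; the function name is hypothetical and this is not MicroPython's actual code:

```python
def iter_code_points(data):
    """Yield code points from well-formed UTF-8 bytes in one forward pass."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte alone tells us how long the sequence is.
        if lead < 0x80:
            size, cp = 1, lead
        elif lead >> 5 == 0b110:
            size, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            size, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            size, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1:i + size]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += size

for cp in iter_code_points("café ∞".encode("utf-8")):
    print(hex(cp))
```

Every byte is visited once, so the pass is O(n) in the byte length, the same order as walking a fixed-width array.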

Some operations are not going to be as fast, so - oops - avoid doing
them without good reason. And kindly drop expectations that doing
arbitrary operations on *Unicode* is as efficient as you imagined.
(Note the *Unicode* in general, not the particular flavor you got
used to, up to thinking it's the one and only "right" flavor.)

> Agreed. IMO, any Python, no matter how micro, needs full Unicode
> support; but there is resistance from uPy's devs.

FUD ;-).

> 
> ChrisA

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano
On Wed, Jun 04, 2014 at 01:38:57PM +0300, Paul Sokolovsky wrote:

> That's another reason why people don't like Unicode enforced upon them

Enforcing design and language decisions is the job of the programming 
language. You might as well complain that Python forces C doubles as the 
floating point type, or that it forces Bignums as the integer type, or 
that it forces significant indentation, or "class" as a keyword. Or that 
C forces you to use braces and manage your own memory. That's the 
purpose of the language, to make those decisions as to what features to 
provide and what not to provide.


> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script 

I have no words to describe how ridiculous this accusation is.


> than give up on their arbitrarily chosen, one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

 
> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters is not known to the runtime, only in bytes

What does uPy return for the length of '∞'? If the answer is anything 
but 1, that's a bug.
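
For reference, a short sketch of the two lengths in question, as seen on a Python 3 with code point semantics (the variable names are illustrative):

```python
s = "\u221e"                  # '∞', a single code point
encoded = s.encode("utf-8")   # its UTF-8 form is three bytes long

print(len(s))        # length in code points
print(len(encoded))  # length in bytes
print(encoded)
```

An implementation that reports the byte count for len(s) is exposing its storage format rather than the string's contents.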


-- 
Steven


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Chris Angelico
On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka  wrote:
> 04.06.14 10:03, Chris Angelico wrote:
>
>> Right, which is why I don't like the idea. But you don't need
>> non-ASCII characters to blink an LED or turn a servo, and there is
>> significant resistance to the notion that appending a non-ASCII
>> character to a long ASCII-only string requires the whole string to be
>> copied and doubled in size (lots of heap space used).
>
>
> But you need non-ASCII characters to display the title of an MP3 track.

Agreed. IMO, any Python, no matter how micro, needs full Unicode
support; but there is resistance from uPy's devs.

ChrisA


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 10:03, Chris Angelico wrote:
> Right, which is why I don't like the idea. But you don't need
> non-ASCII characters to blink an LED or turn a servo, and there is
> significant resistance to the notion that appending a non-ASCII
> character to a long ASCII-only string requires the whole string to be
> copied and doubled in size (lots of heap space used).

But you need non-ASCII characters to display the title of an MP3 track.




Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Steven D'Aprano
On Wed, Jun 04, 2014 at 01:14:04PM +, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an 
> attempted optimization that is likely to be incorrect for a set of 
> strings. 

I'm afraid I don't understand this argument. The language semantics says 
that a string is an array of code points. Every index relates to a 
single code point, no code point extends over two or more indexes. 
There's a 1:1 relationship between code points and indexes. How is 
direct indexing "likely to be incorrect"?

e.g.

s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'

That cannot fail with Python's semantics.

[Aside: it does fail in Python 2, showing that the idea that "strings 
are bytes" is fatally broken. Fortunately Python has moved beyond that.]
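
The Python 2 failure can be reproduced on Python 3 by working on the encoded bytes instead of the text. This sketch assumes the Python 2 source file was saved as UTF-8, so that the str literal held UTF-8 bytes:

```python
s = "---\u00ff---"              # the text, with ÿ as one code point
raw = s.encode("utf-8")         # what a Python 2 str would have held
needle = "\u00ff".encode("utf-8")

# Text semantics: the index points at a whole code point.
assert s[s.index("\u00ff")] == "\u00ff"

# Byte semantics: the index points at the first of two bytes, so a
# single-item lookup returns only half of the character.
offset = raw.index(needle)
print(offset, raw[offset:offset + 1], needle)
```

The byte at that offset is only the lead byte of the two-byte sequence, which is why the equivalent assert fails under bytes semantics.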


> Splitting, regex, concatenation and formatting are really the 
> main operations that matter, and MicroPython can optimize their 
> implementation of these easily enough for O(N) indexing.

Really? Well, it will be a nice experiment. Fortunately MicroPython runs 
under Linux as well as on embedded systems (a clever decision, by the 
way) so I look forward to seeing how their internal-utf8 implementation 
stacks up against CPython's FSR implementation.

Out of curiosity, when the FSR was proposed, did anyone consider an 
internal UTF-8 representation? If so, why was it rejected?




-- 
Steven


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Paul Moore
On 4 June 2014 14:39, Serhiy Storchaka  wrote:
> I think that breaking the O(1) expectation for indexing makes the implementation
> significantly incompatible with Python. Virtually all string operations in
> Python operate with indices.

I don't use indexing on strings except in rare situations. Sure I use
lots of operations that may well use indexing *internally* but that's
the point. MicroPython can optimise those operations without needing
to guarantee O(1) indexing, and I'd be fine with that.

Paul


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Daniel Holth
MicroPython is going to be significantly incompatible with Python
anyway. But you should be able to run your mp code on regular Python.

On Wed, Jun 4, 2014 at 9:39 AM, Serhiy Storchaka  wrote:
> 04.06.14 04:17, Steven D'Aprano wrote:
>
>> Would either of these trade-offs be acceptable while still claiming
>> "Python 3.4 compatibility"?
>>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>
>
> I think that breaking the O(1) expectation for indexing makes the implementation
> significantly incompatible with Python. Virtually all string operations in
> Python operate with indices.
>
> O(1) indexing operations can be kept with minimal memory requirements if
> Unicode is implemented internally as modified UTF-8 plus an optional array
> of offsets for every, say, 32nd character (which can even be compressed to
> an array of 16-bit or 32-bit integers).


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Serhiy Storchaka

04.06.14 04:17, Steven D'Aprano wrote:
> Would either of these trade-offs be acceptable while still claiming
> "Python 3.4 compatibility"?
>
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.

I think that breaking the O(1) expectation for indexing makes the 
implementation significantly incompatible with Python. Virtually all 
string operations in Python operate with indices.

O(1) indexing operations can be kept with minimal memory requirements if 
Unicode is implemented internally as modified UTF-8 plus an optional array 
of offsets for every, say, 32nd character (which can even be compressed to 
an array of 16-bit or 32-bit integers).
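
A sketch of the offset-table idea, under some loudly stated assumptions: it uses standard UTF-8 rather than modified UTF-8, stores the offsets in a plain Python list rather than a packed integer array, and supports only integer indices. The class and helper names are hypothetical:

```python
def _width(lead):
    """Byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

class IndexedUTF8:
    """UTF-8 storage plus a byte offset for every STRIDE-th code point."""

    STRIDE = 32

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.offsets = []
        self.length = 0
        pos = 0
        while pos < len(self.data):
            if self.length % self.STRIDE == 0:
                self.offsets.append(pos)
            pos += _width(self.data[pos])
            self.length += 1

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError("string index out of range")
        # Jump to the nearest recorded offset, then scan forward over
        # at most STRIDE - 1 code points.
        pos = self.offsets[index // self.STRIDE]
        for _ in range(index % self.STRIDE):
            pos += _width(self.data[pos])
        end = pos + _width(self.data[pos])
        return self.data[pos:end].decode("utf-8")

sample = IndexedUTF8("a\u00ff\u221e\U0001f600" * 50)
print(len(sample), len(sample.offsets), sample[101])
```

The forward scan is bounded by the stride, so lookup cost is constant with respect to string length, and the table costs one offset per 32 code points. A pure-ASCII string could skip the table entirely, since byte offset and index coincide.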



