Re: [Python-Dev] PEP 461 updates
On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin wrote:
>> long as numpy.loadtxt is explicitly documented as only working with
>> latin-1 encoded files (it currently isn't), there's no problem.
>
> Actually there is a problem. If it explicitly specified the encoding as
> latin-1 when opening the file then it could document the fact that it
> works for latin-1 encoded files. However it actually uses the system
> default encoding to read the file

which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well thought out one, but rather what fell out of trying to make it "just work" on py3.

> and then converts the strings to bytes with the as_bytes function that
> is hard-coded to use latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> So it only works if the system default encoding is latin-1 and the
> file content is white-space and newline compatible with latin-1.
> Regardless of whether the file itself is in utf-8 or latin-1 it will
> only work if the system default encoding is latin-1. I've never used a
> system that had latin-1 as the default encoding (unless you count
> cp1252 as latin-1).

Even if it were a common default it would be a "bad idea". Fortunately(?), it really is broken, so we can fix it without being too constrained by backwards compatibility.

>> If it's supposed to work with other encodings (but the entire file is
>> still required to use a consistent encoding), then it just needs
>> encoding and errors arguments to fit the Python 3 text model (with
>> "latin-1" documented as the default encoding).
>
> This is the right solution. Have an encoding argument, document the
> fact that it will use the system default encoding if none is
> specified, and re-encode using the same encoding to fit any dtype='S'
> bytes column. This will then work for any encoding including the ones
> that aren't ASCII-compatible (e.g. utf-16).

Exactly, except I don't think the system encoding is a good choice for a default. If there is a default, most people will use it, and it will work for a lot of their test code. Then it will break if the code is passed to a system with a different default encoding, or a file comes from another source in a different encoding. That is very, very likely -- far more likely than files consistently being in the system encoding.

>> default behaviour, since passing something like
>> codecs.getdecoder("utf-8") as a column converter should do the right
>> thing.

That seems to work at the moment, actually, if done with care.

> That's just getting silly IMO. If the file uses mixed encodings then I
> don't consider it a valid "text file" and see no reason for loadtxt to
> support reading it.

Agreed -- that's just getting crazy. The only use case I can imagine is cleaning up a file that got mojibaked by some other process -- not really the use case for loadtxt and friends.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R   (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115   (206) 526-6317 main reception
chris.bar...@noaa.gov

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 461 updates
On 19 January 2014 06:19, Nick Coghlan wrote:
> While I agree it's not relevant to the PEP 460/461 discussions, so
> long as numpy.loadtxt is explicitly documented as only working with
> latin-1 encoded files (it currently isn't), there's no problem.

Actually there is a problem. If it explicitly specified the encoding as latin-1 when opening the file then it could document the fact that it works for latin-1 encoded files. However it actually uses the system default encoding to read the file and then converts the strings to bytes with the as_bytes function that is hard-coded to use latin-1:
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the file content is white-space and newline compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1 it will only work if the system default encoding is latin-1. I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).

> If it's supposed to work with other encodings (but the entire file is
> still required to use a consistent encoding), then it just needs
> encoding and errors arguments to fit the Python 3 text model (with
> "latin-1" documented as the default encoding).

This is the right solution. Have an encoding argument, document the fact that it will use the system default encoding if none is specified, and re-encode using the same encoding to fit any dtype='S' bytes column. This will then work for any encoding including the ones that aren't ASCII-compatible (e.g. utf-16). Then instead of having a compat module with an as_bytes helper to get rid of all the unicode strings on Python 3, you can have a compat module with an open_unicode helper to do the right thing on Python 2.
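To make the failure mode concrete, here is a minimal sketch (the as_bytes here mirrors the hard-coded latin-1 helper linked above; the sample text is illustrative, not from the thread):

```python
def as_bytes(s):
    # Mirrors numpy.compat.py3k.as_bytes: latin-1 is hard-coded.
    return s.encode('latin-1')

raw = 'naïve'.encode('utf-8')      # the file on disk is utf-8
decoded = raw.decode('latin-1')    # but it is read with the wrong codec
assert as_bytes(decoded) == raw    # the raw bytes survive the round trip...
assert decoded != 'naïve'          # ...but the decoded *text* is mojibake
```

Latin-1 round-trips any byte value, which is exactly why the bug stays hidden until someone looks at the decoded text.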
The as_bytes function is just a way of fighting the Python 3 text model: "I don't care about mojibake, just do whatever it takes to shut up the interpreter and its error messages and make sure it works for ASCII data."

> If it is intended to allow S columns to contain text in arbitrary
> encodings, then that should also be supported by the current API with
> an adjustment to the default behaviour, since passing something like
> codecs.getdecoder("utf-8") as a column converter should do the right
> thing. However, if you're currently decoding S columns with latin-1
> *before* passing the value to the converter, then you'll need to use a
> WSGI style decoding dance instead:
>
>     def fix_encoding(text):
>         return text.encode("latin-1").decode("utf-8")  # For example

That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.

> That's more wasteful than just passing the raw bytes through for
> decoding, but is the simplest backwards compatible option if you're
> doing latin-1 decoding already.
>
> If different rows in the *same* column are allowed to have different
> encodings, then that's not a valid use of the operation (since the
> column converter has no access to the rest of the row to determine
> what encoding should be used for the decode operation).

Ditto.

Oscar
Re: [Python-Dev] PEP 461 updates
On 19 January 2014 00:39, Oscar Benjamin wrote:
> If you want to draw a relevant lesson from that thread in this one
> then the lesson argues against PEP 461: adding back the bytes
> formatting methods helps people who refuse to understand text
> processing and continue implementing dirty hacks instead of doing it
> properly.

Yes, that's why it has taken so long to even *consider* bringing binary interpolation support back - one of our primary concerns in the early days of Python 3 was developers (including core developers!) attempting to translate bad habits from Python 2 into Python 3 by continuing to treat binary data as text. Making interpolation a purely text domain operation helped strongly in enforcing this distinction, as it generally required thinking about encoding issues in order to get things into the text domain (or hitting them with the "latin-1" hammer, in which case... *sigh*).

The reason PEP 460/461 came up is that we *do* acknowledge that there is a legitimate use case for binary interpolation support when dealing with binary formats that contain ASCII compatible segments. Now that people have had a few years to get used to the Python 3 text model, lowering the barrier to migration from Python 2 and better handling that use case in Python 3 in general has finally tilted the scales in favour of providing the feature (assuming Guido is happy with PEP 461 after Ethan finishes the Rationale section).

(Tangent)

While I agree it's not relevant to the PEP 460/461 discussions, so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.

If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
If it is intended to allow S columns to contain text in arbitrary encodings, then that should also be supported by the current API with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing. However, if you're currently decoding S columns with latin-1 *before* passing the value to the converter, then you'll need to use a WSGI style decoding dance instead:

    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8")  # For example

That's more wasteful than just passing the raw bytes through for decoding, but is the simplest backwards compatible option if you're doing latin-1 decoding already.

If different rows in the *same* column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
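Both pieces above can be exercised directly; a small sketch (the sample bytes are mine, not from the thread):

```python
import codecs

# A codecs decoder works as a column converter: it maps raw bytes to text.
# getdecoder returns a callable of bytes -> (str, number_of_bytes_consumed).
decode_utf8 = codecs.getdecoder("utf-8")
text, consumed = decode_utf8(b'caf\xc3\xa9')
assert text == 'café' and consumed == 5

# The WSGI-style dance: undo an unwanted latin-1 decode, then decode properly.
def fix_encoding(text):
    return text.encode("latin-1").decode("utf-8")  # For example

mojibake = 'café'.encode('utf-8').decode('latin-1')  # wrongly decoded input
assert fix_encoding(mojibake) == 'café'
```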
Re: [Python-Dev] PEP 461 updates
On 17 January 2014 21:37, Chris Barker wrote:
> For the record, we've got a pretty good thread (not this good, though!)
> over on the numpy list about how to untangle the mess that has resulted
> from porting text-file-parsing code to py3 (and the underlying issue
> with the 'S' data type in numpy...)
>
> One note from the github issue:
> """
> The use of asbytes originates only from the fact that b'%d' % (20,) does
> not work.
> """
>
> So yeah PEP 461! (even if too late for numpy...)

The discussion about numpy.loadtxt and the 'S' dtype is not relevant to PEP 461. PEP 461 is about facilitating handling ascii/binary protocols and file formats. The loadtxt function is for reading text files. Reading text files is already handled very well in Python 3. The only caveat is that you need to specify the encoding when you open the file.

The loadtxt function doesn't specify the encoding when it opens the file so on Python 3 it gets the system default encoding when reading from the file. Since the 'S' dtype is for an array of bytes the loadtxt function has to encode the unicode strings before storing them in the array. The function has no idea what encoding the user wants so it just uses latin-1, leading to mojibake if the file content and encoding are not compatible with latin-1, e.g. utf-8.

The loadtxt function is a classic example of how *not* to do text, and whoever made it that way probably didn't understand unicode and the Python 3 text model. If they did understand what they were doing then they knew that they were implementing a dirty hack.

If you want to draw a relevant lesson from that thread in this one then the lesson argues against PEP 461: adding back the bytes formatting methods helps people who refuse to understand text processing and continue implementing dirty hacks instead of doing it properly.
Oscar
Re: [Python-Dev] PEP 461 updates
On 1/17/2014 4:37 PM, Chris Barker wrote:
> For the record, we've got a pretty good thread (not this good, though!)
> over on the numpy list about how to untangle the mess that has resulted
> from porting text-file-parsing code to py3 (and the underlying issue
> with the 'S' data type in numpy...)
>
> One note from the github issue:
> """
> The use of asbytes originates only from the fact that b'%d' % (20,)
> does not work.
> """
>
> So yeah PEP 461! (even if too late for numpy...)

Would they use "(u'%d' % (20,)).encode('ascii')" for that? Just curious as to what they're planning on doing.

Eric.
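For comparison, the encode-based workaround Eric suggests, next to the bytes interpolation PEP 461 restores (the latter requires Python 3.5+):

```python
# The Python 2/3-compatible workaround: format as text, then encode.
workaround = (u'%d' % (20,)).encode('ascii')
assert workaround == b'20'

# What PEP 461 brings back on Python 3.5+: interpolate on bytes directly.
assert b'%d' % (20,) == b'20'
```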
Re: [Python-Dev] PEP 461 updates
I hope you didn't mean to take this off-list:

On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer wrote:
> In gmane.comp.python.devel, you wrote:
>> For the record, we've got a pretty good thread (not this good, though!)
>> over on the numpy list about how to untangle the mess that has resulted
>
> Not sure about your definition of good. ;-)

Well, in the sense of "big" anyway...

> Could you summarize the main points on python-dev? I'm not feeling up to
> wading through another massive thread but I'm quite interested to hear
> the challenges that numpy deals with.

Well, not much new to it, really. But here's a re-cap:

numpy has had an 'S' dtype for a while, which corresponded to the py2 string type (except for being fixed length). So it could auto-convert to and from python strings... all was good and happy.

Enter py3: what to do? There is no py2 string type anymore. So it was decided to have the 'S' dtype correspond to the py3 bytes type. Apparently there was thought of renaming it, but the 'B' and 'b' type identifiers were already taken, so 'S' was kept.

However, as we all know in this thread, the py3 bytes type is not the same thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S' type for text data -- so that is kind of broken in py3. However, other folks use the 'S' type for binary data, so like (and rely on) it being mapped to the py3 bytes type. So we are stuck with that.

Given the nature of numpy, and scientific data, there is talk of having a one-byte-per-char text type in numpy (there is already a unicode type, but it uses 4 bytes per char, as it's key to the numpy data model that all objects of a given type are the same size). This would be analogous to the current multiple precision options for numbers. It would take up less memory, and would not be able to hold all values.
It's not clear what the level of support is for this right now -- after all, you can do everything you need to do with the appropriate calls to encode() and decode(), if a bit awkward.

Meanwhile, back at the ranch -- related, but separate, issues have arisen with the functions that parse text files: numpy.loadtxt and numpy.genfromtxt. These functions were adapted for py3 just enough to get things to mostly work, but have some serious limitations when doing anything with unicode -- and in fact do some weird things with plain ascii text files if you ask them to create unicode objects, which is a natural thing to do (and the "right" thing to do in the py3 text model) if you do something like:

    arr = loadtxt('a_file_name', dtype=str)

On py3, str is a py3 unicode string, so you get the numpy 'U' datatype. But loadtxt wasn't designed to deal with that, so you can get stuff like:

    ["b'C:UsersDocumentsProjectmytextfile1.txt'"
     "b'C:UsersDocumentsProjectmytextfile2.txt'"
     "b'C:UsersDocumentsProjectmytextfile3.txt'"]

This was (presumably -- I haven't debugged the code) due to conversion from bytes to unicode... (I'm still confused about the extra slashes.) And this is ascii text -- it gets worse if there is non-ascii text in there.

Anyway, the truth is, this stuff is hard, but it will get at least a touch easier with PEP 461. [Though to be truthful, I'm not sure why someone put a comment in the issue tracker about b'%d' % some_num being an issue... I'm not sure how it comes up when we're going from text to numbers, not the other way around...]

-Chris
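The b'...' wrappers (and doubled backslashes) are the classic symptom of calling str() on a bytes object in Python 3, which returns the repr rather than decoded text -- a plausible explanation for the output above, sketched here with a made-up path:

```python
# str() on bytes yields the repr, complete with the b'...' wrapper:
assert str(b'abc') == "b'abc'"

# Backslashes get escaped in the repr, which doubles them:
path = b'C:\\Users\\mytextfile1.txt'
assert str(path) == "b'C:\\\\Users\\\\mytextfile1.txt'"

# The fix is to decode explicitly instead:
assert path.decode('ascii') == 'C:\\Users\\mytextfile1.txt'
```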
Re: [Python-Dev] PEP 461 updates
For the record, we've got a pretty good thread (not this good, though!) over on the numpy list about how to untangle the mess that has resulted from porting text-file-parsing code to py3 (and the underlying issue with the 'S' data type in numpy...)

One note from the github issue:
"""
The use of asbytes originates only from the fact that b'%d' % (20,) does not work.
"""

So yeah PEP 461! (even if too late for numpy...)

-Chris
Re: [Python-Dev] PEP 461 updates
Steven D'Aprano writes:
> On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
>> "ASCII compatible" is a technical term in encodings, which means
>> "bytes in the range 0-127 always have ASCII coded character semantics,
>> do what you like with bytes in the range 128-255."[1]
>
> Examples, and counter-examples, may help. Let me see if I have got this
> right: an ASCII-compatible encoding may be an ASCII-superset like
> Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars
> are encoded to the same bytes as ASCII, and non-ASCII chars are not. A
> counter-example would be UTF-16, or some of the Asian encodings like
> Big5. Am I right so far?

All correct.

> But Nick isn't talking about an encoding, he's talking about a data
> format. I think that an ASCII-compatible format means one where (in at
> least *some* parts of the data) bytes between 0 and 127 have the same
> meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII
> character "T". This doesn't mean that every byte 84 means "T", only
> that some of them do -- hopefully well-defined sections of the data.
> Below, you introduce the term "ASCII segments" for these.

Yes, except that I believe Nick, as well as the "file-and-wire guys", strengthen "hopefully well-defined" to just "well-defined".

>> are designed for use *only* on bytes
>> that are ASCII segments; use on other data is likely to cause
>> hard-to-diagnose corruption.
>
> An example: if you have the byte b'\x63', calling upper() on that will
> return b'\x43'. That is only meaningful if the byte is intended as the
> ASCII character "c".

Good example.
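The encoding-level distinction is easy to check mechanically; a small self-contained illustration (examples mine):

```python
# ASCII characters map to the same single bytes in ASCII-compatible encodings:
for enc in ('ascii', 'utf-8', 'latin-1'):
    assert 'T'.encode(enc) == b'T'

# UTF-16 is a counter-example: even pure-ASCII text gets different bytes.
assert 'T'.encode('utf-16-le') == b'T\x00'

# And in UTF-8, non-ASCII characters never reuse bytes in the 0-127 range:
assert all(b >= 0x80 for b in 'é'.encode('utf-8'))
```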
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 9:46 PM, Nick Coghlan wrote:
> On 17 January 2014 11:51, Ethan Furman wrote:
>> On 01/16/2014 05:32 PM, Greg wrote:
>>> I don't think it matters whether the internal details of that
>>> debate make sense to the rest of us. The main thing is that
>>> a consensus seems to have been reached on bytes formatting
>>> being basically a good thing.
>>
>> And a good thing, too, on both counts! :)
>>
>> A few folks have suggested not implementing .format() on bytes; I've
>> been resistant, but then I remembered that format is also a function.
>>
>> http://docs.python.org/3/library/functions.html?highlight=ascii#format
>> ==
>> format(value[, format_spec])
>>
>> Convert a value to a “formatted” representation, as controlled by
>> format_spec. The interpretation of format_spec will depend on the type
>> of the value argument, however there is a standard formatting syntax
>> that is used by most built-in types: Format Specification
>> Mini-Language.
>>
>> The default format_spec is an empty string which usually gives the
>> same effect as calling str(value).
>>
>> A call to format(value, format_spec) is translated to
>> type(value).__format__(format_spec) which bypasses the instance
>> dictionary when searching for the value’s __format__() method. A
>> TypeError exception is raised if the method is not found or if either
>> the format_spec or the return value are not strings.
>> ==
>>
>> Given that, I can relent on .format and just go with .__mod__ . A
>> low-level service for a low-level protocol, what? ;)
>
> Exactly - while I'm a fan of the new extensible formatting system and
> strongly prefer it to printf-style formatting for text, it also has a
> whole lot of complexity that is hard to translate to the binary
> domain, including the format() builtin and __format__ methods. Since
> the relevant use cases appear to be already covered adequately by
> printf-style formatting, attempting to translate the flexible text
> formatting system as well just becomes additional complexity we don't
> need.
> I like Stephen Turnbull's suggestion of using "binary formats with
> ASCII segments" to distinguish the kind of formats we're talking about
> from ASCII compatible text encodings,

I liked that too, and almost said so on his posting, but will say it here instead.

> and I think Python 3.5 will end up with a suite of solutions that
> suitably covers all use cases, just by bringing back printf-style
> formatting directly to bytes:
>
> * format(), str.format(), str.format_map(): a rich extensible text
> formatting system, including date interpolation support
> * str.__mod__: retained primarily for backwards compatibility, may
> occasionally be used as a text formatting optimisation tool (since the
> inflexibility means it will likely always be marginally faster than
> the rich formatting system for the cases that it covers)
> * bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify
> production of data in variable length binary formats that contain
> ASCII segments
> * the struct module: rich (but not extensible) formatting system for
> fixed length binary formats

Adding format codes with variable length could extend the struct module to additional uses. C structs, on which it is modeled, often get around the difficulty of variable length items by defining one variable length item at the end, or by defining offsets in the fixed part to variable length parts that follow. Such a structure cannot presently be created by struct alone.

> In Python 2, the binary format with ASCII segments use case was
> intermingled with general purpose text formatting on the str type,
> which is I think the main reason it has taken us so long to convince
> ourselves it is something that is genuinely worth bringing back in a
> more limited form in Python 3, rather than just being something we
> wanted back because we were used to having it in Python 2.
>
> Cheers,
> Nick.
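The struct limitation can be sketched like this: a fixed header carrying the length of a trailing variable-length payload has to be assembled and parsed in two steps, since struct alone cannot express the whole record (the format and field names here are my own, purely illustrative):

```python
import struct

HEADER = '<4sI'  # 4-byte magic + payload length (illustrative layout)

payload = b'hello world'
record = struct.pack(HEADER, b'MAGI', len(payload)) + payload  # two steps

magic, length = struct.unpack_from(HEADER, record)
offset = struct.calcsize(HEADER)
body = record[offset:offset + length]
assert magic == b'MAGI' and body == payload
```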
Re: [Python-Dev] PEP 461 updates
On 17 January 2014 11:51, Ethan Furman wrote:
> On 01/16/2014 05:32 PM, Greg wrote:
>> I don't think it matters whether the internal details of that
>> debate make sense to the rest of us. The main thing is that
>> a consensus seems to have been reached on bytes formatting
>> being basically a good thing.
>
> And a good thing, too, on both counts! :)
>
> A few folks have suggested not implementing .format() on bytes; I've
> been resistant, but then I remembered that format is also a function.
>
> http://docs.python.org/3/library/functions.html?highlight=ascii#format
> ==
> format(value[, format_spec])
>
> Convert a value to a “formatted” representation, as controlled by
> format_spec. The interpretation of format_spec will depend on the type
> of the value argument, however there is a standard formatting syntax
> that is used by most built-in types: Format Specification Mini-Language.
>
> The default format_spec is an empty string which usually gives the same
> effect as calling str(value).
>
> A call to format(value, format_spec) is translated to
> type(value).__format__(format_spec) which bypasses the instance
> dictionary when searching for the value’s __format__() method. A
> TypeError exception is raised if the method is not found or if either
> the format_spec or the return value are not strings.
> ==
>
> Given that, I can relent on .format and just go with .__mod__ . A
> low-level service for a low-level protocol, what? ;)

Exactly - while I'm a fan of the new extensible formatting system and strongly prefer it to printf-style formatting for text, it also has a whole lot of complexity that is hard to translate to the binary domain, including the format() builtin and __format__ methods. Since the relevant use cases appear to be already covered adequately by printf-style formatting, attempting to translate the flexible text formatting system as well just becomes additional complexity we don't need.
I like Stephen Turnbull's suggestion of using "binary formats with ASCII segments" to distinguish the kind of formats we're talking about from ASCII compatible text encodings, and I think Python 3.5 will end up with a suite of solutions that suitably covers all use cases, just by bringing back printf-style formatting directly to bytes:

* format(), str.format(), str.format_map(): a rich extensible text formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may occasionally be used as a text formatting optimisation tool (since the inflexibility means it will likely always be marginally faster than the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify production of data in variable length binary formats that contain ASCII segments
* the struct module: rich (but not extensible) formatting system for fixed length binary formats

In Python 2, the binary format with ASCII segments use case was intermingled with general purpose text formatting on the str type, which is I think the main reason it has taken us so long to convince ourselves it is something that is genuinely worth bringing back in a more limited form in Python 3, rather than just being something we wanted back because we were used to having it in Python 2.

Cheers,
Nick.
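The bytes.__mod__ use case looks like this in practice on Python 3.5+ (the header line is a generic illustration, not taken from the thread):

```python
# Interpolating an ASCII segment into a binary message without a
# bytes -> str -> bytes round trip:
body = b'\x00\x01\x02'
header = b'Content-Length: %d\r\n\r\n' % (len(body),)
assert header + body == b'Content-Length: 3\r\n\r\n\x00\x01\x02'

# %s accepts bytes, and %b is the bytes-native spelling:
assert b'%s=%b' % (b'key', b'value') == b'key=value'
```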
Re: [Python-Dev] PEP 461 updates
On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
> Meta enough that I'll take Guido out of the CC.
>
> Nick Coghlan writes:
>> There are plenty of data formats (like SMTP and HTTP) that are
>> constrained to be ASCII compatible,
>
> "ASCII compatible" is a technical term in encodings, which means
> "bytes in the range 0-127 always have ASCII coded character semantics,
> do what you like with bytes in the range 128-255."[1]

Examples, and counter-examples, may help. Let me see if I have got this right: an ASCII-compatible encoding may be an ASCII-superset like Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars are encoded to the same bytes as ASCII, and non-ASCII chars are not. A counter-example would be UTF-16, or some of the Asian encodings like Big5. Am I right so far?

But Nick isn't talking about an encoding, he's talking about a data format. I think that an ASCII-compatible format means one where (in at least *some* parts of the data) bytes between 0 and 127 have the same meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII character "T". This doesn't mean that every byte 84 means "T", only that some of them do -- hopefully well-defined sections of the data. Below, you introduce the term "ASCII segments" for these.

> Worse, it's clearly confusing in this discussion. Let's stop using
> this term to mean
>
>     the data format has elements that are defined to contain only
>     bytes with ASCII coded character semantics
>
> (which is the relevant restriction AFAICS -- I don't know of any
> ASCII-compatible formats where the bytes 128-255 are used for any
> purpose other than encoding non-ASCII characters). OTOH, if it *is*
> an ASCII-compatible text encoding, the semantics are dubious if the
> bytes versions of many of these methods/operations are used.
> A documentation suggestion: It's easy enough to rewrite
>
>> constrained to be ASCII compatible, either globally, or locally in
>> the parts being manipulated by an application (such as a file
>> header). ASCII incompatible segments may be present, but in ways
>> that allow the data processing to handle them correctly.
>
> as
>
>     containing 'well-defined segments constrained to be (strictly)
>     ASCII-encoded' (aka ASCII segments).
>
> And then you can say
>
>     are designed for use *only* on bytes
>     that are ASCII segments; use on other data is likely to cause
>     hard-to-diagnose corruption.

An example: if you have the byte b'\x63', calling upper() on that will return b'\x43'. That is only meaningful if the byte is intended as the ASCII character "c".

> Footnotes:
> [1] "ASCII coded character semantics" is of course mildly ambiguous
> due to considerations like EOL conventions. But "you know what I'm
> talking about".

I think I know what you're talking about, but don't know for sure unless I explain it back to you.

-- Steven
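Steven's example is easy to verify, along with the way bytes.upper() quietly ignores anything outside the ASCII range (the latin-1 byte is my own addition):

```python
# b'\x63' is the ASCII encoding of "c"; upper() maps it to b'\x43' ("C"):
assert b'\x63' == b'c'
assert b'\x63'.upper() == b'\x43' == b'C'

# Outside ASCII, bytes.upper() is a silent no-op -- the latin-1 byte for
# 'é' stays as-is, which is exactly the hard-to-diagnose behaviour above:
assert b'\xe9'.upper() == b'\xe9'
assert 'é'.upper() == 'É'  # the str method, by contrast, knows the character
```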
Re: [Python-Dev] PEP 461 updates
Greg wrote:
> I don't think it matters whether the internal details of that
> debate make sense to the rest of us. The main thing is that
> a consensus seems to have been reached on bytes formatting
> being basically a good thing.

I've been mostly steering clear of the metaphysical and writing code today. ;-) An extremely rough patch has been uploaded:
http://bugs.python.org/issue20284

I have a new one almost ready that introduces __ascii__ rather than overloading __format__. I like it better, will upload to issue tracker soon.

Regards,
Neil
Re: [Python-Dev] PEP 461 updates
Greg writes:
> I don't think it matters whether the internal details of [the EIBTI
> vs. PBP] debate make sense to the rest of us. The main thing is
> that a consensus seems to have been reached on bytes formatting
> being basically a good thing.

I think some of it matters to the documentation.
Re: [Python-Dev] PEP 461 updates
Meta enough that I'll take Guido out of the CC.

Nick Coghlan writes:
> There are plenty of data formats (like SMTP and HTTP) that are
> constrained to be ASCII compatible,

"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1] Worse, it's clearly confusing in this discussion. Let's stop using this term to mean

    the data format has elements that are defined to contain only
    bytes with ASCII coded character semantics

(which is the relevant restriction AFAICS -- I don't know of any ASCII-compatible formats where the bytes 128-255 are used for any purpose other than encoding non-ASCII characters). OTOH, if it *is* an ASCII-compatible text encoding, the semantics are dubious if the bytes versions of many of these methods/operations are used.

A documentation suggestion: It's easy enough to rewrite

> constrained to be ASCII compatible, either globally, or locally in
> the parts being manipulated by an application (such as a file
> header). ASCII incompatible segments may be present, but in ways
> that allow the data processing to handle them correctly.

as

    containing 'well-defined segments constrained to be (strictly)
    ASCII-encoded' (aka ASCII segments).

And then you can say

    are designed for use *only* on bytes
    that are ASCII segments; use on other data is likely to cause
    hard-to-diagnose corruption.

If there are other use cases for "ASCII-compatible data formats" as defined above (not worrying about codecs, because they are a very small minority of code-to-be-written at this point), I don't know about them. Does anyone? If there are any, I'll be happy to revise. If not, that seems to be a precise and intelligible statement of the restrictions that is useful to the practical use cases.
And nothing stops users who think they know what they're doing from using them in other contexts (which can be documented if they turn out to be broadly useful). Footnotes: [1] "ASCII coded character semantics" is of course mildly ambiguous due to considerations like EOL conventions. But "you know what I'm talking about".
Re: [Python-Dev] PEP 461 updates
On 01/16/2014 05:32 PM, Greg wrote: I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing. And a good thing, too, on both counts! :) A few folks have suggested not implementing .format() on bytes; I've been resistant, but then I remembered that format is also a function. http://docs.python.org/3/library/functions.html?highlight=ascii#format == format(value[, format_spec]) Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec will depend on the type of the value argument, however there is a standard formatting syntax that is used by most built-in types: Format Specification Mini-Language. The default format_spec is an empty string which usually gives the same effect as calling str(value). A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is not found or if either the format_spec or the return value are not strings. == Given that, I can relent on .format and just go with .__mod__ . A low-level service for a low-level protocol, what? ;) -- ~Ethan~
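The delegation described in the docs quoted above is easy to see in practice. A minimal illustration (the Celsius class is made up for this example):

```python
# The format() builtin delegates to type(value).__format__, bypassing
# the instance dictionary, so a class controls its own formatted form.
class Celsius:
    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Reuse float formatting for the numeric part, then tag the unit.
        return format(self.degrees, spec or '.1f') + '\N{DEGREE SIGN}C'

t = Celsius(36.6)
print(format(t))           # 36.6°C
print(format(t, '08.3f'))  # 0036.600°C

# The lookup goes through the type, not the instance:
assert format(t) == type(t).__format__(t, '')
```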
Re: [Python-Dev] PEP 461 updates
On 17/01/2014 10:18 a.m., Terry Reedy wrote: On 1/16/2014 5:11 AM, Nick Coghlan wrote: Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, Nick's initial arguments against bytes formatting were very abstract and philosophical, along the lines that it violated some pure mental model of text/bytes separation. Then Guido said something that Nick took to be an equal and opposite philosophical argument that cancelled out his original objections, and he withdrew them. I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing. -- Greg
Re: [Python-Dev] PEP 461 updates
On 17 Jan 2014 09:36, "Terry Reedy" wrote: > > On 1/16/2014 4:59 PM, Guido van Rossum wrote: > >> I'm getting tired of "did you understand what I said". > > > I was asking whether I needed to repeat myself, but forget that. > I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it. There are plenty of data formats (like SMTP and HTTP) that are constrained to be ASCII compatible, either globally, or locally in the parts being manipulated by an application (such as a file header). ASCII incompatible segments may be present, but in ways that allow the data processing to handle them correctly. The ASCII assuming methods on bytes objects are there to help in dealing with that kind of data. If the binary data is just one large block in a single text encoding, it's generally easier to just decode it to text, but multipart formats generally don't allow that.
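The pattern Nick describes, an ASCII-constrained header wrapped around arbitrary binary payload, can be sketched like this (the message layout below is invented for illustration, not a real protocol):

```python
# A format whose *header* section is constrained to ASCII while the
# payload is arbitrary binary data: only the header is decoded, and
# the bytes methods (split, partition) do the ASCII-assuming work.
message = (b'Content-Length: 4\r\n'
           b'Content-Type: application/octet-stream\r\n'
           b'\r\n'
           b'\xde\xad\xbe\xef')

header_blob, payload = message.split(b'\r\n\r\n', 1)
headers = {}
for line in header_blob.split(b'\r\n'):
    name, _, value = line.partition(b': ')
    # Only the header section is guaranteed ASCII, so only it is decoded.
    headers[name.decode('ascii')] = value.decode('ascii')

print(headers)   # {'Content-Length': '4', 'Content-Type': 'application/octet-stream'}
print(payload)   # b'\xde\xad\xbe\xef' -- stays as raw bytes
```

Decoding the whole message with a single codec would either fail or corrupt the payload, which is exactly why these bytes methods exist.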
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 4:59 PM, Guido van Rossum wrote: I'm getting tired of "did you understand what I said". I was asking whether I needed to repeat myself, but forget that. I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it.
Re: [Python-Dev] PEP 461 updates
On Thu, Jan 16, 2014 at 1:18 PM, Terry Reedy wrote: > On 1/16/2014 5:11 AM, Nick Coghlan wrote: > >> Guido's successful counter was to point out that the parsing of the >> format string itself assumes ASCII compatible data, > > Did you see my explanation, which I wrote in response to one of your earlier > posts, of why I think "the parsing of the format string itself assumes ASCII > compatible data" that statement is confused and wrong? The above seems to > say that what I wrote is impossible, but perhaps I misunderstand what Guido > and you mean. Among my questions are "by data, do you mean interpolated > objects or interpolated bytes?" and "what restriction on 'data' do you > intend by 'ASCII compatible'?". Can you move the meta-discussion off-list? I'm getting tired of "did you understand what I said". -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 5:11 AM, Nick Coghlan wrote: Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, Did you see my explanation, which I wrote in response to one of your earlier posts, of why I think the statement "the parsing of the format string itself assumes ASCII compatible data" is confused and wrong? The above seems to say that what I wrote is impossible, but perhaps I misunderstand what Guido and you mean. Among my questions are "by data, do you mean interpolated objects or interpolated bytes?" and "what restriction on 'data' do you intend by 'ASCII compatible'?". -- Terry Jan Reedy
Re: [Python-Dev] PEP 461 updates
Carl Meyer wrote: > I think the PEP could really use a rationale section summarizing _why_ > these formatting operations are being added to bytes I agree. My attempt at re-writing the PEP is below. >> In order to avoid the problems of auto-conversion and >> value-generated exceptions, all object checking will be done via >> isinstance, not by values contained in a Unicode representation. >> In other words:: >> >> - duck-typing to allow/reject entry into a byte-stream >> - no value generated errors > > This seems self-contradictory; "isinstance" is type-checking, which is > the opposite of duck-typing. Again, I agree. We should avoid isinstance checks if possible. Abstract This PEP proposes adding %-interpolation to the bytes object. Rationale A disruptive but useful change introduced in Python 3.0 was the clean separation of byte strings (i.e. the "bytes" object) from character strings (i.e. the "str" object). The benefit is that character encodings must be explicitly specified and the risk of corrupting character data is reduced. Unfortunately, this separation has made writing certain types of programs more complicated and verbose. For example, programs that deal with network protocols often manipulate ASCII encoded strings. Since the "bytes" type does not support string formatting, extra encoding and decoding between the "str" type is required. For simplicity and convenience it is desirable to introduce formatting methods to "bytes" that allow formatting of ASCII-encoded character data. This change would blur the clean separation of byte strings and character strings. However, it is felt that the practical benefits outweigh the purity costs. The implicit assumption of ASCII-encoding would be limited to formatting methods. One source of many problems with the Python 2 Unicode implementation is the implicit coercion of Unicode character strings into byte strings using the "ascii" codec. If the character strings contained only ASCII characters, all was well. 
However, if the string contained a non-ASCII character then coercion caused an exception. The combination of implicit coercion and value dependent failures has proven to be a recipe for hard to debug errors. A program may seem to work correctly when tested (e.g. string input that happened to be ASCII only) but later would fail, often with a traceback far from the source of the real error. The formatting methods for bytes() should avoid this problem by not implicitly encoding data that might fail based on the content of the data. Another desirable feature is to allow arbitrary user classes to be used as formatting operands. Generally this is done by introducing a special method that can be implemented by the new class. Proposed semantics for bytes formatting === Special method __ascii__ A new special method, analogous to __format__, is introduced. This method takes a single argument, a format specifier. The return value is a bytes object. Objects that have an ASCII-only representation can implement this method to allow them to be used as format operands. Objects with natural byte representations should implement __bytes__ or the Py_buffer API. %-interpolation --- All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers. To avoid having to introduce two special methods, the format specifications will be translated to equivalent __format__ specifiers and the __ascii__ method of each argument will be called. Example:: >>> b'%4x' % 10 b' a' %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. Example: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' %s is restricted in what it will accept:: - input type supports Py_buffer or has __bytes__? use it to collect the necessary bytes (may contain non-ASCII characters) - input type is something else? 
use its __ascii__ method; if there isn't one, raise TypeError Examples: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 b'3.14' >>> b'%4s' % 12 b' 12' >>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __ascii__ method, perhaps you need to encode it? .. note:: Because the str type does not have a __ascii__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence:: 'a string'.encode('latin-1') Unsupported % format codes ^^ %r (which calls __repr__) is not supported format -- The format() method will not be implemented at this time but may be added in a later Python release. The __ascii__ method is
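The %s dispatch rule proposed above (buffer/__bytes__ first, then a fallback hook) could be sketched in pure Python. Note that __ascii__ is only a *proposal* in this thread and never became part of Python; the function below is purely illustrative:

```python
# Pure-Python sketch of the proposed %s dispatch. The __ascii__ hook
# is hypothetical (from this draft PEP), not a real Python protocol.
def interpolate_s(value):
    # 1. Buffer-supporting or __bytes__-supporting input: take raw bytes
    #    (these may contain non-ASCII byte values).
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    if hasattr(type(value), '__bytes__'):
        return bytes(value)
    # 2. Anything else: fall back to the hypothetical __ascii__ hook.
    ascii_hook = getattr(type(value), '__ascii__', None)
    if ascii_hook is not None:
        return ascii_hook(value, '')
    raise TypeError('%r has no __ascii__ method, perhaps you need '
                    'to encode it?' % (value,))

class Pi:
    # An object with an ASCII-only representation, per the proposal.
    def __ascii__(self, spec):
        return b'3.14'

print(interpolate_s(b'abc'))  # b'abc'
print(interpolate_s(Pi()))    # b'3.14'
```

A plain str falls through both branches and raises TypeError, matching the draft's refusal to implicitly encode text.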
Re: [Python-Dev] PEP 461 updates
On 01/16/2014 04:49 AM, Michael Urman wrote: On Thu, Jan 16, 2014 at 1:52 AM, Ethan Furman wrote: Is this an intended exception to the overriding principle? Hmm, thanks for spotting that. Yes, that would be a value error if anything over 255 is used, both currently in Py2, and for bytes in Py3. As Carl suggested, a little more explanation is needed in the PEP. FYI, note that str/unicode already has another value-dependent exception with %c. I find the message surprising, as I wasn't aware Python had a 'char' type: '%c' % 'a' 'a' '%c' % 'abc' Traceback (most recent call last): File "", line 1, in TypeError: %c requires int or char Python doesn't have a char type, it has str's of length 1... which are usually referred to as char's. ;) -- ~Ethan~
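The str behavior Michael quotes is easy to reproduce in Python 3: %c accepts an int (a code point) or a length-1 str, and rejects anything longer with a TypeError:

```python
# str %c: accepts an int code point or a length-1 string.
assert '%c' % 97 == 'a'
assert '%c' % 'a' == 'a'

# A longer string is rejected, with the message quoted above.
try:
    '%c' % 'abc'
except TypeError as exc:
    print(exc)   # "%c requires int or char"
else:
    raise AssertionError('expected TypeError')
```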
Re: [Python-Dev] PEP 461 updates
On 16 Jan 2014 11:45, "Carl Meyer" wrote: > > Hi Ethan, > > I haven't chimed into this discussion, but the direction it's headed > recently seems right to me. Thanks for putting together a PEP. Some > comments on it: > > On 01/15/2014 05:13 PM, Ethan Furman wrote: > > > > Abstract > > > > > > This PEP proposes adding the % and {} formatting operations from str to > > bytes [1]. > > I think the PEP could really use a rationale section summarizing _why_ > these formatting operations are being added to bytes; namely that they > are useful when working with various ASCIIish-but-not-properly-text > network protocols and file formats, and in particular when porting code > dealing with such formats/protocols from Python 2. > > Also I think it would be useful to have a section summarizing the > primary objections that have been raised, and why those objections have > been overruled (presuming the PEP is accepted). For instance: the main > objection, AIUI, has been that the bytes type is for pure bytes-handling > with no assumptions about encoding, and thus we should not add features > to it that assume ASCIIness, and that may be attractive nuisances for > people writing bytes-handling code that should not assume ASCIIness but > will once they use the feature. Close, but not quite - the concern was that this was a feature that didn't *inherently* imply a restriction to ASCII compatible data, but only did so when the numeric formatting codes were used. This made it a source of value dependent compatibility errors based on the format string, akin to the kind of value dependent errors seen when implicitly encoding arbitrary text as ASCII. Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, thus placing at least the mod-formatting operation in the same category as the currently existing valid-for-sufficiently-ASCII-compatible-data only operations. 
Current discussions suggest to me that the argument against implicit encoding operations that introduce latent data driven defects may still apply to bytes.format though, so I've reverted to being -1 on that. Cheers, Nick. >And the refutation: that the bytes type > already provides some operations that assume ASCIIness, and these new > formatting features are no more of an attractive nuisance than those; > since the syntax of the formatting mini-languages themselves itself > assumes ASCIIness, there is not likely to be any temptation to use it > with binary data that cannot. > > Although it can be hard to arrive at accurate and agreed-on summaries of > the discussion, recording such summaries in the PEP is important; it may > help save our future selves and colleagues from having to revisit all > these same discussions and megathreads. > > > Overriding Principles > > = > > > > In order to avoid the problems of auto-conversion and value-generated > > exceptions, > > all object checking will be done via isinstance, not by values contained > > in a > > Unicode representation. In other words:: > > > > - duck-typing to allow/reject entry into a byte-stream > > - no value generated errors > > This seems self-contradictory; "isinstance" is type-checking, which is > the opposite of duck-typing. A duck-typing implementation would not use > isinstance, it would call / check for the existence of a certain magic > method instead. > > I think it might also be good to expand (very) slightly on what "the > problems of auto-conversion and value-generated exceptions" are; that > is, that the benefit of Python 3's model is that encoding is explicit, > not implicit, making it harder to unwittingly write code that works as > long as all data is ASCII, but fails as soon as someone feeds in > non-ASCII text data. 
> > Not everyone who reads this PEP will be steeped in years of discussion > about the relative merits of the Python 2 vs 3 models; it doesn't hurt > to spell out a few assumptions. > > > > Proposed semantics for bytes formatting > > === > > > > %-interpolation > > --- > > > > All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) > > will be supported, and will work as they do for str, including the > > padding, justification and other related modifiers, except locale. > > > > Example:: > > > >>>> b'%4x' % 10 > >b' a' > > > > %c will insert a single byte, either from an int in range(256), or from > > a bytes argument of length 1. > > > > Example: > > > > >>> b'%c' % 48 > > b'0' > > > > >>> b'%c' % b'a' > > b'a' > > > > %s is restricted in what it will accept:: > > > > - input type supports Py_buffer? > > use it to collect the necessary bytes > > > > - input type is something else? > > use its __bytes__ method; if there isn't one, raise an exception [2] > > > > Examples: > > > > >>> b'%s' % b'abc' > > b'abc' > > > > >>> b'%s' % 3.14 > > Trac
Re: [Python-Dev] PEP 461 updates
On 01/15/2014 06:12 PM, Glenn Linderman wrote: On 1/15/2014 4:13 PM, Ethan Furman wrote: - no value generated errors ... %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. what does x = 354 b"%c" % x produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2). Is this an intended exception to the overriding principle? Hmm, thanks for spotting that. Yes, that would be a value error if anything over 255 is used, both currently in Py2, and for bytes in Py3. As Carl suggested, a little more explanation is needed in the PEP. -- ~Ethan~
Re: [Python-Dev] PEP 461 updates
On 01/15/2014 05:17 PM, Carl Meyer wrote: I think the PEP could really use a rationale section It will have one before it's done. Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled Excellent point. That section will also be present. In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words:: - duck-typing to allow/reject entry into a byte-stream - no value generated errors This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. Good point, I'll reword that. It will be duck-typing. I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are Will do. .. [2] TypeError, ValueError, or UnicodeEncodeError? TypeError seems right to me. Definitely not UnicodeEncodeError - refusal to implicitly encode is not at all the same thing as an encoding error. That's the direction I'm leaning, too. Thanks for your comments! -- ~Ethan~
Re: [Python-Dev] PEP 461 updates
Glenn Linderman wrote: x = 354 b"%c" % x Is this an intended exception to the overriding principle? I think it's an unavoidable one, unless we want to introduce an "integer in the range 0-255" type. But that would just push the problem into another place, since b"%c" % byte(x) would then blow up on byte(x) if x were out of range. If you really want to make sure it won't crash, you can always do b"%c" % (x & 0xff) or whatever your favourite method of mangling out-of-range ints is. -- Greg
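Greg's masking trick works as-is with the bytes %-formatting that eventually shipped (Python 3.5+, via PEP 461): the & 0xff forces the value into range(256), so the operation cannot fail on an out-of-range int:

```python
# Mask the int into range(256) before handing it to %c, so the
# interpolation cannot raise on out-of-range values.
x = 354
print(b'%c' % (x & 0xff))   # b'b'  (354 & 0xff == 98 == ord('b'))
```

Whether silently wrapping the value is better than a loud error is, of course, the heart of the disagreement in this thread.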
Re: [Python-Dev] PEP 461 updates
Surprisingly, in this case the exception is just what the doctor ordered. :-) On Wed, Jan 15, 2014 at 6:12 PM, Glenn Linderman wrote: > On 1/15/2014 4:13 PM, Ethan Furman wrote: > > - no value generated errors > > ... > > > %c will insert a single byte, either from an int in range(256), or from > a bytes argument of length 1. > > > what does > > x = 354 > b"%c" % x > > produce? Seems that construct produces a value dependent error in both > python 2 & 3 (although it takes a much bigger value to produce the error in > python 3, with str %... with bytes %, the problem will be reached at 256, > just like python 2). > > Is this an intended exception to the overriding principle? -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 461 updates
On 1/15/2014 4:13 PM, Ethan Furman wrote: - no value generated errors ... %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. what does x = 354 b"%c" % x produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2). Is this an intended exception to the overriding principle?
Re: [Python-Dev] PEP 461 updates
Hi Ethan, I haven't chimed into this discussion, but the direction it's headed recently seems right to me. Thanks for putting together a PEP. Some comments on it: On 01/15/2014 05:13 PM, Ethan Furman wrote: > > Abstract > > > This PEP proposes adding the % and {} formatting operations from str to > bytes [1]. I think the PEP could really use a rationale section summarizing _why_ these formatting operations are being added to bytes; namely that they are useful when working with various ASCIIish-but-not-properly-text network protocols and file formats, and in particular when porting code dealing with such formats/protocols from Python 2. Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled (presuming the PEP is accepted). For instance: the main objection, AIUI, has been that the bytes type is for pure bytes-handling with no assumptions about encoding, and thus we should not add features to it that assume ASCIIness, and that may be attractive nuisances for people writing bytes-handling code that should not assume ASCIIness but will once they use the feature. And the refutation: that the bytes type already provides some operations that assume ASCIIness, and these new formatting features are no more of an attractive nuisance than those; since the syntax of the formatting mini-languages themselves assumes ASCIIness, there is not likely to be any temptation to use it with binary data that cannot. Although it can be hard to arrive at accurate and agreed-on summaries of the discussion, recording such summaries in the PEP is important; it may help save our future selves and colleagues from having to revisit all these same discussions and megathreads. > Overriding Principles > = > > In order to avoid the problems of auto-conversion and value-generated > exceptions, > all object checking will be done via isinstance, not by values contained > in a > Unicode representation. 
In other words:: > > - duck-typing to allow/reject entry into a byte-stream > - no value generated errors This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. A duck-typing implementation would not use isinstance, it would call / check for the existence of a certain magic method instead. I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are; that is, that the benefit of Python 3's model is that encoding is explicit, not implicit, making it harder to unwittingly write code that works as long as all data is ASCII, but fails as soon as someone feeds in non-ASCII text data. Not everyone who reads this PEP will be steeped in years of discussion about the relative merits of the Python 2 vs 3 models; it doesn't hurt to spell out a few assumptions. > Proposed semantics for bytes formatting > === > > %-interpolation > --- > > All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) > will be supported, and will work as they do for str, including the > padding, justification and other related modifiers, except locale. > > Example:: > >>>> b'%4x' % 10 >b' a' > > %c will insert a single byte, either from an int in range(256), or from > a bytes argument of length 1. > > Example: > > >>> b'%c' % 48 > b'0' > > >>> b'%c' % b'a' > b'a' > > %s is restricted in what it will accept:: > > - input type supports Py_buffer? > use it to collect the necessary bytes > > - input type is something else? > use its __bytes__ method; if there isn't one, raise an exception [2] > > Examples: > > >>> b'%s' % b'abc' > b'abc' > > >>> b'%s' % 3.14 > Traceback (most recent call last): > ... > TypeError: 3.14 has no __bytes__ method > > >>> b'%s' % 'hello world!' > Traceback (most recent call last): > ... > TypeError: 'hello world' has no __bytes__ method, perhaps you need > to encode it? > > .. 
note:: > >Because the str type does not have a __bytes__ method, attempts to >directly use 'a string' as a bytes interpolation value will raise an >exception. To use 'string' values, they must be encoded or otherwise >transformed into a bytes sequence:: > > 'a string'.encode('latin-1') > > format > -- > > The format mini language codes, where they correspond with the > %-interpolation codes, > will be used as-is, with three exceptions:: > > - !s is not supported, as {} can mean the default for both str and > bytes, in both > Py2 and Py3. > - !b is supported, and new Py3k code can use it to be explicit. > - no other __format__ method will be called. > > Numeric Format Codes > > > To properly handle int and float subclasses, int(), index(), and float() > will be called on the > obje
[Python-Dev] PEP 461 updates
Current copy of PEP, many modifications from all the feedback. Thank you to everyone. I know it's been a long week (feels a lot longer!) while all this was hammered out, but I think we're getting close! Abstract This PEP proposes adding the % and {} formatting operations from str to bytes [1]. Overriding Principles = In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words:: - duck-typing to allow/reject entry into a byte-stream - no value generated errors Proposed semantics for bytes formatting === %-interpolation --- All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers, except locale. Example:: >>> b'%4x' % 10 b' a' %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. Example: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' %s is restricted in what it will accept:: - input type supports Py_buffer? use it to collect the necessary bytes - input type is something else? use its __bytes__ method; if there isn't one, raise an exception [2] Examples: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method >>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it? .. note:: Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. 
To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence:: 'a string'.encode('latin-1') format -- The format mini language codes, where they correspond with the %-interpolation codes, will be used as-is, with three exceptions:: - !s is not supported, as {} can mean the default for both str and bytes, in both Py2 and Py3. - !b is supported, and new Py3k code can use it to be explicit. - no other __format__ method will be called. Numeric Format Codes To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G). Unsupported codes - %r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported. !r and !a are not supported. The n integer and float format code is not supported. Open Questions == Currently non-numeric objects go through:: - Py_buffer - __bytes__ - failure Do we want to add a __format_bytes__ method in there? - Guaranteed to produce only ascii (as in b'10', not b'\x0a') - Makes more sense than using __bytes__ to produce ascii output - What if an object has both __bytes__ and __format_bytes__? Do we need to support all the numeric format codes? The floating point exponential formats seem less appropriate, for example. Proposed variations === It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded. It has been suggested to use %b for bytes instead of %s. - Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s. It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s. - Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed. It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'"). 
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed. Footnotes = .. [1] string.Template is not under consideration. .. [2] TypeError, ValueError, or UnicodeEncodeError? == -- ~Ethan~
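For reference, bytes %-formatting did eventually ship in Python 3.5 via PEP 461, though the final spec differs from this draft in places (e.g. %s falls back to __bytes__ rather than a dedicated ASCII hook, %b and %a were added, and bytes.format() was not). A quick check of the draft's examples against the shipped behavior:

```python
# Behavior of bytes %-interpolation as it shipped in Python 3.5+.
assert b'%4x' % 10 == b'   a'      # numeric codes work as for str
assert b'%c' % 48 == b'0'          # %c from an int in range(256)
assert b'%c' % b'a' == b'a'        # ...or from a length-1 bytes
assert b'%s' % b'abc' == b'abc'    # %s takes bytes-like input

# str still refuses to interpolate implicitly, as the PEP intended:
try:
    b'%s' % 'hello world!'
except TypeError:
    pass

# Explicit encoding is the sanctioned path for text values:
assert b'%s' % 'hello'.encode('latin-1') == b'hello'
```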