Re: [Python-Dev] PEP 461 updates
On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin wrote:
>> long as numpy.loadtxt is explicitly documented as only working with
>> latin-1 encoded files (it currently isn't), there's no problem.
>
> Actually there is a problem. If it explicitly specified the encoding as
> latin-1 when opening the file then it could document the fact that it
> works for latin-1 encoded files. However it actually uses the system
> default encoding to read the file

which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well thought out one, but rather what fell out of trying to make it "just work" on py3.

> and then converts the strings to bytes with the as_bytes function that
> is hard-coded to use latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> So it only works if the system default encoding is latin-1 and the
> file content is white-space and newline compatible with latin-1.
> Regardless of whether the file itself is in utf-8 or latin-1 it will
> only work if the system default encoding is latin-1. I've never used a
> system that had latin-1 as the default encoding (unless you count
> cp1252 as latin-1).

Even if it were a common default it would be a "bad idea". Fortunately(?), it really is broken, so we can fix it without being too constrained by backwards compatibility.

>> If it's supposed to work with other encodings (but the entire file is
>> still required to use a consistent encoding), then it just needs
>> encoding and errors arguments to fit the Python 3 text model (with
>> "latin-1" documented as the default encoding).
>
> This is the right solution. Have an encoding argument, document the
> fact that it will use the system default encoding if none is
> specified, and re-encode using the same encoding to fit any dtype='S'
> bytes column. This will then work for any encoding including the ones
> that aren't ASCII-compatible (e.g. utf-16).

Exactly, except I don't think the system encoding is a good choice for a default. If there is a default, most people will use it, and it will work for a lot of their test code. Then it will break if the code is passed to a system with a different default encoding, or a file comes from another source in a different encoding. That is very, very likely -- far more likely than files consistently being in the system encoding.

>> default behaviour, since passing something like
>> codecs.getdecoder("utf-8") as a column converter should do the right
>> thing.

That seems to work at the moment, actually, if done with care.

> That's just getting silly IMO. If the file uses mixed encodings then I
> don't consider it a valid "text file" and see no reason for loadtxt to
> support reading it.

Agreed -- that's just getting crazy. The only use case I can imagine is cleaning up a file that got mojibaked by some other process -- not really the use case for loadtxt and friends.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R   (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115   (206) 526-6317 main reception
chris.bar...@noaa.gov

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 461 updates
On 19 January 2014 06:19, Nick Coghlan wrote:
> While I agree it's not relevant to the PEP 460/461 discussions, so
> long as numpy.loadtxt is explicitly documented as only working with
> latin-1 encoded files (it currently isn't), there's no problem.

Actually there is a problem. If it explicitly specified the encoding as latin-1 when opening the file then it could document the fact that it works for latin-1 encoded files. However it actually uses the system default encoding to read the file and then converts the strings to bytes with the as_bytes function that is hard-coded to use latin-1:
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28

So it only works if the system default encoding is latin-1 and the file content is white-space and newline compatible with latin-1. Regardless of whether the file itself is in utf-8 or latin-1 it will only work if the system default encoding is latin-1. I've never used a system that had latin-1 as the default encoding (unless you count cp1252 as latin-1).

> If it's supposed to work with other encodings (but the entire file is
> still required to use a consistent encoding), then it just needs
> encoding and errors arguments to fit the Python 3 text model (with
> "latin-1" documented as the default encoding).

This is the right solution. Have an encoding argument, document the fact that it will use the system default encoding if none is specified, and re-encode using the same encoding to fit any dtype='S' bytes column. This will then work for any encoding including the ones that aren't ASCII-compatible (e.g. utf-16). Then instead of having a compat module with an as_bytes helper to get rid of all the unicode strings on Python 3, you can have a compat module with an open_unicode helper to do the right thing on Python 2.
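To make the failure mode concrete, here is a minimal sketch (the as_bytes here mirrors the hard-coded latin-1 helper linked above; the sample text is illustrative, not from the thread):

```python
def as_bytes(s):
    # Mirrors numpy.compat.py3k.as_bytes: latin-1 is hard-coded.
    return s.encode('latin-1')

raw = 'naïve'.encode('utf-8')      # the file on disk is utf-8
decoded = raw.decode('latin-1')    # but it is read with the wrong codec
assert as_bytes(decoded) == raw    # the raw bytes survive the round trip...
assert decoded != 'naïve'          # ...but the decoded *text* is mojibake
```

Latin-1 round-trips any byte value, which is exactly why the bug stays hidden until someone looks at the decoded text.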
The as_bytes function is just a way of fighting the Python 3 text model: "I don't care about mojibake, just do whatever it takes to shut up the interpreter and its error messages and make sure it works for ASCII data."

> If it is intended to allow S columns to contain text in arbitrary
> encodings, then that should also be supported by the current API with
> an adjustment to the default behaviour, since passing something like
> codecs.getdecoder("utf-8") as a column converter should do the right
> thing. However, if you're currently decoding S columns with latin-1
> *before* passing the value to the converter, then you'll need to use a
> WSGI style decoding dance instead:
>
>     def fix_encoding(text):
>         return text.encode("latin-1").decode("utf-8")  # For example

That's just getting silly IMO. If the file uses mixed encodings then I don't consider it a valid "text file" and see no reason for loadtxt to support reading it.

> That's more wasteful than just passing the raw bytes through for
> decoding, but is the simplest backwards compatible option if you're
> doing latin-1 decoding already.
>
> If different rows in the *same* column are allowed to have different
> encodings, then that's not a valid use of the operation (since the
> column converter has no access to the rest of the row to determine
> what encoding should be used for the decode operation).

Ditto.

Oscar
Re: [Python-Dev] PEP 461 updates
On 19 January 2014 00:39, Oscar Benjamin wrote:
> If you want to draw a relevant lesson from that thread in this one
> then the lesson argues against PEP 461: adding back the bytes
> formatting methods helps people who refuse to understand text
> processing and continue implementing dirty hacks instead of doing it
> properly.

Yes, that's why it has taken so long to even *consider* bringing binary interpolation support back - one of our primary concerns in the early days of Python 3 was developers (including core developers!) attempting to translate bad habits from Python 2 into Python 3 by continuing to treat binary data as text. Making interpolation a purely text domain operation helped strongly in enforcing this distinction, as it generally required thinking about encoding issues in order to get things into the text domain (or hitting them with the "latin-1" hammer, in which case... *sigh*).

The reason PEP 460/461 came up is that we *do* acknowledge that there is a legitimate use case for binary interpolation support when dealing with binary formats that contain ASCII compatible segments. Now that people have had a few years to get used to the Python 3 text model, lowering the barrier to migration from Python 2 and better handling that use case in Python 3 in general has finally tilted the scales in favour of providing the feature (assuming Guido is happy with PEP 461 after Ethan finishes the Rationale section).

(Tangent)

While I agree it's not relevant to the PEP 460/461 discussions, so long as numpy.loadtxt is explicitly documented as only working with latin-1 encoded files (it currently isn't), there's no problem.

If it's supposed to work with other encodings (but the entire file is still required to use a consistent encoding), then it just needs encoding and errors arguments to fit the Python 3 text model (with "latin-1" documented as the default encoding).
If it is intended to allow S columns to contain text in arbitrary encodings, then that should also be supported by the current API with an adjustment to the default behaviour, since passing something like codecs.getdecoder("utf-8") as a column converter should do the right thing. However, if you're currently decoding S columns with latin-1 *before* passing the value to the converter, then you'll need to use a WSGI style decoding dance instead:

    def fix_encoding(text):
        return text.encode("latin-1").decode("utf-8")  # For example

That's more wasteful than just passing the raw bytes through for decoding, but is the simplest backwards compatible option if you're doing latin-1 decoding already.

If different rows in the *same* column are allowed to have different encodings, then that's not a valid use of the operation (since the column converter has no access to the rest of the row to determine what encoding should be used for the decode operation).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
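Both pieces above can be exercised directly; a small sketch (the sample bytes are mine, not from the thread):

```python
import codecs

# A codecs decoder works as a column converter: it maps raw bytes to text.
# getdecoder returns a callable of bytes -> (str, number_of_bytes_consumed).
decode_utf8 = codecs.getdecoder("utf-8")
text, consumed = decode_utf8(b'caf\xc3\xa9')
assert text == 'café' and consumed == 5

# The WSGI-style dance: undo an unwanted latin-1 decode, then decode properly.
def fix_encoding(text):
    return text.encode("latin-1").decode("utf-8")  # For example

mojibake = 'café'.encode('utf-8').decode('latin-1')  # wrongly decoded input
assert fix_encoding(mojibake) == 'café'
```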
Re: [Python-Dev] PEP 461 updates
On 17 January 2014 21:37, Chris Barker wrote:
> For the record, we've got a pretty good thread (not this good, though!)
> over on the numpy list about how to untangle the mess that has resulted
> from porting text-file-parsing code to py3 (and the underlying issue
> with the 'S' data type in numpy...)
>
> One note from the github issue:
> """
> The use of asbytes originates only from the fact that b'%d' % (20,) does
> not work.
> """
>
> So yeah PEP 461! (even if too late for numpy...)

The discussion about numpy.loadtxt and the 'S' dtype is not relevant to PEP 461. PEP 461 is about facilitating handling ascii/binary protocols and file formats. The loadtxt function is for reading text files. Reading text files is already handled very well in Python 3. The only caveat is that you need to specify the encoding when you open the file.

The loadtxt function doesn't specify the encoding when it opens the file so on Python 3 it gets the system default encoding when reading from the file. Since the 'S' dtype is for an array of bytes the loadtxt function has to encode the unicode strings before storing them in the array. The function has no idea what encoding the user wants so it just uses latin-1, leading to mojibake if the file content and encoding are not compatible with latin-1, e.g. utf-8.

The loadtxt function is a classic example of how *not* to do text, and whoever made it that way probably didn't understand unicode and the Python 3 text model. If they did understand what they were doing then they knew that they were implementing a dirty hack.

If you want to draw a relevant lesson from that thread in this one then the lesson argues against PEP 461: adding back the bytes formatting methods helps people who refuse to understand text processing and continue implementing dirty hacks instead of doing it properly.
Oscar
Re: [Python-Dev] PEP 461 updates
On 1/17/2014 4:37 PM, Chris Barker wrote:
> For the record, we've got a pretty good thread (not this good, though!)
> over on the numpy list about how to untangle the mess that has resulted
> from porting text-file-parsing code to py3 (and the underlying issue
> with the 'S' data type in numpy...)
>
> One note from the github issue:
> """
> The use of asbytes originates only from the fact that b'%d' % (20,)
> does not work.
> """
>
> So yeah PEP 461! (even if too late for numpy...)

Would they use "(u'%d' % (20,)).encode('ascii')" for that? Just curious as to what they're planning on doing.

Eric.
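For comparison, the encode-based workaround Eric suggests, next to the bytes interpolation PEP 461 restores (the latter requires Python 3.5+):

```python
# The Python 2/3-compatible workaround: format as text, then encode.
workaround = (u'%d' % (20,)).encode('ascii')
assert workaround == b'20'

# What PEP 461 brings back on Python 3.5+: interpolate on bytes directly.
assert b'%d' % (20,) == b'20'
```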
Re: [Python-Dev] PEP 461 updates
I hope you didn't mean to take this off-list:

On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer wrote:
> In gmane.comp.python.devel, you wrote:
>> For the record, we've got a pretty good thread (not this good, though!)
>> over on the numpy list about how to untangle the mess that has resulted
>
> Not sure about your definition of good. ;-)

Well, in the sense of "big" anyway...

> Could you summarize the main points on python-dev? I'm not feeling up to
> wading through another massive thread but I'm quite interested to hear
> the challenges that numpy deals with.

Well, not much new to it, really. But here's a re-cap:

numpy has had an 'S' dtype for a while, which corresponded to the py2 string type (except for being fixed length). So it could auto-convert to and from python strings... all was good and happy.

Enter py3: what to do? There is no py2 string type anymore. So it was decided to have the 'S' dtype correspond to the py3 bytes type. Apparently there was thought of renaming it, but the 'B' and 'b' type identifiers were already taken, so 'S' was kept.

However, as we all know in this thread, the py3 bytes type is not the same thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S' type for text data -- so that is kind of broken in py3. However, other folks use the 'S' type for binary data, so like (and rely on) it being mapped to the py3 bytes type. So we are stuck with that.

Given the nature of numpy, and scientific data, there is talk of having a one-byte-per-char text type in numpy (there is already a unicode type, but it uses 4 bytes per char, as it's key to the numpy data model that all objects of a given type are the same size). This would be analogous to the current multiple precision options for numbers. It would take up less memory, and would not be able to hold all values.
It's not clear what the level of support is for this right now -- after all, you can do everything you need to do with the appropriate calls to encode() and decode(), if a bit awkward.

Meanwhile, back at the ranch -- related, but separate, issues have arisen with the functions that parse text files: numpy.loadtxt and numpy.genfromtxt. These functions were adapted for py3 just enough to get things to mostly work, but have some serious limitations when doing anything with unicode -- and in fact do some weird things with plain ascii text files if you ask them to create unicode objects, which is a natural thing to do (and the "right" thing to do in the py3 text model) if you do something like:

    arr = loadtxt('a_file_name', dtype=str)

On py3, str is a py3 unicode string, so you get the numpy 'U' datatype. But loadtxt wasn't designed to deal with that, so you can get stuff like:

    ["b'C:UsersDocumentsProjectmytextfile1.txt'"
     "b'C:UsersDocumentsProjectmytextfile2.txt'"
     "b'C:UsersDocumentsProjectmytextfile3.txt'"]

This was (presumably -- I haven't debugged the code) due to conversion from bytes to unicode... (I'm still confused about the extra slashes.) And this is ascii text -- it gets worse if there is non-ascii text in there.

Anyway, the truth is, this stuff is hard, but it will get at least a touch easier with PEP 461. [Though to be truthful, I'm not sure why someone put a comment in the issue tracker about b'%d' % some_num being an issue... I'm not sure how it comes up when we're going from text to numbers, not the other way around...]

-Chris
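The b'...' wrappers (and doubled backslashes) are the classic symptom of calling str() on a bytes object in Python 3, which returns the repr rather than decoded text -- a plausible explanation for the output above, sketched here with a made-up path:

```python
# str() on bytes yields the repr, complete with the b'...' wrapper:
assert str(b'abc') == "b'abc'"

# Backslashes get escaped in the repr, which doubles them:
path = b'C:\\Users\\mytextfile1.txt'
assert str(path) == "b'C:\\\\Users\\\\mytextfile1.txt'"

# The fix is to decode explicitly instead:
assert path.decode('ascii') == 'C:\\Users\\mytextfile1.txt'
```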
Re: [Python-Dev] PEP 461 updates
For the record, we've got a pretty good thread (not this good, though!) over on the numpy list about how to untangle the mess that has resulted from porting text-file-parsing code to py3 (and the underlying issue with the 'S' data type in numpy...)

One note from the github issue:
"""
The use of asbytes originates only from the fact that b'%d' % (20,) does not work.
"""

So yeah PEP 461! (even if too late for numpy...)

-Chris
Re: [Python-Dev] PEP 461 updates
Steven D'Aprano writes:
> On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
>> "ASCII compatible" is a technical term in encodings, which means
>> "bytes in the range 0-127 always have ASCII coded character semantics,
>> do what you like with bytes in the range 128-255."[1]
>
> Examples, and counter-examples, may help. Let me see if I have got this
> right: an ASCII-compatible encoding may be an ASCII-superset like
> Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars
> are encoded to the same bytes as ASCII, and non-ASCII chars are not. A
> counter-example would be UTF-16, or some of the Asian encodings like
> Big5. Am I right so far?

All correct.

> But Nick isn't talking about an encoding, he's talking about a data
> format. I think that an ASCII-compatible format means one where (in at
> least *some* parts of the data) bytes between 0 and 127 have the same
> meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII
> character "T". This doesn't mean that every byte 84 means "T", only
> that some of them do -- hopefully well-defined sections of the data.
> Below, you introduce the term "ASCII segments" for these.

Yes, except that I believe Nick, as well as the "file-and-wire guys", strengthen "hopefully well-defined" to just "well-defined".

>> are designed for use *only* on bytes
>> that are ASCII segments; use on other data is likely to cause
>> hard-to-diagnose corruption.
>
> An example: if you have the byte b'\x63', calling upper() on that will
> return b'\x43'. That is only meaningful if the byte is intended as the
> ASCII character "c".

Good example.
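The encoding-level distinction is easy to check mechanically; a small self-contained illustration (examples mine):

```python
# ASCII characters map to the same single bytes in ASCII-compatible encodings:
for enc in ('ascii', 'utf-8', 'latin-1'):
    assert 'T'.encode(enc) == b'T'

# UTF-16 is a counter-example: even pure-ASCII text gets different bytes.
assert 'T'.encode('utf-16-le') == b'T\x00'

# And in UTF-8, non-ASCII characters never reuse bytes in the 0-127 range:
assert all(b >= 0x80 for b in 'é'.encode('utf-8'))
```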
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 9:46 PM, Nick Coghlan wrote:
> On 17 January 2014 11:51, Ethan Furman wrote:
>> On 01/16/2014 05:32 PM, Greg wrote:
>>> I don't think it matters whether the internal details of that
>>> debate make sense to the rest of us. The main thing is that
>>> a consensus seems to have been reached on bytes formatting
>>> being basically a good thing.
>>
>> And a good thing, too, on both counts! :)
>>
>> A few folks have suggested not implementing .format() on bytes; I've
>> been resistant, but then I remembered that format is also a function.
>>
>> http://docs.python.org/3/library/functions.html?highlight=ascii#format
>> ==
>> format(value[, format_spec])
>>
>> Convert a value to a “formatted” representation, as controlled by
>> format_spec. The interpretation of format_spec will depend on the type
>> of the value argument, however there is a standard formatting syntax
>> that is used by most built-in types: Format Specification
>> Mini-Language.
>>
>> The default format_spec is an empty string which usually gives the
>> same effect as calling str(value).
>>
>> A call to format(value, format_spec) is translated to
>> type(value).__format__(format_spec) which bypasses the instance
>> dictionary when searching for the value’s __format__() method. A
>> TypeError exception is raised if the method is not found or if either
>> the format_spec or the return value are not strings.
>> ==
>>
>> Given that, I can relent on .format and just go with .__mod__ . A
>> low-level service for a low-level protocol, what? ;)
>
> Exactly - while I'm a fan of the new extensible formatting system and
> strongly prefer it to printf-style formatting for text, it also has a
> whole lot of complexity that is hard to translate to the binary
> domain, including the format() builtin and __format__ methods. Since
> the relevant use cases appear to be already covered adequately by
> printf-style formatting, attempting to translate the flexible text
> formatting system as well just becomes additional complexity we don't
> need.
> I like Stephen Turnbull's suggestion of using "binary formats with
> ASCII segments" to distinguish the kind of formats we're talking about
> from ASCII compatible text encodings,

I liked that too, and almost said so on his posting, but will say it here instead.

> and I think Python 3.5 will end up with a suite of solutions that
> suitably covers all use cases, just by bringing back printf-style
> formatting directly to bytes:
>
> * format(), str.format(), str.format_map(): a rich extensible text
> formatting system, including date interpolation support
> * str.__mod__: retained primarily for backwards compatibility, may
> occasionally be used as a text formatting optimisation tool (since the
> inflexibility means it will likely always be marginally faster than
> the rich formatting system for the cases that it covers)
> * bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify
> production of data in variable length binary formats that contain
> ASCII segments
> * the struct module: rich (but not extensible) formatting system for
> fixed length binary formats

Adding format codes with variable length could extend the struct module to additional uses. C structs, on which it is modeled, often get around the difficulty of variable length items by defining one variable length item at the end, or by defining offsets in the fixed part to variable length parts that follow. Such a structure cannot presently be created by struct alone.

> In Python 2, the binary format with ASCII segments use case was
> intermingled with general purpose text formatting on the str type,
> which is I think the main reason it has taken us so long to convince
> ourselves it is something that is genuinely worth bringing back in a
> more limited form in Python 3, rather than just being something we
> wanted back because we were used to having it in Python 2.
>
> Cheers,
> Nick.
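The struct limitation can be sketched like this: a fixed header carrying the length of a trailing variable-length payload has to be assembled and parsed in two steps, since struct alone cannot express the whole record (the format and field names here are my own, purely illustrative):

```python
import struct

HEADER = '<4sI'  # 4-byte magic + payload length (illustrative layout)

payload = b'hello world'
record = struct.pack(HEADER, b'MAGI', len(payload)) + payload  # two steps

magic, length = struct.unpack_from(HEADER, record)
offset = struct.calcsize(HEADER)
body = record[offset:offset + length]
assert magic == b'MAGI' and body == payload
```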
Re: [Python-Dev] PEP 461 updates
On 17 January 2014 11:51, Ethan Furman wrote:
> On 01/16/2014 05:32 PM, Greg wrote:
>> I don't think it matters whether the internal details of that
>> debate make sense to the rest of us. The main thing is that
>> a consensus seems to have been reached on bytes formatting
>> being basically a good thing.
>
> And a good thing, too, on both counts! :)
>
> A few folks have suggested not implementing .format() on bytes; I've
> been resistant, but then I remembered that format is also a function.
>
> http://docs.python.org/3/library/functions.html?highlight=ascii#format
> ==
> format(value[, format_spec])
>
> Convert a value to a “formatted” representation, as controlled by
> format_spec. The interpretation of format_spec will depend on the type
> of the value argument, however there is a standard formatting syntax
> that is used by most built-in types: Format Specification Mini-Language.
>
> The default format_spec is an empty string which usually gives the same
> effect as calling str(value).
>
> A call to format(value, format_spec) is translated to
> type(value).__format__(format_spec) which bypasses the instance
> dictionary when searching for the value’s __format__() method. A
> TypeError exception is raised if the method is not found or if either
> the format_spec or the return value are not strings.
> ==
>
> Given that, I can relent on .format and just go with .__mod__ . A
> low-level service for a low-level protocol, what? ;)

Exactly - while I'm a fan of the new extensible formatting system and strongly prefer it to printf-style formatting for text, it also has a whole lot of complexity that is hard to translate to the binary domain, including the format() builtin and __format__ methods. Since the relevant use cases appear to be already covered adequately by printf-style formatting, attempting to translate the flexible text formatting system as well just becomes additional complexity we don't need.
I like Stephen Turnbull's suggestion of using "binary formats with ASCII segments" to distinguish the kind of formats we're talking about from ASCII compatible text encodings, and I think Python 3.5 will end up with a suite of solutions that suitably covers all use cases, just by bringing back printf-style formatting directly to bytes:

* format(), str.format(), str.format_map(): a rich extensible text formatting system, including date interpolation support
* str.__mod__: retained primarily for backwards compatibility, may occasionally be used as a text formatting optimisation tool (since the inflexibility means it will likely always be marginally faster than the rich formatting system for the cases that it covers)
* bytes.__mod__, bytearray.__mod__: restored in Python 3.5 to simplify production of data in variable length binary formats that contain ASCII segments
* the struct module: rich (but not extensible) formatting system for fixed length binary formats

In Python 2, the binary format with ASCII segments use case was intermingled with general purpose text formatting on the str type, which is I think the main reason it has taken us so long to convince ourselves it is something that is genuinely worth bringing back in a more limited form in Python 3, rather than just being something we wanted back because we were used to having it in Python 2.

Cheers,
Nick.
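The bytes.__mod__ use case looks like this in practice on Python 3.5+ (the header line is a generic illustration, not taken from the thread):

```python
# Interpolating an ASCII segment into a binary message without a
# bytes -> str -> bytes round trip:
body = b'\x00\x01\x02'
header = b'Content-Length: %d\r\n\r\n' % (len(body),)
assert header + body == b'Content-Length: 3\r\n\r\n\x00\x01\x02'

# %s accepts bytes, and %b is the bytes-native spelling:
assert b'%s=%b' % (b'key', b'value') == b'key=value'
```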
Re: [Python-Dev] PEP 461 updates
On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
> Meta enough that I'll take Guido out of the CC.
>
> Nick Coghlan writes:
>> There are plenty of data formats (like SMTP and HTTP) that are
>> constrained to be ASCII compatible,
>
> "ASCII compatible" is a technical term in encodings, which means
> "bytes in the range 0-127 always have ASCII coded character semantics,
> do what you like with bytes in the range 128-255."[1]

Examples, and counter-examples, may help. Let me see if I have got this right: an ASCII-compatible encoding may be an ASCII-superset like Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars are encoded to the same bytes as ASCII, and non-ASCII chars are not. A counter-example would be UTF-16, or some of the Asian encodings like Big5. Am I right so far?

But Nick isn't talking about an encoding, he's talking about a data format. I think that an ASCII-compatible format means one where (in at least *some* parts of the data) bytes between 0 and 127 have the same meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII character "T". This doesn't mean that every byte 84 means "T", only that some of them do -- hopefully well-defined sections of the data. Below, you introduce the term "ASCII segments" for these.

> Worse, it's clearly confusing in this discussion. Let's stop using
> this term to mean
>
>     the data format has elements that are defined to contain only
>     bytes with ASCII coded character semantics
>
> (which is the relevant restriction AFAICS -- I don't know of any
> ASCII-compatible formats where the bytes 128-255 are used for any
> purpose other than encoding non-ASCII characters). OTOH, if it *is*
> an ASCII-compatible text encoding, the semantics are dubious if the
> bytes versions of many of these methods/operations are used.
> A documentation suggestion: It's easy enough to rewrite
>
>> constrained to be ASCII compatible, either globally, or locally in
>> the parts being manipulated by an application (such as a file
>> header). ASCII incompatible segments may be present, but in ways
>> that allow the data processing to handle them correctly.
>
> as
>
>     containing 'well-defined segments constrained to be (strictly)
>     ASCII-encoded' (aka ASCII segments).
>
> And then you can say
>
>     are designed for use *only* on bytes
>     that are ASCII segments; use on other data is likely to cause
>     hard-to-diagnose corruption.

An example: if you have the byte b'\x63', calling upper() on that will return b'\x43'. That is only meaningful if the byte is intended as the ASCII character "c".

> Footnotes:
> [1] "ASCII coded character semantics" is of course mildly ambiguous
> due to considerations like EOL conventions. But "you know what I'm
> talking about".

I think I know what you're talking about, but don't know for sure unless I explain it back to you.

-- Steven
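Steven's example is easy to verify, along with the way bytes.upper() quietly ignores anything outside the ASCII range (the latin-1 byte is my own addition):

```python
# b'\x63' is the ASCII encoding of "c"; upper() maps it to b'\x43' ("C"):
assert b'\x63' == b'c'
assert b'\x63'.upper() == b'\x43' == b'C'

# Outside ASCII, bytes.upper() is a silent no-op -- the latin-1 byte for
# 'é' stays as-is, which is exactly the hard-to-diagnose behaviour above:
assert b'\xe9'.upper() == b'\xe9'
assert 'é'.upper() == 'É'  # the str method, by contrast, knows the character
```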
Re: [Python-Dev] PEP 461 updates
Greg wrote:
> I don't think it matters whether the internal details of that
> debate make sense to the rest of us. The main thing is that
> a consensus seems to have been reached on bytes formatting
> being basically a good thing.

I've been mostly steering clear of the metaphysical and writing code today. ;-) An extremely rough patch has been uploaded:
http://bugs.python.org/issue20284

I have a new one almost ready that introduces __ascii__ rather than overloading __format__. I like it better, will upload to issue tracker soon.

Regards,
Neil
Re: [Python-Dev] PEP 461 updates
Greg writes:
> I don't think it matters whether the internal details of [the EIBTI
> vs. PBP] debate make sense to the rest of us. The main thing is
> that a consensus seems to have been reached on bytes formatting
> being basically a good thing.

I think some of it matters to the documentation.
Re: [Python-Dev] PEP 461 updates
Meta enough that I'll take Guido out of the CC.

Nick Coghlan writes:
> There are plenty of data formats (like SMTP and HTTP) that are
> constrained to be ASCII compatible,

"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1] Worse, it's clearly confusing in this discussion. Let's stop using this term to mean

    the data format has elements that are defined to contain only
    bytes with ASCII coded character semantics

(which is the relevant restriction AFAICS -- I don't know of any ASCII-compatible formats where the bytes 128-255 are used for any purpose other than encoding non-ASCII characters). OTOH, if it *is* an ASCII-compatible text encoding, the semantics are dubious if the bytes versions of many of these methods/operations are used.

A documentation suggestion: It's easy enough to rewrite

> constrained to be ASCII compatible, either globally, or locally in
> the parts being manipulated by an application (such as a file
> header). ASCII incompatible segments may be present, but in ways
> that allow the data processing to handle them correctly.

as

    containing 'well-defined segments constrained to be (strictly)
    ASCII-encoded' (aka ASCII segments).

And then you can say

    are designed for use *only* on bytes
    that are ASCII segments; use on other data is likely to cause
    hard-to-diagnose corruption.

If there are other use cases for "ASCII-compatible data formats" as defined above (not worrying about codecs, because they are a very small minority of code-to-be-written at this point), I don't know about them. Does anyone? If there are any, I'll be happy to revise. If not, that seems to be a precise and intelligible statement of the restrictions that is useful to the practical use cases.
And nothing stops users who think they know what they're doing from using them in other contexts (which can be documented if they turn out to be broadly useful). Footnotes: [1] "ASCII coded character semantics" is of course mildly ambiguous due to considerations like EOL conventions. But "you know what I'm talking about".
Re: [Python-Dev] PEP 461 updates
On 01/16/2014 05:32 PM, Greg wrote: I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing. And a good thing, too, on both counts! :) A few folks have suggested not implementing .format() on bytes; I've been resistant, but then I remembered that format is also a function. http://docs.python.org/3/library/functions.html?highlight=ascii#format == format(value[, format_spec]) Convert a value to a “formatted” representation, as controlled by format_spec. The interpretation of format_spec will depend on the type of the value argument, however there is a standard formatting syntax that is used by most built-in types: Format Specification Mini-Language. The default format_spec is an empty string which usually gives the same effect as calling str(value). A call to format(value, format_spec) is translated to type(value).__format__(format_spec) which bypasses the instance dictionary when searching for the value’s __format__() method. A TypeError exception is raised if the method is not found or if either the format_spec or the return value are not strings. == Given that, I can relent on .format and just go with .__mod__ . A low-level service for a low-level protocol, what? ;) -- ~Ethan~
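The delegation described in the docs quoted above is easy to see in practice. A minimal illustration (the Celsius class is made up for this example):

```python
# The format() builtin delegates to type(value).__format__, bypassing
# the instance dictionary, so a class controls its own formatted form.
class Celsius:
    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Reuse float formatting for the numeric part, then tag the unit.
        return format(self.degrees, spec or '.1f') + '\N{DEGREE SIGN}C'

t = Celsius(36.6)
print(format(t))           # 36.6°C
print(format(t, '08.3f'))  # 0036.600°C

# The lookup goes through the type, not the instance:
assert format(t) == type(t).__format__(t, '')
```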
Re: [Python-Dev] PEP 461 updates
On 17/01/2014 10:18 a.m., Terry Reedy wrote: On 1/16/2014 5:11 AM, Nick Coghlan wrote: Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, Nick's initial arguments against bytes formatting were very abstract and philosophical, along the lines that it violated some pure mental model of text/bytes separation. Then Guido said something that Nick took to be an equal and opposite philosophical argument that cancelled out his original objections, and he withdrew them. I don't think it matters whether the internal details of that debate make sense to the rest of us. The main thing is that a consensus seems to have been reached on bytes formatting being basically a good thing. -- Greg
Re: [Python-Dev] PEP 461 updates
On 17 Jan 2014 09:36, "Terry Reedy" wrote: > > On 1/16/2014 4:59 PM, Guido van Rossum wrote: > >> I'm getting tired of "did you understand what I said". > > > I was asking whether I needed to repeat myself, but forget that. > I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it. There are plenty of data formats (like SMTP and HTTP) that are constrained to be ASCII compatible, either globally, or locally in the parts being manipulated by an application (such as a file header). ASCII incompatible segments may be present, but in ways that allow the data processing to handle them correctly. The ASCII assuming methods on bytes objects are there to help in dealing with that kind of data. If the binary data is just one large block in a single text encoding, it's generally easier to just decode it to text, but multipart formats generally don't allow that.
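The pattern Nick describes, an ASCII-constrained header wrapped around arbitrary binary payload, can be sketched like this (the message layout below is invented for illustration, not a real protocol):

```python
# A format whose *header* section is constrained to ASCII while the
# payload is arbitrary binary data: only the header is decoded, and
# the bytes methods (split, partition) do the ASCII-assuming work.
message = (b'Content-Length: 4\r\n'
           b'Content-Type: application/octet-stream\r\n'
           b'\r\n'
           b'\xde\xad\xbe\xef')

header_blob, payload = message.split(b'\r\n\r\n', 1)
headers = {}
for line in header_blob.split(b'\r\n'):
    name, _, value = line.partition(b': ')
    # Only the header section is guaranteed ASCII, so only it is decoded.
    headers[name.decode('ascii')] = value.decode('ascii')

print(headers)   # {'Content-Length': '4', 'Content-Type': 'application/octet-stream'}
print(payload)   # b'\xde\xad\xbe\xef' -- stays as raw bytes
```

Decoding the whole message with a single codec would either fail or corrupt the payload, which is exactly why these bytes methods exist.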
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 4:59 PM, Guido van Rossum wrote: I'm getting tired of "did you understand what I said". I was asking whether I needed to repeat myself, but forget that. I was also saying that while I understand 'ascii-compatible encoding', I do not understand the notion of 'ascii-compatible data' or statements based on it.
Re: [Python-Dev] PEP 461 updates
On Thu, Jan 16, 2014 at 1:18 PM, Terry Reedy wrote: > On 1/16/2014 5:11 AM, Nick Coghlan wrote: > >> Guido's successful counter was to point out that the parsing of the >> format string itself assumes ASCII compatible data, > > Did you see my explanation, which I wrote in response to one of your earlier > posts, of why I think "the parsing of the format string itself assumes ASCII > compatible data" that statement is confused and wrong? The above seems to > say that what I wrote is impossible, but perhaps I misunderstand what Guido > and you mean. Among my questions are "by data, do you mean interpolated > objects or interpolated bytes?" and "what restriction on 'data' do you > intend by 'ASCII compatible'?". Can you move the meta-discussion off-list? I'm getting tired of "did you understand what I said". -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 461 updates
On 1/16/2014 5:11 AM, Nick Coghlan wrote: Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, Did you see my explanation, which I wrote in response to one of your earlier posts, of why I think the statement "the parsing of the format string itself assumes ASCII compatible data" is confused and wrong? The above seems to say that what I wrote is impossible, but perhaps I misunderstand what Guido and you mean. Among my questions are "by data, do you mean interpolated objects or interpolated bytes?" and "what restriction on 'data' do you intend by 'ASCII compatible'?". -- Terry Jan Reedy
Re: [Python-Dev] PEP 461 updates
Carl Meyer wrote: > I think the PEP could really use a rationale section summarizing _why_ > these formatting operations are being added to bytes I agree. My attempt at re-writing the PEP is below. >> In order to avoid the problems of auto-conversion and >> value-generated exceptions, all object checking will be done via >> isinstance, not by values contained in a Unicode representation. >> In other words:: >> >> - duck-typing to allow/reject entry into a byte-stream >> - no value generated errors > > This seems self-contradictory; "isinstance" is type-checking, which is > the opposite of duck-typing. Again, I agree. We should avoid isinstance checks if possible. Abstract This PEP proposes adding %-interpolation to the bytes object. Rationale A disruptive but useful change introduced in Python 3.0 was the clean separation of byte strings (i.e. the "bytes" object) from character strings (i.e. the "str" object). The benefit is that character encodings must be explicitly specified and the risk of corrupting character data is reduced. Unfortunately, this separation has made writing certain types of programs more complicated and verbose. For example, programs that deal with network protocols often manipulate ASCII encoded strings. Since the "bytes" type does not support string formatting, extra encoding and decoding between the "str" type is required. For simplicity and convenience it is desirable to introduce formatting methods to "bytes" that allow formatting of ASCII-encoded character data. This change would blur the clean separation of byte strings and character strings. However, it is felt that the practical benefits outweigh the purity costs. The implicit assumption of ASCII-encoding would be limited to formatting methods. One source of many problems with the Python 2 Unicode implementation is the implicit coercion of Unicode character strings into byte strings using the "ascii" codec. If the character strings contained only ASCII characters, all was well. 
However, if the string contained a non-ASCII character then coercion caused an exception. The combination of implicit coercion and value dependent failures has proven to be a recipe for hard to debug errors. A program may seem to work correctly when tested (e.g. string input that happened to be ASCII only) but later would fail, often with a traceback far from the source of the real error. The formatting methods for bytes() should avoid this problem by not implicitly encoding data that might fail based on the content of the data. Another desirable feature is to allow arbitrary user classes to be used as formatting operands. Generally this is done by introducing a special method that can be implemented by the new class. Proposed semantics for bytes formatting === Special method __ascii__ A new special method, analogous to __format__, is introduced. This method takes a single argument, a format specifier. The return value is a bytes object. Objects that have an ASCII-only representation can implement this method to allow them to be used as format operands. Objects with natural byte representations should implement __bytes__ or the Py_buffer API. %-interpolation --- All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers. To avoid having to introduce two special methods, the format specifications will be translated to equivalent __format__ specifiers and the __ascii__ method of each argument will be called. Example:: >>> b'%4x' % 10 b' a' %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. Example: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' %s is restricted in what it will accept:: - input type supports Py_buffer or has __bytes__? use it to collect the necessary bytes (may contain non-ASCII characters) - input type is something else? 
use its __ascii__ method; if there isn't one, raise TypeError Examples: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 b'3.14' >>> b'%4s' % 12 b' 12' >>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __ascii__ method, perhaps you need to encode it? .. note:: Because the str type does not have a __ascii__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence:: 'a string'.encode('latin-1') Unsupported % format codes ^^ %r (which calls __repr__) is not supported format -- The format() method will not be implemented at this time but may be added in a later Python release. The __ascii__ method is
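The %s dispatch rule proposed above (buffer/__bytes__ first, then a fallback hook) could be sketched in pure Python. Note that __ascii__ is only a *proposal* in this thread and never became part of Python; the function below is purely illustrative:

```python
# Pure-Python sketch of the proposed %s dispatch. The __ascii__ hook
# is hypothetical (from this draft PEP), not a real Python protocol.
def interpolate_s(value):
    # 1. Buffer-supporting or __bytes__-supporting input: take raw bytes
    #    (these may contain non-ASCII byte values).
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    if hasattr(type(value), '__bytes__'):
        return bytes(value)
    # 2. Anything else: fall back to the hypothetical __ascii__ hook.
    ascii_hook = getattr(type(value), '__ascii__', None)
    if ascii_hook is not None:
        return ascii_hook(value, '')
    raise TypeError('%r has no __ascii__ method, perhaps you need '
                    'to encode it?' % (value,))

class Pi:
    # An object with an ASCII-only representation, per the proposal.
    def __ascii__(self, spec):
        return b'3.14'

print(interpolate_s(b'abc'))  # b'abc'
print(interpolate_s(Pi()))    # b'3.14'
```

A plain str falls through both branches and raises TypeError, matching the draft's refusal to implicitly encode text.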
Re: [Python-Dev] PEP 461 updates
On 01/16/2014 04:49 AM, Michael Urman wrote: On Thu, Jan 16, 2014 at 1:52 AM, Ethan Furman wrote: Is this an intended exception to the overriding principle? Hmm, thanks for spotting that. Yes, that would be a value error if anything over 255 is used, both currently in Py2, and for bytes in Py3. As Carl suggested, a little more explanation is needed in the PEP. FYI, note that str/unicode already has another value-dependent exception with %c. I find the message surprising, as I wasn't aware Python had a 'char' type: '%c' % 'a' 'a' '%c' % 'abc' Traceback (most recent call last): File "", line 1, in TypeError: %c requires int or char Python doesn't have a char type, it has str's of length 1... which are usually referred to as char's. ;) -- ~Ethan~
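The str behavior Michael quotes is easy to reproduce in Python 3: %c accepts an int (a code point) or a length-1 str, and rejects anything longer with a TypeError:

```python
# str %c: accepts an int code point or a length-1 string.
assert '%c' % 97 == 'a'
assert '%c' % 'a' == 'a'

# A longer string is rejected, with the message quoted above.
try:
    '%c' % 'abc'
except TypeError as exc:
    print(exc)   # "%c requires int or char"
else:
    raise AssertionError('expected TypeError')
```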
Re: [Python-Dev] PEP 461 updates
On 16 Jan 2014 11:45, "Carl Meyer" wrote: > > Hi Ethan, > > I haven't chimed into this discussion, but the direction it's headed > recently seems right to me. Thanks for putting together a PEP. Some > comments on it: > > On 01/15/2014 05:13 PM, Ethan Furman wrote: > > > > Abstract > > > > > > This PEP proposes adding the % and {} formatting operations from str to > > bytes [1]. > > I think the PEP could really use a rationale section summarizing _why_ > these formatting operations are being added to bytes; namely that they > are useful when working with various ASCIIish-but-not-properly-text > network protocols and file formats, and in particular when porting code > dealing with such formats/protocols from Python 2. > > Also I think it would be useful to have a section summarizing the > primary objections that have been raised, and why those objections have > been overruled (presuming the PEP is accepted). For instance: the main > objection, AIUI, has been that the bytes type is for pure bytes-handling > with no assumptions about encoding, and thus we should not add features > to it that assume ASCIIness, and that may be attractive nuisances for > people writing bytes-handling code that should not assume ASCIIness but > will once they use the feature. Close, but not quite - the concern was that this was a feature that didn't *inherently* imply a restriction to ASCII compatible data, but only did so when the numeric formatting codes were used. This made it a source of value dependent compatibility errors based on the format string, akin to the kind of value dependent errors seen when implicitly encoding arbitrary text as ASCII. Guido's successful counter was to point out that the parsing of the format string itself assumes ASCII compatible data, thus placing at least the mod-formatting operation in the same category as the currently existing valid-for-sufficiently-ASCII-compatible-data only operations. 
Current discussions suggest to me that the argument against implicit encoding operations that introduce latent data driven defects may still apply to bytes.format though, so I've reverted to being -1 on that. Cheers, Nick. >And the refutation: that the bytes type > already provides some operations that assume ASCIIness, and these new > formatting features are no more of an attractive nuisance than those; > since the syntax of the formatting mini-languages themselves itself > assumes ASCIIness, there is not likely to be any temptation to use it > with binary data that cannot. > > Although it can be hard to arrive at accurate and agreed-on summaries of > the discussion, recording such summaries in the PEP is important; it may > help save our future selves and colleagues from having to revisit all > these same discussions and megathreads. > > > Overriding Principles > > = > > > > In order to avoid the problems of auto-conversion and value-generated > > exceptions, > > all object checking will be done via isinstance, not by values contained > > in a > > Unicode representation. In other words:: > > > > - duck-typing to allow/reject entry into a byte-stream > > - no value generated errors > > This seems self-contradictory; "isinstance" is type-checking, which is > the opposite of duck-typing. A duck-typing implementation would not use > isinstance, it would call / check for the existence of a certain magic > method instead. > > I think it might also be good to expand (very) slightly on what "the > problems of auto-conversion and value-generated exceptions" are; that > is, that the benefit of Python 3's model is that encoding is explicit, > not implicit, making it harder to unwittingly write code that works as > long as all data is ASCII, but fails as soon as someone feeds in > non-ASCII text data. 
> > Not everyone who reads this PEP will be steeped in years of discussion > about the relative merits of the Python 2 vs 3 models; it doesn't hurt > to spell out a few assumptions. > > > > Proposed semantics for bytes formatting > > === > > > > %-interpolation > > --- > > > > All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) > > will be supported, and will work as they do for str, including the > > padding, justification and other related modifiers, except locale. > > > > Example:: > > > >>>> b'%4x' % 10 > >b' a' > > > > %c will insert a single byte, either from an int in range(256), or from > > a bytes argument of length 1. > > > > Example: > > > > >>> b'%c' % 48 > > b'0' > > > > >>> b'%c' % b'a' > > b'a' > > > > %s is restricted in what it will accept:: > > > > - input type supports Py_buffer? > > use it to collect the necessary bytes > > > > - input type is something else? > > use its __bytes__ method; if there isn't one, raise an exception [2] > > > > Examples: > > > > >>> b'%s' % b'abc' > > b'abc' > > > > >>> b'%s' % 3.14 > > Trac
Re: [Python-Dev] PEP 461 updates
On 01/15/2014 06:12 PM, Glenn Linderman wrote: On 1/15/2014 4:13 PM, Ethan Furman wrote: - no value generated errors ... %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. what does x = 354 b"%c" % x produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2). Is this an intended exception to the overriding principle? Hmm, thanks for spotting that. Yes, that would be a value error if anything over 255 is used, both currently in Py2, and for bytes in Py3. As Carl suggested, a little more explanation is needed in the PEP. -- ~Ethan~
Re: [Python-Dev] PEP 461 updates
On 01/15/2014 05:17 PM, Carl Meyer wrote: I think the PEP could really use a rationale section It will have one before it's done. Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled Excellent point. That section will also be present. In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words:: - duck-typing to allow/reject entry into a byte-stream - no value generated errors This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. Good point, I'll reword that. It will be duck-typing. I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are Will do. .. [2] TypeError, ValueError, or UnicodeEncodeError? TypeError seems right to me. Definitely not UnicodeEncodeError - refusal to implicitly encode is not at all the same thing as an encoding error. That's the direction I'm leaning, too. Thanks for your comments! -- ~Ethan~
Re: [Python-Dev] PEP 461 updates
Glenn Linderman wrote: x = 354 b"%c" % x Is this an intended exception to the overriding principle? I think it's an unavoidable one, unless we want to introduce an "integer in the range 0-255" type. But that would just push the problem into another place, since b"%c" % byte(x) would then blow up on byte(x) if x were out of range. If you really want to make sure it won't crash, you can always do b"%c" % (x & 0xff) or whatever your favourite method of mangling out-of-range ints is. -- Greg
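Greg's masking trick works as-is with the bytes %-formatting that eventually shipped (Python 3.5+, via PEP 461): the & 0xff forces the value into range(256), so the operation cannot fail on an out-of-range int:

```python
# Mask the int into range(256) before handing it to %c, so the
# interpolation cannot raise on out-of-range values.
x = 354
print(b'%c' % (x & 0xff))   # b'b'  (354 & 0xff == 98 == ord('b'))
```

Whether silently wrapping the value is better than a loud error is, of course, the heart of the disagreement in this thread.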
Re: [Python-Dev] PEP 461 updates
Surprisingly, in this case the exception is just what the doctor ordered. :-) On Wed, Jan 15, 2014 at 6:12 PM, Glenn Linderman wrote: > On 1/15/2014 4:13 PM, Ethan Furman wrote: > > - no value generated errors > > ... > > > %c will insert a single byte, either from an int in range(256), or from > a bytes argument of length 1. > > > what does > > x = 354 > b"%c" % x > > produce? Seems that construct produces a value dependent error in both > python 2 & 3 (although it takes a much bigger value to produce the error in > python 3, with str %... with bytes %, the problem will be reached at 256, > just like python 2). > > Is this an intended exception to the overriding principle? -- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 461 updates
On 1/15/2014 4:13 PM, Ethan Furman wrote: - no value generated errors ... %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. what does x = 354 b"%c" % x produce? Seems that construct produces a value dependent error in both python 2 & 3 (although it takes a much bigger value to produce the error in python 3, with str %... with bytes %, the problem will be reached at 256, just like python 2). Is this an intended exception to the overriding principle?
Re: [Python-Dev] PEP 461 updates
Hi Ethan, I haven't chimed into this discussion, but the direction it's headed recently seems right to me. Thanks for putting together a PEP. Some comments on it: On 01/15/2014 05:13 PM, Ethan Furman wrote: > > Abstract > > > This PEP proposes adding the % and {} formatting operations from str to > bytes [1]. I think the PEP could really use a rationale section summarizing _why_ these formatting operations are being added to bytes; namely that they are useful when working with various ASCIIish-but-not-properly-text network protocols and file formats, and in particular when porting code dealing with such formats/protocols from Python 2. Also I think it would be useful to have a section summarizing the primary objections that have been raised, and why those objections have been overruled (presuming the PEP is accepted). For instance: the main objection, AIUI, has been that the bytes type is for pure bytes-handling with no assumptions about encoding, and thus we should not add features to it that assume ASCIIness, and that may be attractive nuisances for people writing bytes-handling code that should not assume ASCIIness but will once they use the feature. And the refutation: that the bytes type already provides some operations that assume ASCIIness, and these new formatting features are no more of an attractive nuisance than those; since the syntax of the formatting mini-languages themselves assumes ASCIIness, there is not likely to be any temptation to use it with binary data that cannot. Although it can be hard to arrive at accurate and agreed-on summaries of the discussion, recording such summaries in the PEP is important; it may help save our future selves and colleagues from having to revisit all these same discussions and megathreads. > Overriding Principles > = > > In order to avoid the problems of auto-conversion and value-generated > exceptions, > all object checking will be done via isinstance, not by values contained > in a > Unicode representation. 
In other words:: > > - duck-typing to allow/reject entry into a byte-stream > - no value generated errors This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing. A duck-typing implementation would not use isinstance, it would call / check for the existence of a certain magic method instead. I think it might also be good to expand (very) slightly on what "the problems of auto-conversion and value-generated exceptions" are; that is, that the benefit of Python 3's model is that encoding is explicit, not implicit, making it harder to unwittingly write code that works as long as all data is ASCII, but fails as soon as someone feeds in non-ASCII text data. Not everyone who reads this PEP will be steeped in years of discussion about the relative merits of the Python 2 vs 3 models; it doesn't hurt to spell out a few assumptions. > Proposed semantics for bytes formatting > === > > %-interpolation > --- > > All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) > will be supported, and will work as they do for str, including the > padding, justification and other related modifiers, except locale. > > Example:: > >>>> b'%4x' % 10 >b' a' > > %c will insert a single byte, either from an int in range(256), or from > a bytes argument of length 1. > > Example: > > >>> b'%c' % 48 > b'0' > > >>> b'%c' % b'a' > b'a' > > %s is restricted in what it will accept:: > > - input type supports Py_buffer? > use it to collect the necessary bytes > > - input type is something else? > use its __bytes__ method; if there isn't one, raise an exception [2] > > Examples: > > >>> b'%s' % b'abc' > b'abc' > > >>> b'%s' % 3.14 > Traceback (most recent call last): > ... > TypeError: 3.14 has no __bytes__ method > > >>> b'%s' % 'hello world!' > Traceback (most recent call last): > ... > TypeError: 'hello world' has no __bytes__ method, perhaps you need > to encode it? > > .. 
note:: > >Because the str type does not have a __bytes__ method, attempts to >directly use 'a string' as a bytes interpolation value will raise an >exception. To use 'string' values, they must be encoded or otherwise >transformed into a bytes sequence:: > > 'a string'.encode('latin-1') > > format > -- > > The format mini language codes, where they correspond with the > %-interpolation codes, > will be used as-is, with three exceptions:: > > - !s is not supported, as {} can mean the default for both str and > bytes, in both > Py2 and Py3. > - !b is supported, and new Py3k code can use it to be explicit. > - no other __format__ method will be called. > > Numeric Format Codes > > > To properly handle int and float subclasses, int(), index(), and float() > will be called on the > obje
[Python-Dev] PEP 461 updates
Current copy of PEP, many modifications from all the feedback. Thank you to everyone. I know it's been a long week (feels a lot longer!) while all this was hammered out, but I think we're getting close! Abstract This PEP proposes adding the % and {} formatting operations from str to bytes [1]. Overriding Principles = In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words:: - duck-typing to allow/reject entry into a byte-stream - no value generated errors Proposed semantics for bytes formatting === %-interpolation --- All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers, except locale. Example:: >>> b'%4x' % 10 b' a' %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. Example: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' %s is restricted in what it will accept:: - input type supports Py_buffer? use it to collect the necessary bytes - input type is something else? use its __bytes__ method; if there isn't one, raise an exception [2] Examples: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method >>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it? .. note:: Because the str type does not have a __bytes__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. 
To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence:: 'a string'.encode('latin-1') format -- The format mini language codes, where they correspond with the %-interpolation codes, will be used as-is, with three exceptions:: - !s is not supported, as {} can mean the default for both str and bytes, in both Py2 and Py3. - !b is supported, and new Py3k code can use it to be explicit. - no other __format__ method will be called. Numeric Format Codes To properly handle int and float subclasses, int(), index(), and float() will be called on the objects intended for (d, i, u), (b, o, x, X), and (e, E, f, F, g, G). Unsupported codes - %r (which calls __repr__), and %a (which calls ascii() on __repr__) are not supported. !r and !a are not supported. The n integer and float format code is not supported. Open Questions == Currently non-numeric objects go through:: - Py_buffer - __bytes__ - failure Do we want to add a __format_bytes__ method in there? - Guaranteed to produce only ascii (as in b'10', not b'\x0a') - Makes more sense than using __bytes__ to produce ascii output - What if an object has both __bytes__ and __format_bytes__? Do we need to support all the numeric format codes? The floating point exponential formats seem less appropriate, for example. Proposed variations === It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded. It has been suggested to use %b for bytes instead of %s. - Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s. It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s. - Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed. It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'"). 
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed. Footnotes = .. [1] string.Template is not under consideration. .. [2] TypeError, ValueError, or UnicodeEncodeError? == -- ~Ethan~
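For reference, bytes %-formatting did eventually ship in Python 3.5 via PEP 461, though the final spec differs from this draft in places (e.g. %s falls back to __bytes__ rather than a dedicated ASCII hook, %b and %a were added, and bytes.format() was not). A quick check of the draft's examples against the shipped behavior:

```python
# Behavior of bytes %-interpolation as it shipped in Python 3.5+.
assert b'%4x' % 10 == b'   a'      # numeric codes work as for str
assert b'%c' % 48 == b'0'          # %c from an int in range(256)
assert b'%c' % b'a' == b'a'        # ...or from a length-1 bytes
assert b'%s' % b'abc' == b'abc'    # %s takes bytes-like input

# str still refuses to interpolate implicitly, as the PEP intended:
try:
    b'%s' % 'hello world!'
except TypeError:
    pass

# Explicit encoding is the sanctioned path for text values:
assert b'%s' % 'hello'.encode('latin-1') == b'hello'
```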