Re: [Python-Dev] PEP 460 reboot
On 14Jan2014 20:23, Antoine Pitrou wrote: > On Tue, 14 Jan 2014 10:52:05 -0800 > Guido van Rossum wrote: > > Quite a few people have spoken out in favor of loud > > failures rather than silent "wrong" output. But I think that in the > > specific context of formatting output, there is a long and IMO good > > tradition of producing (slightly) wrong output in favor of more > > strict behavior. Consider for example what to do when a number > > doesn't fit in the given width. Would you rather raise an exception, > > truncate the > > value, or mess up the formatting? All languages newer than Fortran > > that I've used have chosen the latter, and I still agree it's a good > > idea. > > Well that's useful when printing out human-readable stuff on stdout, > much less when you're emitting binary data that's supposed to conform > to a well-defined protocol. I expect bytes formatting to be used for > the latter, not the former. I'm 12 hours behind in this thread still, but I'm with Antoine here. With protocols, there's a long and IMO good tradition in the RFC world of being generous in what you accept and conservative in what you send, and writing bytes data constitutes "send" to my mind. While having numbers overflow their widths is (only) often ok for human reports, even that is a PITA for machine parsing later. By way of a text example, my personal bugbear is the UNIX "ps" command in its many flavours. It has fixed width columns with fields that frequently overflow these days, and the overflowing numbers abut each other. Post processing this rubbish is a disaster (I don't want to write "ps", but I have written things that want to read its output). Of course the fix is easy in some ways, use format strings saying "%-5d %-5d %-5d" instead of "%-6d%-6d%-6d". But the authors of ps didn't. And quietly overflowing these fields is exactly what breaks my post processing programs. Morally, this is the same as mojibake. 
Therefore I am firmly in the "fail loudly" camp: if the format string doesn't behave as you naively expected it to, find out early while you can easily fix it. Cheers, -- Cameron Simpson Motorcycles are like peanuts... who can stop at just one? - Zebee Johnstone aus.motorcycles Poser Permit #1 ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
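The "ps" complaint above can be reproduced in a few lines. This is only a sketch: the row values and the helper names (`render`, `parse`, `strict_field`) are invented for illustration, and the two format strings are the ones quoted in the message.

```python
# Made-up process rows: (pid, ppid, size). Only the second row overflows.
rows = [(1, 0, 4096), (123456, 1234567, 98765432)]

ABUTTING = "%-6d%-6d%-6d"    # fixed-width fields with no separator
SPACED = "%-5d %-5d %-5d"    # narrower fields, but always one space between


def render(fmt, row):
    return fmt % row


def parse(line):
    # Whitespace splitting only recovers the fields if they never touch.
    return tuple(int(field) for field in line.split())


def strict_field(value, width):
    # The "fail loudly" alternative: refuse to overflow the column.
    text = "%d" % value
    if len(text) > width:
        raise ValueError("value %d does not fit in %d columns" % (value, width))
    return text.ljust(width)


for row in rows:
    print(repr(render(ABUTTING, row)), "->", parse(render(ABUTTING, row)))
    print(repr(render(SPACED, row)), "->", parse(render(SPACED, row)))
```

With the abutting format the overflowing row collapses into one run of digits and cannot be split back into three fields; the spaced format still overflows its widths but stays parseable.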
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 10:58 PM, Stephen J. Turnbull wrote: At the very least, the "iterated interpolation is a bad idea" misfeature needs to be documented. I'm not sure it needs any extra attention. Even with str, iterated interpolation is tricky -- I've been bitten by it more than once, and that even when I controlled the source! :/ -- ~Ethan~ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan writes:

> Yes, I'm currently thinking the appropriate approach to the docs
> will be to remove the current "these have most of the str methods
> too" paragraph for binary sequences and instead create three
> completely explicit lists of methods:
> - provided, works with arbitrary data
> - provided, assumes the use of an ASCII compatible data format

I'm not sure what that means. If you mean that in the format string for .format() and %-formatting, bytes 0-127 must always have ASCII coded character semantics with bytes 128-255 unrestricted, indeed, that is the pragmatic restriction. Is there anything else?

The implications of this should be made clear, though: funky Asian encodings cannot be safely used in format strings for format(), GB18030 isn't safe in %-formatting either, and the value returned by these operations should be assumed to be non-ASCII-compatible unless proven otherwise (no iterated formatting).

I think you also need

  - provided, assumes pure ASCII-encoded text

since as far as I know the only strictly ASCII-compatible binary formats are ISO 2022-compatible encodings and UTF-8, i.e., text, and the characters represented with bytes in the range 128-255 are not handled by bytes versions of the case-checking and case-converting operations, and so have extremely dubious semantics unless the data is pure ASCII. This is also true of most of the is_* operations.

Note that .center and .strip have pretty dubious semantics for arbitrary "ASCII-compatible" data:

    >>> b"abc\r\n".center(15)
    b'     abc\r\n     '
    >>> " \xA0abc\xA0 ".strip()
    'abc'
    >>> b" \xA0abc\xA0 ".strip()
    b'\xa0abc\xa0'

Of course the case of .center() is purely a programmer error, and I don't have a use case where it's problematic in practice. But it's sort of unpleasant.
Although I have internalized Guido's point that what's important is that there be no implicit conversions between bytes and str, I still worry that this slew of subtle semantic differences when moving str methods wholesale to bytes is a bug magnet. I have an especially bad feeling about str-into-bytes interpolation. If people want that, they should use a type like asciistr that provides more or less firm guarantees that the content is pure ASCII. > - not provided > PEP 461 would add a fourth category, of being provided, but with > more restricted semantics. I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to have time this week. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
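The "subtle semantic differences" between str methods and their bytes counterparts can be seen directly. The sample word below is an arbitrary choice; the sketch only uses standard str and bytes methods.

```python
text = "caf\xe9"                   # 'café' as str
latin1 = text.encode("latin-1")    # b'caf\xe9'
utf8 = text.encode("utf-8")        # b'caf\xc3\xa9'

# str knows that é has an uppercase form; bytes only touches ASCII letters.
print(text.upper())
print(latin1.upper())
print(utf8.upper())

# The is* predicates disagree in the same way.
print("\xe9".isalpha(), b"\xe9".isalpha())

# And so does whitespace handling, as in the .strip() example above.
print(repr(" \xa0abc\xa0 ".strip()), repr(b" \xa0abc\xa0 ".strip()))
```

Decoding `latin1.upper()` back with Latin-1 gives mixed-case text, which is the kind of quiet damage the message warns about.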
Re: [Python-Dev] PEP 460 reboot
On 15 Jan 2014 20:58, "Stephen J. Turnbull" wrote: > > Aside: OK, Guido, ya got me. > > I have a separate screed recounting the reasons for my apostasy, but > that's probably not interesting any more. I'll send it to individuals > on request. > > > But in terms of explaining the text model, that > > separation is important enough that > > > > (1) We should be reluctant to strengthen the > > "its really just ASCII" messages. > > True. I think the right message is is "Unless you know why you > *desperately* want this, not only don't you need it, but using it is > the Python equivalent of skydiving without a parachute." > > N.B. Don't take the metaphor as an insult. I think it's become clear > that those who "desperately want this" not only use parachutes, they > pack their own. No need to worry about them. > > > (2) It *may* be worth creating a virtual > > split in the documentation. > > Please don't. All we need to tell naive users is: > > Look at the structure of the bytes. If that structure is "text", > convert to str using .decode(). Please don't use bytes. > > If that structure isn't text, you're in a specialist domain, and > it's your problem. Many structured uses of bytes use ASCII- > encoded keywords: we provide bytes methods for handling them, but > you *must* be aware that these methods *cannot* distinguish "bytes > representing text encoded as ASCII" from "any old bytes". Be > warned: They will happily -- and silently -- corrupt the latter. > Make sure you respect the higher-level structure of your data when > using them. Yes, I'm currently thinking the appropriate approach to the docs will be to remove the current "these have most of the str methods too" paragraph for binary sequences and instead create three completely explicit lists of methods: - provided, works with arbitrary data - provided, assumes the use of an ASCII compatible data format - not provided PEP 461 would add a fourth category, of being provided, but with more restricted semantics. 
Cheers, Nick. > > > Virtual subclass ASCIIStructuredBytes > > > > > > One particularly common use of bytes is to represent > > the contents of a file, or of a network message. In > > these cases, the bytes will often represent Text > > *in a specific encoding* and that encoding will usually > > be a superset of ASCII. Rather than create and support > > an ASCIIStructuredBytes subclass, Python simply added > > support for these use cases straight to Bytes objects, > > and assumes that this support simply won't be used when > > when it does not make sense. For example, bytes literals > > This is going quite the wrong direction, I think. The only people who > should care about "Text *in a specific encoding* and that encoding > will usually be a superset of ASCII" are codec writers, and by now > writing those is a very rare task. Everybody else uses ASCII keywords > in a simple formal language. > > > *could* be used to construct a sound sample, but the > > literals will be far easier to read when they are used > > to represent (encoded) ASCII text, such as "OPEN". > > ___ > Python-Dev mailing list > Python-Dev@python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
Aside: OK, Guido, ya got me. I have a separate screed recounting the reasons for my apostasy, but that's probably not interesting any more. I'll send it to individuals on request. > But in terms of explaining the text model, that > separation is important enough that > > (1) We should be reluctant to strengthen the > "its really just ASCII" messages. True. I think the right message is is "Unless you know why you *desperately* want this, not only don't you need it, but using it is the Python equivalent of skydiving without a parachute." N.B. Don't take the metaphor as an insult. I think it's become clear that those who "desperately want this" not only use parachutes, they pack their own. No need to worry about them. > (2) It *may* be worth creating a virtual > split in the documentation. Please don't. All we need to tell naive users is: Look at the structure of the bytes. If that structure is "text", convert to str using .decode(). Please don't use bytes. If that structure isn't text, you're in a specialist domain, and it's your problem. Many structured uses of bytes use ASCII- encoded keywords: we provide bytes methods for handling them, but you *must* be aware that these methods *cannot* distinguish "bytes representing text encoded as ASCII" from "any old bytes". Be warned: They will happily -- and silently -- corrupt the latter. Make sure you respect the higher-level structure of your data when using them. > Virtual subclass ASCIIStructuredBytes > > > One particularly common use of bytes is to represent > the contents of a file, or of a network message. In > these cases, the bytes will often represent Text > *in a specific encoding* and that encoding will usually > be a superset of ASCII. Rather than create and support > an ASCIIStructuredBytes subclass, Python simply added > support for these use cases straight to Bytes objects, > and assumes that this support simply won't be used when > when it does not make sense. 
For example, bytes literals This is going quite the wrong direction, I think. The only people who should care about "Text *in a specific encoding* and that encoding will usually be a superset of ASCII" are codec writers, and by now writing those is a very rare task. Everybody else uses ASCII keywords in a simple formal language. > *could* be used to construct a sound sample, but the > literals will be far easier to read when they are used > to represent (encoded) ASCII text, such as "OPEN". ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).

Only if you insist that bytes formats be admitted. But that's an implementation detail, really. (I'm not going to push that point, since it's the obvious way to request a bytes result, and insist on the various restrictions and semantic differences proposed for bytes interpolation -- anything else would be silly.)

More seriously, it's irrelevant *post*-interpolation, because by definition bytes interpolation interpolates bytes, not "ASCII compatible". So what you're saying is iterated interpolation is crazy:

    width1, width2 = compute_column_widths(table_rows)
    fmt = b"%%%ds %%%ds\n" % (width1, width2)
    for row in table_rows:
        print(fmt % row)    # might be useful in debugging ;-)
                            # writing to a file is plausible IMO

Tell me again why we have a '%%' format code? :-)

> (which must make life interesting if you try to use an ASCII
> incompatible coding cookie for your source code - I'm actually not
> sure what the full implications of that *are* for bytes literals in
> Python 3).

Currently None:

    me 15:46$ python3.3 test.py
      File "test.py", line 2
    SyntaxError: bytes can only contain ASCII literal characters.

:-)

> It's certainly a decision that has its downsides, with the potential
> impact on users of ASCII incompatible encodings (mostly in Asia)

Which is most of the world at this point. You ISO-8859-speakers are gonna wither away! :-) Nor do I think there's anybody crazy enough to make a Tiananmen Square-style stand against GB18030. In 2025 this could be Python's most sensitive Achilles' heel. Hm. Maybe I should put a fractional coefficient on that.

> being the main one, but I think the increased convenience in
> working with ASCII compatible binary protocols and file formats is
> worth the cost.
But there aren't any ASCII-compatible binary protocols in the sense that Shift JIS is *not* ASCII compatible. After interpolation, you end up with something that's not ASCII compatible. At the very least, the "iterated interpolation is a bad idea" misfeature needs to be documented. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
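The iterated-interpolation hazard can be sketched with bytes %-formatting as it later shipped (Python 3.5 and newer, which is an assumption relative to this thread). The helper names `make_fmt` and `make_fmt_escaped` and the labels are hypothetical.

```python
# Pass 1 builds a format with computed widths; pass 2 fills in the data.
fmt = b"%%%ds %%%ds\n" % (8, 4)
print(fmt)                         # the doubled %% became live % codes
print(fmt % (b"name", b"42"))


def make_fmt(label):
    # Hypothetical helper: bakes a label into a format for a second pass.
    return b"%s: %%d\n" % (label,)


def make_fmt_escaped(label):
    # Same, but escapes any % in the data before it becomes a format.
    return b"%s: %%d\n" % (label.replace(b"%", b"%%"),)


# A % in the interpolated data is indistinguishable from a format code.
try:
    make_fmt(b"100%") % (5,)
except (ValueError, TypeError) as exc:
    print("second pass rejected the format:", type(exc).__name__)

# Worse, data that happens to contain %% is silently rewritten.
print(make_fmt(b"100%%") % (5,))

# Escaping before the first pass keeps the label intact.
print(make_fmt_escaped(b"100%") % (5,))
```

The result of the first pass is only safe to reuse as a format when every interpolated value is known to be free of `%`, or has been escaped as in `make_fmt_escaped`.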
Re: [Python-Dev] PEP 460 reboot
I am exhausted from all these discussions. I just recommend not touching those docs. On Tue, Jan 14, 2014 at 8:08 PM, Jim Jewett wrote: > On Tue, Jan 14, 2014 at 3:06 PM, Guido van Rossum wrote: >> Personally I wouldn't add any words suggesting or referring to the >> option of creation another class for this purpose. You wouldn't >> recommend subclassing dict for constraining the types of keys or >> values, would you? > > Yes, and it is so clear that I suspect I'm missing some context for > your question. > > Do I recommend that each individual application should create new > concrete classes instead of just using the builtins? No. > > When trying to understand (learn about) the text/binary distinction, I > do recommend pretending that they are represented by separate classes. > Limits on the values in a bytearray are NOT the primary reason for > this; the primary reason is that operations like the literal > representation or the capitalize method are arbitrary nonsense unless > the data happens to be representing ASCII. > > sound_sample.capitalize() -- syntactically valid, but semantic garbage > header.capitalize() -- OK, which implies that data is an instance > of something more specific than bytes. > > Would I recommend subclassing dict if I wanted to constrain the key > types? Yes -- though MutableMapping (fewer gates to guard) or the > upcoming TransformDict would probably be better still. > > The existing dict implementation itself effectively uses (hidden, > quasi-)subclasses to restrict types of keys strictly for efficiency. 
> (lookdict* variants) > > -jJ > ___ > Python-Dev mailing list > Python-Dev@python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 3:06 PM, Guido van Rossum wrote: > Personally I wouldn't add any words suggesting or referring to the > option of creation another class for this purpose. You wouldn't > recommend subclassing dict for constraining the types of keys or > values, would you? Yes, and it is so clear that I suspect I'm missing some context for your question. Do I recommend that each individual application should create new concrete classes instead of just using the builtins? No. When trying to understand (learn about) the text/binary distinction, I do recommend pretending that they are represented by separate classes. Limits on the values in a bytearray are NOT the primary reason for this; the primary reason is that operations like the literal representation or the capitalize method are arbitrary nonsense unless the data happens to be representing ASCII. sound_sample.capitalize() -- syntactically valid, but semantic garbage header.capitalize() -- OK, which implies that data is an instance of something more specific than bytes. Would I recommend subclassing dict if I wanted to constrain the key types? Yes -- though MutableMapping (fewer gates to guard) or the upcoming TransformDict would probably be better still. The existing dict implementation itself effectively uses (hidden, quasi-)subclasses to restrict types of keys strictly for efficiency. (lookdict* variants) -jJ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
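The sound_sample / header contrast can be made concrete. The byte values used for `sound_sample` are made up; the point is only that `capitalize()` rewrites whichever bytes happen to fall in the ASCII letter ranges.

```python
header = b"content-type"
sound_sample = bytes([0x7F, 0x61, 0x41, 0x80, 0x5A])   # made-up sample values

# Meaningful: the data really is ASCII text.
print(header.capitalize())

# Syntactically valid, semantically garbage: 0x41 and 0x5A look like 'A'
# and 'Z', so they are "lowercased" and the sample is corrupted.
damaged = sound_sample.capitalize()
print(list(sound_sample))
print(list(damaged))
```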
Re: [Python-Dev] PEP 460 reboot
On 14Jan2014 11:43, Jim Jewett wrote: > Greg Ewing replied: > >> ... ASCII compatible binary data is a > >> *subset* of arbitrary binary data. > > I wrote: [...] > >(2) It *may* be worth creating a virtual > > split in the documentation. [...] > > Ethan likes the idea, but points out that the term > "Virtual" is confusing here. [...] > (A) What word should I use instead of "Virtual"? > Imaginary? Pretend? I'd title it in terms of a common use case, not a "virtual class". You even phrase the opening sentence as a use case already. > (B) Would it be good/bad/at least make the docs > easier to create an actual class (or alias)? > (C) Same question for a pair of classes provided > only in the documentation, like example code. I don't think so. People might use it:-( [...] > > A Bytes object could represent anything, [...] Tiny nit: shouldn't that be "bytes", not "Bytes"? > > appropriate as the underlying storage for a sound sample > > or image file. > > > > Virtual subclass ASCIIStructuredBytes > > Possible alternate title: Common use case: bytes containing text sequences, especially ASCII Cheers, -- Cameron Simpson I think... Therefore I ride. I ride... Therefore I am. - Mark Pope ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 1/14/2014 10:11 AM, Jim J. Jewett wrote:

> Virtual subclass ASCIIStructuredBytes

You would first have to define what you meant by a virtual subclass, and that definition would have to be linked from every place you use the term, because it is a new term.

Why not just call the sections of the documentation where ASCII-supporting features of bytes are discussed "Special ASCII support"? Calling it that will make it clear that if you are not using ASCII, you need to be careful of using the feature... or contrariwise, that if you are using the feature, you need to be using ASCII. While some ASCII supersets may also be usable with the features, I don't think that should be emphasized in any way, unless there is specific support for particular ASCII supersets. Using ASCII supersets should be "buyer beware".

The whole b"%s" interpolation feature would, appropriately, be described in such a section.
Re: [Python-Dev] PEP 460 reboot
Guido van Rossum wrote:

> Quite a few people have spoken out in favor of loud failures rather
> than silent "wrong" output. But I think that in the specific context
> of formatting output, there is a long and IMO good tradition of
> producing (slightly) wrong output in favor of more strict behavior.
> Consider for example what to do when a number doesn't fit in the
> given width. Would you rather raise an exception, truncate the
> value, or mess up the formatting?

That depends on the context. If the output is simply a text file whose lines can grow to accommodate the extra width, messing up the formatting is probably okay.

If it's going into a printed report with a strictly limited width for each column, and anything that doesn't fit is going to get graphically clipped away, with no visual indication that this has happened, it's NOT okay.

If it's going into a text file with defined columns for each field, which will be read by something that assumes certain things are in certain columns, it's NOT okay.

If it's going into a binary file as a field consisting of a length byte followed by some chars, messing up the formatting is DEFINITELY NOT okay.

This latter kind of situation is the one we're talking about. If you do something like

    b"%c%s" % (len(data), data)

and data is a str, then the length byte will be correct, but the data will be (at least) 3 bytes too long. Whatever reads the file then gets out of step at that point, and all hell breaks loose.

You do *not* get a nice, easy-to-debug symptom from this kind of thing. You get "Something is wrong somewhere in this 50 megabyte jpg file, good luck on finding out what and why".

-- Greg
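For comparison, here is how the length-byte example behaves with bytes %-formatting as it eventually shipped in Python 3.5 and newer (an assumption relative to this thread, which predates that release). `pack_field` is a hypothetical helper name.

```python
def pack_field(data):
    """One length byte followed by the payload; data must already be bytes."""
    return b"%c%s" % (len(data), data)


print(pack_field(b"abc"))

# Text has to be encoded first, and the length taken from the encoded form:
# "é" is one character but two bytes in UTF-8.
print(pack_field("\xe9".encode("utf-8")))

# Both failure modes are loud: a str payload, and a payload whose length
# does not fit in a single byte.
for bad in ("abc", b"x" * 256):
    try:
        pack_field(bad)
    except (TypeError, OverflowError) as exc:
        print("rejected:", type(exc).__name__)
```

In this version a forgotten `.encode()` surfaces as an exception at the call site rather than as a corrupt file discovered much later.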
Re: [Python-Dev] PEP 460 reboot
Ethan Furman wrote:

> On 01/14/2014 10:11 AM, Jim J. Jewett wrote:
> > But in terms of explaining the text model, that
> > separation is important enough that
> >
> > (2) It *may* be worth creating a virtual
> > split in the documentation.
>
> I think (2) is a great idea.

I don't think it's such a great idea to belabour this point. The notion of an ASCIIStructuredBytes type seems to assume that you have *either* ascii-encoded text *or* some other kind of data. But many of the use cases for all of this involve constructing a single object, parts of which are one and parts of which are another.

It's hard to think of that in terms of virtual classes unless you're willing to imagine that different parts of the same object are of different types, which, for a primitive object like bytes, doesn't make sense in the context of the Python object model.

By all means point out that the ascii features of bytes are intended for use on data that happens to be ascii, and shouldn't be used otherwise. But I think that talking about "virtual classes" just risks confusing people, particularly when we have ABCs, which are also a kind of virtual class represented by real class objects.

-- Greg
Re: [Python-Dev] PEP 460 reboot
On 1/14/2014 4:46 AM, Nick Coghlan wrote: The one remaining way I could potentially see a formatb method working is along the lines of what Glenn (I think) suggested: just like struct definitions, the formatb specifier would have to consist*solely* of substitution fields. However, that's getting awfully close to being just an alternate spelling for the struct module or bytes.join at that point, which hardly makes for a compelling case to add two new methods to a builtin type. Yes, after someone drew the parallel between my "format specifier only" pedantry, and struct.pack (which I hadn't used), I agree that they are almost just different spellings for the same things. The two differences I could see is that struct.pack doesn't support variable length items, and struct.pack doesn't support "interpolation", which is the whole beauty of the % type syntax... the ability to have a template, and interpolate values. My pedantry DID allow for template work, but they had to be specified in HEX the way I specified it yesterday. Let me repeat that syntax: b"%{hex-codes}v" That was mostly so the format string could be ASCII, yet represent any byte. That is somewhat clunky, when actually wanting to represent characters. At the next level of abstraction, one could define a "format builder" that would take Unicode specifications, and "compile" them into the binary interpolation strings, but if doing that, you could just as well compile them into functions using struct.pack formats, with the parameters interspersed with the "template" data, except for struct.pack's inability to deal with variable length data. 
So struct is attempting to emulate C structs, and variable length data is extremely awkward in C structs also, so I guess it provides a good emulation :) So if I were to look for features to add to Python3 to support template interpolation for users of non-ASCII encodings, which could, of course, also be used by users of ASCII-based encodings, I guess I would recommend: 1) extend struct to handle variable length data items 2) provide a sample format compiler function that would translate a Unicode format description into a function that would use struct.pack, and pre-encode (according to the format specification) the template parts into parameters for struct.pack). ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
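The usual workaround for struct's lack of variable-length items is to build the format string at run time and carry an explicit length field. The record layout below (one kind byte, a two-byte big-endian length, then the payload) and the function names are invented for illustration, not something proposed in the thread.

```python
import struct


def pack_record(kind, payload):
    # ">BH" = 1-byte kind, 2-byte big-endian length; "%ds" is sized per call.
    fmt = ">BH%ds" % len(payload)
    return struct.pack(fmt, kind, len(payload), payload)


def unpack_record(blob):
    # Read the fixed header first, then use the length to size the payload.
    kind, length = struct.unpack_from(">BH", blob, 0)
    (payload,) = struct.unpack_from("%ds" % length, blob, 3)
    return kind, payload


record = pack_record(1, b"OPEN")
print(record)
print(unpack_record(record))
```

A payload longer than 65535 bytes makes `struct.pack` raise `struct.error`, so an oversized field fails loudly rather than being truncated.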
Re: [Python-Dev] PEP 460 reboot
Let me answer you both since the issues are related. On 1/14/2014 7:46 AM, Nick Coghlan wrote: Guido van Rossum writes: > And that is precisely my point. When you're using a format string, Bytes interpolation uses a bytes format, or a byte string if you will, but it should not be thought of as a character or text string. Certain bytes (123 and 125) delimit a replacement field. The bytes in between define, in my version, a format-spec after being ascii-decoded to text for input to 3.x format(). The decoding and subsequent encoding would not be needed if 2.7 format(ob, byte-spec) were available. > all of the format string (not just the part between { and }) had > better use ASCII or an ASCII superset. I am not even sure what you mean here. The bytes outside of 123 and 125 are simply copied to the output string. There is no encoding or interpretation involved. It is true that the uninterpred bytes best not contain a byte pattern mistakenly recognized as a replacement field. I plan to refine the relational expression byte pattern used in byteformat to sharply reduce the possibility of such errors. When such errors happen anyway, an exception should be raised, and I plan to expand the error message to give more diagnostic information. And this (rightly) constrains the output to an ASCII superset as well. What does this mean? I suspect I disagree. The bytes interpolated into the output bytes can be any bytes. Except that if you interpolate something like Shift JIS, Bytes interpolation interpolates bytes, not encodings. A self-identifying byte stream starts with bytes in a known encoding that specifies the encoding of the rest of the stream. Neither part need be encoded text. (Would that something like were standard for encoded text streams, as well as for serialized images.) 
>> [snip]

> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).

I would put this slightly differently. To process bytes, we may define certain bytes as metabytes with a special meaning. We may choose the bytes that happen to be the ascii encoding of certain characters. But once the special numbers are chosen, they are numbers, not characters.

The problem of metabytes having both a normal and special meaning is similar to the problem of metacharacters having both a normal and special meaning.

> For printf-style formatting, it's % along with the various formatting
> characters and other syntax (like digits, parentheses, variable names
> and "."), with the format method it's braces, brackets, colons,
> variable names, etc.

It is the bytes corresponding to these characters. This is true also of the metabytes in an re module bytes pattern.

> The mini-language parser has to assume an encoding
> in order to interpret the format string,

This is where I disagree with you and Guido. Bytes processing is done with numbers 0 <= n <= 255, not characters. The fact that ascii characters can, for convenience, be used in bytes literals to indicate the corresponding ascii codes does not change this. A bytes parser looks for certain special numbers. Other numbers need not be given any interpretation and need not represent encoded characters.

> and that's *all* done assuming an ASCII compatible format string

Since any bytes can be regarded as an ascii-compatible latin-1 encoded string, that seems like a vacuous assumption. In any case, I do not see any particular assumption in the following, other than the choice of replacement field delimiters.
>>> list(byteformat(bytes([1,2,10, 123, 125, 200]), (bytes([50, 100, 150]),))) [1, 2, 10, 50, 100, 150, 200] > (which must make life interesting if you try to use an ASCII incompatible coding cookie for your source code - I'm actually not sure what the full implications of that *are* for bytes literals in Python 3). An interesting and important question. The Python 2 manual says that the coding cookie applies to only to comments and strings. To me, this suggests that any encoding can be used. I am not sure how and when the encoding is applied. It suggests that the sequence of bytes resulting from a string literal is not determined by the sequence of characters comprising the string literal, but also depends on the coding cookie. The Python 3 manual says that the coding cookie applies to the whole source file. To me, this says that the subset of unicode chars included in the encoding *must* include the ascii characters. It also suggest to me that the encoding must also ascii-compatible, in order to read the encoding in the ascii-text coding cookie (unless there is a fallback to the system encoding). In any case, a 3.x source file is decoded to unicode. When the sequence of unicode chars comprising a bytes literal is interpreted, the re
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 01:17 PM, Mark Lawrence wrote:
> On 14/01/2014 20:54, Guido van Rossum wrote:
> > On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote:
> > > In Py2, because '%15s' can actually take 17 characters, I have to
> > > use '%15s' % data_value[:15] everywhere.
> >
> > Wow. I thought there would be some combination using %.15s but I
> > can't get that to work. :-(
>
> I believe you wanted this.
>
>     >>> a='01234567890123456'
>     >>> len(a)
>     17
>     >>> b = '%15.15s' % a
>     >>> b;len(b)
>     '012345678901234'
>     15

Cool!

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
On 2014-01-14 20:54, Guido van Rossum wrote:
> On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote:
> > On 01/14/2014 10:52 AM, Guido van Rossum wrote:
> > > Which reminds me. Quite a few people have spoken out in favor of
> > > loud failures rather than silent "wrong" output. But I think that
> > > in the specific context of formatting output, there is a long and
> > > IMO good tradition of producing (slightly) wrong output in favor
> > > of more strict behavior. Consider for example what to do when a
> > > number doesn't fit in the given width. Would you rather raise an
> > > exception, truncate the value, or mess up the formatting?
> >
> > One more data point to consider: When the binary format has strict
> > rules on how much space a data-point is allowed, then failure is
> > the only appropriate option.
>
> Yes, that's how the struct module works.
>
> > In Py2, because '%15s' can actually take 17 characters, I have to
> > use '%15s' % data_value[:15] everywhere.
>
> Wow. I thought there would be some combination using %.15s but I
> can't get that to work. :-(

I'm not sure what you mean here:

    Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import string
    >>> '%.15s' % string.letters
    'abcdefghijklmno'
    >>> len(_)
    15

> > I'm not suggesting we change how that portion works, as it would
> > then be, I think, too different from both Py2 behavior as well as
> > current str behavior, but likewise adding in single quotes would be
> > of no help to me. Loud failure so I can easily see where I forgot
> > the .encode() would be much more helpful.
>
> If we go with a more restricted version this makes sense indeed. The
> single quotes seemed unavoidable when I was trying (like several
> other proposals) to have a format code that works for all types. I
> think we're rightly giving up on that now. (I should review PEP 461,
> but I don't have time yet.)
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 01:15 PM, Eric V. Smith wrote:
> On 1/14/2014 3:54 PM, Guido van Rossum wrote:
>> On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote:
>>> In Py2, because '%15s' can actually take 17 characters, I have to use
>>> '%15s' % data_value[:15] everywhere.
>>
>> Wow. I thought there would be some combination using %.15s but I can't
>> get that to work. :-(
>
> >>> '%.15s' % 'abcdefghij1234567'
> 'abcdefghij12345'
> >>> '{:.15}'.format('abcdefghij1234567')
> 'abcdefghij12345'
>
> Or, depending on what you're after:
>
> >>> '%15.15s' % 'abcde'
> '          abcde'
> >>> '%15.15s' % 'abcdefghij1234567'
> 'abcdefghij12345'

Huh. Wish I'd known about that way back when! ;)

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Jan 14, 2014, at 10:52 AM, Guido van Rossum wrote:

>Which reminds me. Quite a few people have spoken out in favor of loud
>failures rather than silent "wrong" output. But I think that in the
>specific context of formatting output, there is a long and IMO good
>tradition of producing (slightly) wrong output in favor of more strict
>behavior.

In the email package we now have a tradition of allowing either behavior.

http://docs.python.org/3.4/library/email.policy.html#email.policy.Policy.raise_on_defect

Perhaps not appropriate for the PEP 460 related cases, but I think the policy
mechanism works great for email parsing, where sometimes you definitely want
to fail early (e.g. you are composing new messages out of literal strings) and
other times you are willing to put up with some best-effort representation in
exchange for no exceptions being raised (e.g. you are parsing messages being
fed to you from your mail server).

-Barry
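The either-behavior policy Barry describes can be sketched with the standard email package. This is a minimal illustration, not taken from the thread: the sample message text and the variable names are assumptions chosen to trigger a parsing defect.

```python
import email
from email import errors, policy

# A message that claims to be multipart but never declares a boundary.
raw = "Content-Type: multipart/mixed\n\nbody text\n"

# Lenient parsing (the default policy): defects are recorded on the
# message object and parsing carries on with a best-effort result.
lenient = email.message_from_string(raw)
recorded = [type(d).__name__ for d in lenient.defects]

# Strict parsing: the same input raises as soon as a defect is found.
strict_policy = policy.compat32.clone(raise_on_defect=True)
try:
    email.message_from_string(raw, policy=strict_policy)
except errors.MessageDefect as exc:
    raised = type(exc).__name__
else:
    raised = None

print(recorded, raised)
```

The caller picks the failure mode per use: strict when composing messages from trusted literals, lenient when reading whatever a mail server hands over.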
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote:
> The mini-language parser has to assume an encoding in order to interpret
> the format string, and that's *all* done assuming an ASCII compatible
> format string (which must make life interesting if you try to use an
> ASCII incompatible coding cookie for your source code

I don't think it's all *that* interesting. As long as you're able to type
the relevant characters on your keyboard and have them displayed in a
recognisable way in your editor, then what looks like

    b"Content-Length: %d"

in your source will end up encoded as ascii in the bytes object, whatever
the encoding of the source file.

If the source file uses an encoding that can't even represent the formatting
characters, then you're in trouble -- but you'd have a hard time writing
Python code at all in such an environment!

> It's certainly a decision that has its downsides, with the potential
> impact on users of ASCII incompatible encodings (mostly in Asia) being
> the main one,

I don't think it will have much impact on them, other than maybe they will
find fewer use cases for it. But the main intended use cases are for things
like http headers which have protocol-mandated ascii-ish bits, and those
bits are still just as ascii-ish in China as they are anywhere else.

--
Greg
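For readers on a later interpreter, the header case Greg mentions can be tried directly. This sketch assumes Python 3.5 or newer, where bytes %-formatting exists (it did not at the time of this thread); the body and host values are made up for illustration.

```python
# Requires Python 3.5+ (bytes %-formatting).
body = b'{"ok": true}'
header = b"Content-Length: %d\r\n" % len(body)

# The ASCII characters of the literal are stored as ASCII bytes,
# whatever encoding the source file itself uses.
same_as_text = header == ("Content-Length: %d\r\n" % len(body)).encode("ascii")

# Interpolating str into bytes fails on the type, not on the value.
try:
    b"Host: %s\r\n" % "example.org"
except TypeError:
    str_rejected = True
else:
    str_rejected = False

print(header, same_as_text, str_rejected)
```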
Re: [Python-Dev] PEP 460 reboot
On 14/01/2014 20:54, Guido van Rossum wrote:
> On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote:
>> In Py2, because '%15s' can actually take 17 characters, I have to use
>> '%15s' % data_value[:15] everywhere.
>
> Wow. I thought there would be some combination using %.15s but I can't
> get that to work. :-(

I believe you wanted this.

>>> a='01234567890123456'
>>> len(a)
17
>>> b = '%15.15s' % a
>>> b;len(b)
'012345678901234'
15

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
Re: [Python-Dev] PEP 460 reboot
On 1/14/2014 3:54 PM, Guido van Rossum wrote:
> On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote:
>> In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
>> % data_value[:15] everywhere.
>
> Wow. I thought there would be some combination using %.15s but I can't
> get that to work. :-(

>>> '%.15s' % 'abcdefghij1234567'
'abcdefghij12345'
>>> '{:.15}'.format('abcdefghij1234567')
'abcdefghij12345'
>>>

Or, depending on what you're after:

>>> '%15.15s' % 'abcde'
'          abcde'
>>> '%15.15s' % 'abcdefghij1234567'
'abcdefghij12345'
>>>

> (I should review PEP 461, but I don't have time yet.)

Same here.
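The examples above rest on one distinction: in a %-format the width is a minimum and the precision is a maximum. A minimal sketch that puts the cases side by side (the variable names are illustrative only):

```python
# Width is a minimum field size; precision is a maximum number of characters.
value = 'abcdefghij1234567'           # 17 characters

padded_only = '%15s' % value          # width alone never truncates
truncated = '%.15s' % value           # precision alone truncates, never pads
fixed = '%15.15s' % value             # both together: exactly 15 characters
short = '%15.15s' % 'abcde'           # short values are right-aligned, padded

print(len(padded_only), len(truncated), len(fixed), len(short))
# prints: 17 15 15 15
```

So '%15.15s' replaces the '%15s' % data_value[:15] idiom whenever a field must be exactly 15 characters wide.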
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman wrote: > On 01/14/2014 10:52 AM, Guido van Rossum wrote: >> >> Which reminds me. Quite a few people have spoken out in favor of loud >> failures rather than silent "wrong" output. But I think that in the >> specific context of formatting output, there is a long and IMO good >> tradition of producing (slightly) wrong output in favor of more strict >> behavior. Consider for example what to do when a number doesn't fit in >> the given width. Would you rather raise an exception, truncate the >> value, or mess up the formatting? > > One more data point to consider: When the binary format has strict rules on > how much space a data-point is allowed, then failure is the only appropriate > option. Yes, that's how the struct module works. > In Py2, because '%15s' can actually take 17 characters, I have to use '%15s' > % data_value[:15] everywhere. Wow. I thought there would be some combination using %.15s but I can't get that to work. :-( > I'm not suggesting we change how that portion works, as it would then be, I > think, too different from both Py2 behavior as well as current str behavior, > but likewise adding in single quotes would of no help to me. Loud failure > so I can easily see where I forgot the .encode() would be much more helpful. If we go with a more restricted version this makes sense indeed. The single quotes seemed unavoidable when I was trying (like several other proposals) to have a format code that works for all types. I think we're rightly giving up on that now. (I should review PEP 461, but I don't have time yet.) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
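The contrast between struct and text formatting can be shown in a few lines. A minimal sketch; the format codes and values are chosen for illustration and do not come from the thread.

```python
import struct

# An unsigned byte holds 0..255; a value that does not fit raises
# struct.error instead of being silently widened or wrapped.
packed = struct.pack('>B', 200)

try:
    struct.pack('>B', 300)
except struct.error as exc:
    overflow_error = exc
else:
    overflow_error = None

# Text formatting treats the width as a minimum and lets the field grow.
widened = '%2d' % 12345

print(packed, overflow_error is not None, widened)
```

Note that the loud failure applies to struct's numeric codes; its 's' code pads or truncates byte strings to the given count.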
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 10:52 AM, Guido van Rossum wrote:
> Which reminds me. Quite a few people have spoken out in favor of loud
> failures rather than silent "wrong" output. But I think that in the
> specific context of formatting output, there is a long and IMO good
> tradition of producing (slightly) wrong output in favor of more strict
> behavior. Consider for example what to do when a number doesn't fit in
> the given width. Would you rather raise an exception, truncate the
> value, or mess up the formatting?

One more data point to consider: When the binary format has strict rules on
how much space a data-point is allowed, then failure is the only appropriate
option.

In Py2, because '%15s' can actually take 17 characters, I have to use
'%15s' % data_value[:15] everywhere.

I'm not suggesting we change how that portion works, as it would then be, I
think, too different from both Py2 behavior as well as current str behavior,
but likewise adding in single quotes would be of no help to me. Loud failure
so I can easily see where I forgot the .encode() would be much more helpful.

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 12:04 PM, Eric V. Smith wrote: > On 01/14/2014 01:52 PM, Guido van Rossum wrote: > >> But the way to arrive at this behavior without duplicating a whole lot >> of code seems to be to call the existing text-based __format__ API and >> convert the result to bytes -- for numbers this should be safe (their >> formatting produces just ASCII digits and a selected few other ASCII >> characters) but leads to an undesirable outcome for other types -- not >> just str but also e.g. lists or dicts containing str instances, since >> those call __repr__ on the contained items, and repr() may produce >> non-ASCII bytes. > > That's why I suggested restricting the types supported. If we could live > with just a subset of known types, then we could hard-code the > conversions to bytes. How many types with custom __format__'s are really > getting written to byte strings in 2.x? For that matter, are any lists, > sets, or dicts (or anything else using object.__format__'s conversion > using str()) really getting written to bytes? Do we need to support > these cases? > > In my mind, this comes down to: are we trying to add this just to make > porting easier? In my mind, we wouldn't even be adding feature at all > except for ease of porting 2.x code. So we should focus on what features > are used in the code we're trying to port. I don't think our focus is on > 2.x code that's using u''.format(), it's 2.x code that's been reviewed > and is still using b''.format() because it's building up bytes for a > wire protocol. And that code is not likely to need to format objects > with arbitrary __format__ methods, or even str (in the 3.x sense). It's > only likely to use numbers and bytes (or str in the 2.x sense). Yes, these are exactly the right questions to ask. 
-- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
Personally I wouldn't add any words suggesting or referring to the option of creation another class for this purpose. You wouldn't recommend subclassing dict for constraining the types of keys or values, would you? On Tue, Jan 14, 2014 at 11:43 AM, Jim J. Jewett wrote: > > > > Greg Ewing replied: > >>> ... ASCII compatible binary data is a >>> *subset* of arbitrary binary data. > > I wrote: > >> But in terms of explaining the text model, that >> separation is important enough that > >>(2) It *may* be worth creating a virtual >> split in the documentation. > > (rough sketch below) > > Ethan likes the idea, but points out that the term > "Virtual" is confusing here. > > Alas, I'm not sure what the correct term is. In > addition to "Go for it!" / "Don't waste your time", > I'm looking for advice on: > > (A) What word should I use instead of "Virtual"? > Imaginary? Pretend? > > (B) Would it be good/bad/at least make the docs > easier to create an actual class (or alias)? > > (C) Same question for a pair of classes provided > only in the documentation, like example code. > > (D) What about an abstract class, or several? > > e.g., replacing the XXX TODO of collections.abc.ByteString > with separate abstract classes for ByteSequence, String, > ByteString, and ASCIIByteString? > > (ByteString already includes any bytes or bytearray instance, > so backward compatibility means the String suffix isn't > sufficient for an opt-in-by-instances class.) > > >> I'm willing ot work on (2) if there is general consensus >> that it would be a good idea. As a rough sketch, I >> would change places like >> >> http://docs.python.org/3/library/stdtypes.html#typebytes >> >> from: >> >> Bytes objects are immutable sequences of single bytes. >> Since many major binary protocols are based on the ASCII >> text encoding, bytes objects offer several methods that >> are only valid when working with ASCII compatible data >> and are closely related to string objects in a variety >> of other ways. 
>> >> to something more like: >> >> Bytes objects are immutable sequences of single bytes. >> >> A Bytes object could represent anything, and is >> appropriate as the underlying storage for a sound sample >> or image file. >> >> Virtual subclass ASCIIStructuredBytes >> >> >> One particularly common use of bytes is to represent >> the contents of a file, or of a network message. In >> these cases, the bytes will often represent Text >> *in a specific encoding* and that encoding will usually >> be a superset of ASCII. Rather than create and support >> an ASCIIStructuredBytes subclass, Python simply added >> support for these use cases straight to Bytes objects, >> and assumes that this support simply won't be used when >> when it does not make sense. For example, bytes literals >> *could* be used to construct a sound sample, but the >> literals will be far easier to read when they are used >> to represent (encoded) ASCII text, such as "OPEN". > > > -jJ > > -- > > If there are still threading problems with my replies, please > email me with details, so that I can try to resolve them. -jJ > > ___ > Python-Dev mailing list > Python-Dev@python.org > https://mail.python.org/mailman/listinfo/python-dev > Unsubscribe: > https://mail.python.org/mailman/options/python-dev/guido%40python.org -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 01:52 PM, Guido van Rossum wrote:
> But the way to arrive at this behavior without duplicating a whole lot
> of code seems to be to call the existing text-based __format__ API and
> convert the result to bytes -- for numbers this should be safe (their
> formatting produces just ASCII digits and a selected few other ASCII
> characters) but leads to an undesirable outcome for other types -- not
> just str but also e.g. lists or dicts containing str instances, since
> those call __repr__ on the contained items, and repr() may produce
> non-ASCII bytes.

That's why I suggested restricting the types supported. If we could live
with just a subset of known types, then we could hard-code the
conversions to bytes. How many types with custom __format__'s are really
getting written to byte strings in 2.x? For that matter, are any lists,
sets, or dicts (or anything else using object.__format__'s conversion
using str()) really getting written to bytes? Do we need to support
these cases?

In my mind, this comes down to: are we trying to add this just to make
porting easier? In my mind, we wouldn't even be adding this feature at all
except for ease of porting 2.x code. So we should focus on what features
are used in the code we're trying to port. I don't think our focus is on
2.x code that's using u''.format(); it's 2.x code that's been reviewed
and is still using b''.format() because it's building up bytes for a
wire protocol. And that code is not likely to need to format objects
with arbitrary __format__ methods, or even str (in the 3.x sense). It's
only likely to use numbers and bytes (or str in the 2.x sense).

Eric.
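A rough sketch of what "restrict the types and hard-code the conversions" could look like. The helper name to_wire_bytes and its exact type list are hypothetical, invented here for illustration; neither PEP defines them.

```python
def to_wire_bytes(value):
    """Hypothetical helper: accept only numbers and bytes-like objects."""
    if isinstance(value, (int, float)):
        # str() of a number yields only ASCII characters.
        return str(value).encode('ascii')
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value)
    # Everything else, including str, list and dict, is rejected by type.
    raise TypeError('cannot interpolate %s into bytes' % type(value).__name__)


status_line = b'Status: ' + to_wire_bytes(200)
connection = b'Connection: ' + to_wire_bytes(bytearray(b'keep-alive'))

try:
    to_wire_bytes('world')
except TypeError:
    text_rejected = True
else:
    text_rejected = False

print(status_line, connection, text_rejected)
```

Because the check is on the type, the failure shows up the first time a str reaches the call, regardless of what characters it contains.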
Re: [Python-Dev] PEP 460 reboot
Greg Ewing replied:

>> ... ASCII compatible binary data is a
>> *subset* of arbitrary binary data.

I wrote:

> But in terms of explaining the text model, that
> separation is important enough that

> (2) It *may* be worth creating a virtual
>     split in the documentation.

(rough sketch below)

Ethan likes the idea, but points out that the term
"Virtual" is confusing here.

Alas, I'm not sure what the correct term is. In
addition to "Go for it!" / "Don't waste your time",
I'm looking for advice on:

(A) What word should I use instead of "Virtual"?
    Imaginary? Pretend?

(B) Would it be good/bad/at least make the docs
    easier to create an actual class (or alias)?

(C) Same question for a pair of classes provided
    only in the documentation, like example code.

(D) What about an abstract class, or several?

    e.g., replacing the XXX TODO of collections.abc.ByteString
    with separate abstract classes for ByteSequence, String,
    ByteString, and ASCIIByteString?

    (ByteString already includes any bytes or bytearray instance,
    so backward compatibility means the String suffix isn't
    sufficient for an opt-in-by-instances class.)

> I'm willing to work on (2) if there is general consensus
> that it would be a good idea. As a rough sketch, I
> would change places like
>
> http://docs.python.org/3/library/stdtypes.html#typebytes
>
> from:
>
> Bytes objects are immutable sequences of single bytes.
> Since many major binary protocols are based on the ASCII
> text encoding, bytes objects offer several methods that
> are only valid when working with ASCII compatible data
> and are closely related to string objects in a variety
> of other ways.
>
> to something more like:
>
> Bytes objects are immutable sequences of single bytes.
>
> A Bytes object could represent anything, and is
> appropriate as the underlying storage for a sound sample
> or image file.
>
> Virtual subclass ASCIIStructuredBytes
>
> One particularly common use of bytes is to represent
> the contents of a file, or of a network message. In
> these cases, the bytes will often represent Text
> *in a specific encoding* and that encoding will usually
> be a superset of ASCII. Rather than create and support
> an ASCIIStructuredBytes subclass, Python simply added
> support for these use cases straight to Bytes objects,
> and assumes that this support simply won't be used
> when it does not make sense. For example, bytes literals
> *could* be used to construct a sound sample, but the
> literals will be far easier to read when they are used
> to represent (encoded) ASCII text, such as "OPEN".

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them. -jJ
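The two uses of bytes that the proposed wording separates can be seen in a short session. A minimal sketch; the sample values are arbitrary and chosen only for illustration.

```python
# Arbitrary binary data: no text interpretation is implied.
sample = bytes([0, 127, 255, 128])

# ASCII-structured data: the text-like methods are meaningful here.
command = b'open'
shouted = command.upper()

# The same methods leave non-ASCII bytes untouched, because they only
# know about the ASCII range.
encoded = 'caf\u00e9'.encode('utf-8')
upper_encoded = encoded.upper()

print(shouted, upper_encoded, sample.isalpha())
```

Nothing stops a caller from applying upper() to the sound-sample style of data; the result is simply not meaningful there, which is the point the documentation sketch tries to make.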
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 1:52 PM, Guido van Rossum wrote: > On Tue, Jan 14, 2014 at 9:45 AM, Chris Barker wrote: >> On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov >> wrote: >>> >>> - Try str(), and do ".encode(‘ascii’, ‘stcict’)” on the result. >> >> >> please no -- that's the source of a lot of pain in py2 now. >> >> having a failure as a result of the value, rather than the type, of an >> object just makes hard-to-test for bugs. Everything will be hunky dory for >> development and testing, then in deployment some idiot ( ;-) ) will pass in >> some non-ascii compatible string and you get failure. And the person who >> gets the failure doesn't understand why, or they wouldn't have passed in >> non-ascii values in the first place... >> >> Ease of porting is nice, but let's not make it easy to port bug-prone code. > > Right. This is a big red flag to me as well. > > I think there is some inherent conflict between the extensible design > of str.format() and the practical needs of people who are actually > going to use formatting operations (either % or .format()) with bytes. > > The *practical* needs are mostly limited to supporting basic number > formatting (decimal, hex, padding) and interpolation of anything that > supports the buffer interface. It would also be nice if you didn't > have to specify the type at all in the format string, i.e. {} should > do the right thing for numbers and (all sorts of) bytes. > > But the way to arrive at this behavior without duplicating a whole lot > of code seems to be to call the existing text-based __format__ API and > convert the result to bytes -- for numbers this should be safe (their > formatting produces just ASCII digits and a selected few other ASCII > characters) but leads to an undesirable outcome for other types -- not > just str but also e.g. lists or dicts containing str instances, since > those call __repr__ on the contained items, and repr() may produce > non-ASCII bytes. 
> > This is why my earlier proposal used ascii(), which is a "nerfed"(*) > version of repr(). This does the right thing for numbers as well as > for many other types (e.g. None, bool) and does something unpleasant > for text strings that is perhaps better than the alternative. > > Which reminds me. Quite a few people have spoken out in favor of loud > failures rather than silent "wrong" output. But I think that in the > specific context of formatting output, there is a long and IMO good > tradition of producing (slightly) wrong output in favor of more strict > behavior. Consider for example what to do when a number doesn't fit in > the given width. Would you rather raise an exception, truncate the > value, or mess up the formatting? All languages newer than Fortran > that I've used have chosen the latter, and I still agree it's a good > idea. Similar with infinities, NaN, or None. (Yes, it's embarrassing > to have a website displaying 'null'. But isn't a 500 even *more* > embarrassing?) > > This doesn't mean I'm insensitive to the argument in favor of loud and > early failure. It's just that I can see both sides of the coin, and > I'm still deciding which argument is more important. > > (*) Gamer slang for a weapon made less dangerous. :-) I think loud and early failure is important for porting while you might still be trying to pound out the previously blurry encode/decode boundaries. In this code str and bytes will be wrong everywhere. Some APIs might return either str or bytes based on the input. Let it fail, find the boundaries, and fix it until it does something useful without failing. And it kindof depends on the context whether it is worse to display weird ephemeral output or write the same weird output to long term storage. I'm not sure what to think about content-dependent failures on protocols that are supposed to be ASCII-only-without-repr-noise. 
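The content-dependent failure under discussion can be reproduced in a few lines. The function proposed_interpolate is a hypothetical stand-in for the "str(), then .encode('ascii', 'strict')" rule quoted above; it is not an actual API.

```python
def proposed_interpolate(prefix, value):
    """Hypothetical stand-in for the str()-then-encode-ASCII proposal."""
    return prefix + str(value).encode('ascii', 'strict')


# Passes in development and testing, where the data happens to be ASCII.
ok = proposed_interpolate(b'Hello ', 'world')

# Fails in deployment: same type, different value.
try:
    proposed_interpolate(b'Hello ', '\u0394')
except UnicodeEncodeError:
    failed_on_value = True
else:
    failed_on_value = False

print(ok, failed_on_value)
```

Both calls pass a str, so no type-based test can tell them apart; only the data decides whether the code fails.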
Re: [Python-Dev] PEP 460 reboot
On Tue, 14 Jan 2014 10:52:05 -0800
Guido van Rossum wrote:
> Would you rather raise an exception, truncate the
> value, or mess up the formatting? All languages newer than Fortran
> that I've used have chosen the latter, and I still agree it's a good
> idea.

Well that's useful when printing out human-readable stuff on stdout,
much less when you're emitting binary data that's supposed to conform
to a well-defined protocol. I expect bytes formatting to be used for
the latter, not the former.

(which also means, actually, that I don't think the fancy formatting
features - alignment, etc. - are useful at all for bytes; but it's
probably ok having them for consistency)

> Similar with infinities, NaN, or None. (Yes, it's embarrassing
> to have a website displaying 'null'. But isn't a 500 even *more*
> embarrassing?)

When it comes to type mismatch, though, an error is raised:

>>> "%d" % object()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: %d format: a number is required, not object

(instead of outputting e.g. repr(id(x)))

Regards

Antoine.
Re: [Python-Dev] PEP 460 reboot
On 1/14/2014 1:11 PM, Jim J. Jewett wrote:
> But in terms of explaining the text model, that
> separation is important enough that
>
> (1) We should be reluctant to strengthen the
>     "it's really just ASCII" messages.
>
> (2) It *may* be worth creating a virtual
>     split in the documentation.
>
> I'm willing to work on (2) if there is general consensus
> that it would be a good idea. As a rough sketch, I
> would change places like
>
> http://docs.python.org/3/library/stdtypes.html#typebytes
>
> from:
>
> Bytes objects are immutable sequences of single bytes.
> Since many major binary protocols are based on the ASCII
> text encoding, bytes objects offer several methods that
> are only valid when working with ASCII compatible data
> and are closely related to string objects in a variety
> of other ways.
>
> to something more like:
>
> Bytes objects are immutable sequences of single bytes.
>
> A Bytes object could represent anything, and is
> appropriate as the underlying storage for a sound sample
> or image file.
>
> Virtual subclass ASCIIStructuredBytes
>
> One particularly common use of bytes is to represent
> the contents of a file, or of a network message. In
> these cases, the bytes will often represent Text
> *in a specific encoding* and that encoding will usually
> be a superset of ASCII. Rather than create and support
> an ASCIIStructuredBytes subclass, Python simply added
> support for these use cases straight to Bytes objects,
> and assumes that this support simply won't be used
> when it does not make sense. For example, bytes literals
> *could* be used to construct a sound sample, but the
> literals will be far easier to read when they are used
> to represent (encoded) ASCII text, such as "OPEN".

I rather like this. Consider opening a tracker issue.

--
Terry Jan Reedy
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 9:45 AM, Chris Barker wrote: > On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov > wrote: >> >> - Try str(), and do ".encode(‘ascii’, ‘stcict’)” on the result. > > > please no -- that's the source of a lot of pain in py2 now. > > having a failure as a result of the value, rather than the type, of an > object just makes hard-to-test for bugs. Everything will be hunky dory for > development and testing, then in deployment some idiot ( ;-) ) will pass in > some non-ascii compatible string and you get failure. And the person who > gets the failure doesn't understand why, or they wouldn't have passed in > non-ascii values in the first place... > > Ease of porting is nice, but let's not make it easy to port bug-prone code. Right. This is a big red flag to me as well. I think there is some inherent conflict between the extensible design of str.format() and the practical needs of people who are actually going to use formatting operations (either % or .format()) with bytes. The *practical* needs are mostly limited to supporting basic number formatting (decimal, hex, padding) and interpolation of anything that supports the buffer interface. It would also be nice if you didn't have to specify the type at all in the format string, i.e. {} should do the right thing for numbers and (all sorts of) bytes. But the way to arrive at this behavior without duplicating a whole lot of code seems to be to call the existing text-based __format__ API and convert the result to bytes -- for numbers this should be safe (their formatting produces just ASCII digits and a selected few other ASCII characters) but leads to an undesirable outcome for other types -- not just str but also e.g. lists or dicts containing str instances, since those call __repr__ on the contained items, and repr() may produce non-ASCII bytes. This is why my earlier proposal used ascii(), which is a "nerfed"(*) version of repr(). 
This does the right thing for numbers as well as for many other types (e.g. None, bool) and does something unpleasant for text strings that is perhaps better than the alternative. Which reminds me. Quite a few people have spoken out in favor of loud failures rather than silent "wrong" output. But I think that in the specific context of formatting output, there is a long and IMO good tradition of producing (slightly) wrong output in favor of more strict behavior. Consider for example what to do when a number doesn't fit in the given width. Would you rather raise an exception, truncate the value, or mess up the formatting? All languages newer than Fortran that I've used have chosen the latter, and I still agree it's a good idea. Similar with infinities, NaN, or None. (Yes, it's embarrassing to have a website displaying 'null'. But isn't a 500 even *more* embarrassing?) This doesn't mean I'm insensitive to the argument in favor of loud and early failure. It's just that I can see both sides of the coin, and I'm still deciding which argument is more important. (*) Gamer slang for a weapon made less dangerous. :-) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
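The difference between repr() and the "nerfed" ascii() is easy to see on Python 3. A minimal sketch with made-up sample values:

```python
text = 'caf\u00e9'

# repr() may contain non-ASCII characters, so encoding it can fail.
try:
    repr(text).encode('ascii')
except UnicodeEncodeError:
    repr_is_ascii = False
else:
    repr_is_ascii = True

# ascii() escapes them, so the result always encodes, quotes included.
safe = ascii(text).encode('ascii')

# Numbers, None and bool come out the way one would expect.
others = [ascii(v).encode('ascii') for v in (200, None, True)]

print(repr_is_ascii, safe, others)
```

The quotes and backslash escape in the encoded text string are the "something unpleasant" mentioned above: the output is valid ASCII but visibly not what the caller meant to send.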
Re: [Python-Dev] PEP 460 reboot
On 01/14/2014 10:11 AM, Jim J. Jewett wrote:
> But in terms of explaining the text model, that
> separation is important enough that
>
> (2) It *may* be worth creating a virtual
>     split in the documentation.

I think (2) is a great idea.

> I'm willing to work on (2) if there is general consensus
> that it would be a good idea. As a rough sketch, I
> would change places like
>
> http://docs.python.org/3/library/stdtypes.html#typebytes
>
> from:
>
> Bytes objects are immutable sequences of single bytes.
> Since many major binary protocols are based on the ASCII
> text encoding, bytes objects offer several methods that
> are only valid when working with ASCII compatible data
> and are closely related to string objects in a variety
> of other ways.
>
> to something more like:
>
> Bytes objects are immutable sequences of single bytes.
>
> A Bytes object could represent anything, and is
> appropriate as the underlying storage for a sound sample
> or image file.
>
> Virtual subclass ASCIIStructuredBytes
>
> One particularly common use of bytes is to represent
> the contents of a file, or of a network message. In
> these cases, the bytes will often represent Text
> *in a specific encoding* and that encoding will usually
> be a superset of ASCII. Rather than create and support
> an ASCIIStructuredBytes subclass, Python simply added
> support for these use cases straight to Bytes objects,
> and assumes that this support simply won't be used
> when it does not make sense. For example, bytes literals
> *could* be used to construct a sound sample, but the
> literals will be far easier to read when they are used
> to represent (encoded) ASCII text, such as "OPEN".

I find the "Virtual subclass" in the title to be confusing, but otherwise
it's great. We should have that even if we do add formatting to bytes, as
that message is even more important then.

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov wrote:
> - Try str(), and do ".encode('ascii', 'strict')" on the result.

Please no -- that's the source of a lot of pain in py2 now. Having a failure as a result of the value, rather than the type, of an object just makes for hard-to-test-for bugs. Everything will be hunky dory for development and testing, then in deployment some idiot ( ;-) ) will pass in some non-ascii compatible string and you get a failure. And the person who gets the failure doesn't understand why, or they wouldn't have passed in non-ascii values in the first place...

Ease of porting is nice, but let's not make it easy to port bug-prone code.

-Chris

> This way *most* of the use cases of python2 will be covered without
> touching the code. So:
>
> - b'Hello {}'.format('world')
>   will be the same as b'Hello ' + str('world').encode('ascii', 'strict')
>
> - b'Hello {}'.format('\u0394') will throw UnicodeEncodeError
>
> - b'Status: {}'.format(200)
>   will be the same as b'Status: ' + str(200).encode('ascii', 'strict')
>
> - b'Hello %s' % ('world',) - the same as the first example
>
> - b'Connection: {}'.format(b'keep-alive') - works
>
> - b'Hello %s' % (b'\xce\x94',) - will fail, not ASCII subset we accept
>
> I think it's OK to check the buffers for ASCII-subset only. Yes, it
> will have some sort of sub-optimal performance, but then, it's quite
> rare when string formatting is used to concatenate huge buffers.
>
> 2. new operators {!b} and %b. These will just use '__bytes__' and
> Py_buffer.
>
> --
> Yury Selivanov
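The objection above, that the failure depends on the value rather than the type, can be seen with plain Python 3 and no proposed API at all. The helper name `implicit_ascii` below is hypothetical; it only mimics the suggested "str(), then strict ASCII" fallback.

```python
def implicit_ascii(value):
    """Mimic the suggested fallback: str() the value, then encode as strict ASCII."""
    return str(value).encode('ascii', 'strict')

# Both arguments are str; only the *value* decides whether this works.
ok = b'Hello ' + implicit_ascii('world')

try:
    implicit_ascii('\u0394')   # GREEK CAPITAL LETTER DELTA, not ASCII
    failed = False
except UnicodeEncodeError:
    failed = True

print(ok, failed)
```

A test suite that only ever passes ASCII names would never exercise the second path, which is the deployment surprise being described.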
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote:
>> Arbitrary binary data and ASCII compatible binary data are *different
>> things* and the only argument in favour of modelling them with a single
>> type is because Python 2 did it that way.

Greg Ewing replied:
> I would say that ASCII compatible binary data is a
> *subset* of arbitrary binary data. As such, a type
> designed for arbitrary binary data is a perfectly good
> way of representing ASCII compatible binary data.

But not when you care about the ASCII-compatible part; then you should use a subclass.

Obviously, it is too late for separating bytes from AsciiStructuredBytes. PBP *may* even mean that just using the "subclass" for everything (and just ignoring the ASCII specific methods when they aren't appropriate) was always the right implementation choice.

But in terms of explaining the text model, that separation is important enough that

(1) We should be reluctant to strengthen the "it's really just ASCII" messages.

(2) It *may* be worth creating a virtual split in the documentation.

I'm willing to work on (2) if there is general consensus that it would be a good idea. As a rough sketch, I would change places like http://docs.python.org/3/library/stdtypes.html#typebytes from:

    Bytes objects are immutable sequences of single bytes. Since many
    major binary protocols are based on the ASCII text encoding, bytes
    objects offer several methods that are only valid when working
    with ASCII compatible data and are closely related to string
    objects in a variety of other ways.

to something more like:

    Bytes objects are immutable sequences of single bytes.

    A Bytes object could represent anything, and is appropriate as the
    underlying storage for a sound sample or image file.

    Virtual subclass ASCIIStructuredBytes

    One particularly common use of bytes is to represent the contents
    of a file, or of a network message. In these cases, the bytes will
    often represent Text *in a specific encoding* and that encoding
    will usually be a superset of ASCII. Rather than create and
    support an ASCIIStructuredBytes subclass, Python simply added
    support for these use cases straight to Bytes objects, and assumes
    that this support simply won't be used when it does not make
    sense. For example, bytes literals *could* be used to construct a
    sound sample, but the literals will be far easier to read when
    they are used to represent (encoded) ASCII text, such as "OPEN".

-jJ

--
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
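As a small illustration of the distinction drawn in the proposed wording, the sketch below uses only existing bytes methods. The variable names are made up for the example: one value is ASCII-structured, the other is arbitrary binary data on which the ASCII-oriented methods are simply not meaningful.

```python
# ASCII-structured bytes: the text-like methods do something useful.
command = b"OPEN"
lowered = command.lower()
starts = command.startswith(b"OP")

# Arbitrary binary data (say, raw sample values): the same methods exist,
# but there are no ASCII letters here for them to act on.
sample = bytes([0x00, 0x7F, 0x80, 0xFF])
unchanged = sample.upper() == sample
alpha = sample.isalpha()

print(lowered, starts, unchanged, alpha)
```

Nothing stops the second use; the methods just go unused, which is the "virtual subclass" idea in miniature.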
Re: [Python-Dev] PEP 460 reboot
On January 14, 2014 at 12:47:35 PM, Brett Cannon (br...@python.org) wrote:
> On Tue, Jan 14, 2014 at 12:29 PM, Yury Selivanov wrote:
> > Brett,
> >
> > I like your proposal. There is one idea I have that could,
> > perhaps, improve it:
> >
> > 1. "%s" and "{}" will continue to work for bytes and bytearray in
> > the following fashion:
> >
> > - check if __bytes__/Py_buffer supported.
> > - if it is, check that the bytes are strictly in the printable
> >   ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc).
> >   Throw an error if the check fails; otherwise concatenate.
> > - Try str(), and do ".encode('ascii', 'strict')" on the result.
> >
> > This way *most* of the use cases of python2 will be covered without
> > touching the code. So:
>
> See, I'm fine with having people update their format strings to
> specify a format spec; it's minor and isn't totally useless as it
> expresses what they mean more explicitly (e.g. "I want this to be an
> int, I want this to be a float, and I want this to be an ASCII
> string" using d, f, and s, respectively). I want people to have to
> make a conscious decision to fall back on an ASCII encoding. What you
> are suggesting is for people to have to make a conscious decision
> **not** to encode to ASCII implicitly, which is what I'm trying to
> avoid with this proposal. My goal is to make it easy to work with
> ASCII, but as an explicit choice, not by default.

I understand. But OTOH, this whole discussion started because of the lack of convenience when working with bytes in py3, plus it's hard to maintain the *same* codebase. Updating the code to include new '%b' operators won't help them.

My proposal is based on the assumption that most of the string formatting people usually use in python2 on 'str' (not 'unicode') is used for ascii. That's the implicit convenience of using bytes that everybody is looking for in py3. It allows having a single codebase, and provides the necessary safety.

Anyways, my 2 cents.

Thank you,
Yury
Re: [Python-Dev] PEP 460 reboot
On Tue, Jan 14, 2014 at 12:29 PM, Yury Selivanov wrote:
> Brett,
>
> I like your proposal. There is one idea I have that could,
> perhaps, improve it:
>
> 1. "%s" and "{}" will continue to work for bytes and bytearray in
> the following fashion:
>
> - check if __bytes__/Py_buffer supported.
> - if it is, check that the bytes are strictly in the printable
>   ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc).
>   Throw an error if the check fails; otherwise concatenate.
> - Try str(), and do ".encode('ascii', 'strict')" on the result.
>
> This way *most* of the use cases of python2 will be covered without
> touching the code. So:

See, I'm fine with having people update their format strings to specify a format spec; it's minor and isn't totally useless as it expresses what they mean more explicitly (e.g. "I want this to be an int, I want this to be a float, and I want this to be an ASCII string" using d, f, and s, respectively). I want people to have to make a conscious decision to fall back on an ASCII encoding. What you are suggesting is for people to have to make a conscious decision **not** to encode to ASCII implicitly, which is what I'm trying to avoid with this proposal. My goal is to make it easy to work with ASCII, but as an explicit choice, not by default.

-Brett
Re: [Python-Dev] PEP 460 reboot
Brett,

I like your proposal. There is one idea I have that could, perhaps, improve it:

1. "%s" and "{}" will continue to work for bytes and bytearray in the following fashion:

 - check if __bytes__/Py_buffer supported.
 - if it is, check that the bytes are strictly in the printable ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc). Throw an error if the check fails; otherwise concatenate.
 - Try str(), and do ".encode('ascii', 'strict')" on the result.

This way *most* of the use cases of python2 will be covered without touching the code. So:

 - b'Hello {}'.format('world')
   will be the same as b'Hello ' + str('world').encode('ascii', 'strict')

 - b'Hello {}'.format('\u0394') will throw UnicodeEncodeError

 - b'Status: {}'.format(200)
   will be the same as b'Status: ' + str(200).encode('ascii', 'strict')

 - b'Hello %s' % ('world',) - the same as the first example

 - b'Connection: {}'.format(b'keep-alive') - works

 - b'Hello %s' % (b'\xce\x94',) - will fail, not ASCII subset we accept

I think it's OK to check the buffers for ASCII-subset only. Yes, it will have some sort of sub-optimal performance, but then, it's quite rare when string formatting is used to concatenate huge buffers.

2. new operators {!b} and %b. These will just use '__bytes__' and Py_buffer.

--
Yury Selivanov
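The three-step rule Yury describes for "%s" and "{}" can be sketched as a plain function. This is only one reading of the proposal: the name `interpolate_s`, the exact set of accepted characters, and the choice of `ValueError` for rejected buffers are all assumptions made for the example, not part of any real API.

```python
import string

# Assumption: "printable ASCII subset" is read as letters, digits,
# punctuation and the space character.
_ALLOWED = frozenset((string.ascii_letters + string.digits +
                      string.punctuation + ' ').encode('ascii'))

def interpolate_s(value):
    """Hypothetical helper: the bytes produced for one %s / {} field."""
    try:
        data = bytes(memoryview(value))          # buffer protocol
    except TypeError:
        if hasattr(type(value), '__bytes__'):
            data = bytes(value)                  # __bytes__ protocol
        else:
            # Fallback step: str(), then strict ASCII encoding.
            return str(value).encode('ascii', 'strict')
    if not _ALLOWED.issuperset(data):
        # Assumed error type; the proposal only says "throw an error".
        raise ValueError('bytes argument is outside the accepted ASCII subset')
    return data

print(b'Hello ' + interpolate_s('world'))
print(b'Status: ' + interpolate_s(200))
print(b'Connection: ' + interpolate_s(b'keep-alive'))
```

Under this reading, `interpolate_s(b'\xce\x94')` is rejected by the subset check and `interpolate_s('\u0394')` raises `UnicodeEncodeError`, matching the listed examples.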
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 5:14 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon wrote:
> > I have been going on the assumption that bytes.format() would change
> > what '{}' meant for itself and would only interpolate bytes. That
> > convenient between Python 2 and 3 since it represents what we want
> > it to (str and bytes under the hood, respectively), so it just falls
> > through. We could also add a 'b' conversion for bytes() explicitly
> > so as to help people not accidentally mix up things in
> > bytes.format() and str.format(). But I was not suggesting adding a
> > specific format spec for bytes but instead making bytes.format()
> > just do the .encode('ascii') automatically to help with
> > compatibility when a format spec was present. If people want fancy
> > formatting for bytes they can always do it themselves before calling
> > bytes.format().
>
> This seems hastily written (e.g. verb missing :-), and I'm not clear
> on what you are (or were) actually proposing. When exactly would
> bytes.format() need .encode('ascii')?
>
> I would be happy to wait a few hours or days for you to write it up
> clearly, rather than responding in a hurry.

Sorry about that. Busy day at work + trying to stay on top of this entire conversation was a bit tough. Let me try to lay out what I'm suggesting for bytes.format() in terms of how it changes http://docs.python.org/3/library/string.html#format-string-syntax for bytes.

1. New conversion operator of 'b' that operates as PEP 460 specifies (i.e. tries to get a buffer, else calls __bytes__). The default conversion changes from 's' to 'b'.

2. Use of the conversion field adds the extra step of calling str.encode('ascii', 'strict') on the result returned from calling __format__().

That's it. So point 1 means that the following would work in Python 3.5::

    b'Hello, {}, how are you?'.format(b'Guido')
    b'Hello, {!b}, how are you?'.format(b'Guido')

It would produce an error if you used a text argument for 'Guido' since str doesn't define __bytes__ or a buffer. That gives the EIBTI group their bytes.format() where nothing magical happens.

For point 2, let's say you have the following in Python 2::

    'I have {} bottles of beer on the wall'.format(10)

Under my proposal, how would you change it to get the same result in Python 2 and 3?::

    b'I have {:d} bottles of beer on the wall'.format(10)

In Python 2 you're just being more explicit about the format, otherwise it's the same semantics as today. In Python 3, though, this would translate into (under the hood)::

    b'I have {} bottles of beer on the wall'.format(format(10, 'd').encode('ascii', 'strict'))

This leads to the same bytes value in Python 2 (since it's just a string) and in Python 3 (as everything accepted by bytes.format() is either bytes already or converted to ASCII bytes by encoding). While Python 2 users would need to make sure they used a format spec to get the same result in both Python 2 and 3 for ASCII bytes, it's a minor change which also makes the format more explicit, so it's not an inherently bad thing. And those that don't want to utilize the automatic ASCII encoding can just not use a format spec in the format string and pass in bytes directly (i.e. call __format__() themselves and then call str.encode() on their own).

So PBP people get to have a simple way to use bytes.format() in Python 2 and 3 when dealing with things that can be represented as ASCII (just as the bytes methods allow for currently). I think this covers your desire to have numbers and anything else that can be represented as ASCII be supported for easy porting, while covering my desire that any automatic encoding is clearly explicit in the format string and in no way special-cased for only some types (the introduction of a 'c' converter from PEP 460 is also fine with me).

How you would want to translate this proposal to the % operator I'm not sure, since it has been quite a while since I last seriously used it and so I don't think I'm in a good position to propose a shift for it.
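Since bytes.format() as described here does not exist, the two points can only be sketched per field. The function below is a hypothetical stand-in (the name `format_field` and its signature are invented for illustration) and it assumes "a format spec is present" is the trigger for the ASCII step, as in the {:d} example.

```python
def format_field(value, format_spec=''):
    """Hypothetical helper: the bytes produced for one bytes.format() field."""
    if format_spec:
        # Point 2: a format spec opts in to text formatting followed by a
        # strict ASCII encode of whatever __format__() returned.
        return format(value, format_spec).encode('ascii', 'strict')
    # Point 1: default 'b' conversion, buffer first, then __bytes__.
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    if hasattr(type(value), '__bytes__'):
        return bytes(value)
    raise TypeError('{} is not bytes-like and defines no __bytes__'
                    .format(type(value).__name__))

print(b'Hello, ' + format_field(b'Guido') + b', how are you?')
print(b'I have ' + format_field(10, 'd') + b' bottles of beer on the wall')
```

With no format spec, a str argument such as `format_field('Guido')` raises `TypeError`, which is the "nothing magical happens" behaviour; the encode step only ever runs when the format string asks for it.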
Re: [Python-Dev] PEP 460 reboot
On 14 January 2014 19:54, Stephen J. Turnbull wrote:
> Guido van Rossum writes:
> > And that is precisely my point. When you're using a format string,
> > all of the format string (not just the part between { and }) had
> > better use ASCII or an ASCII superset. And this (rightly)
> > constrains the output to an ASCII superset as well.
>
> Except that if you interpolate something like Shift JIS, much of the
> ASCII really isn't ASCII. That's a general issue, of course, if you
> do something that requires iterated format strings, but it's far more
> likely to appear to work most of the time with those encodings.
>
> Of course you can say "if it hurts, don't do that", but

Right, that's the danger I was worried about, but the problem is that there's at least *some* minimum level of ASCII compatibility that needs to be assumed in order to define an interpolation format at all (this is the point I originally missed).

For printf-style formatting, it's % along with the various formatting characters and other syntax (like digits, parentheses, variable names and "."); with the format method it's braces, brackets, colons, variable names, etc. The mini-language parser has to assume an encoding in order to interpret the format string, and that's *all* done assuming an ASCII compatible format string (which must make life interesting if you try to use an ASCII incompatible coding cookie for your source code - I'm actually not sure what the full implications of that *are* for bytes literals in Python 3).

The one remaining way I could potentially see a formatb method working is along the lines of what Glenn (I think) suggested: just like struct definitions, the formatb specifier would have to consist *solely* of substitution fields. However, that's getting awfully close to being just an alternate spelling for the struct module or bytes.join at that point, which hardly makes for a compelling case to add two new methods to a builtin type.
Given that one of the concepts with the Python 3 transition was to take certain problematic constructs (like ASCII compatible interpolation directly to binary without a separate encoding step) away and decide whether or not we were happy to live without them, I think this one has proven to have sufficient staying power to finally bring it back in Python 3.5 (especially given the gain in lowering the barrier to porting Python 2 code that makes heavy use of interpolation to ASCII compatible binary formats). It's certainly a decision that has its downsides, with the potential impact on users of ASCII incompatible encodings (mostly in Asia) being the main one, but I think the increased convenience in working with ASCII compatible binary protocols and file formats is worth the cost. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
Guido van Rossum writes: > Of course, nobody in their right mind would use a format string > containing UTF-16 or EBCDIC. How about Shift JIS and Big 5 (traditionally "mandated by Microsoft" in their respective regions, with Shift JIS still overwhelmingly popular) and GB* ("GB18030 is not just a good idea, It's The Law")? Are the Japanese and Chinese crazy by definition? This is where I get the willies -- not that you think anybody is crazy by definition, but because I personally have to live with people who use crazy encodings for interoperability reasons, in fact about half the text I process daily for work is in those encodings. Anyway, the thought makes me shiver. GB2312 text may be encoded as EUC-CN, in which case it is ASCII-compatible, so no problem. I'm not sure if that's the encoding typically denoted by "GB2312" in email, though, and in any case it's irrelevant as most emails claiming "charset=GB2312" I receive nowadays include characters from the extension repertoires of GBK or GB18030. Shift JIS, Big 5, and GBK manage to avoid non-ASCII-compatible use of all characters significant in Python %-formatting (yay!), but .format is right out because {} are used. GB18030 in principle uses far more of the code space, including all of the syntactically significant punctuation, but in practice I don't know how many of those characters are actually assigned, let alone used. > And that is precisely my point. When you're using a format string, > all of the format string (not just the part between { and }) had > better use ASCII or an ASCII superset. And this (rightly) > constrains the output to an ASCII superset as well. Except that if you interpolate something like Shift JIS, much of the ASCII really isn't ASCII. That's a general issue, of course, if you do something that requires iterated format strings, but it's far more likely to appear to work most of the time with those encodings. 
Of course you can say "if it hurts, don't do that", but
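The remark that .format is "right out" for Shift JIS because of {} can be checked with Python's bundled `shift_jis` codec. The sketch below uses KATAKANA LETTER BO (U+30DC), whose second byte in Shift JIS is 0x7B, the same byte as ASCII '{'; the sample string is made up for the example.

```python
bo = '\u30dc'                          # KATAKANA LETTER BO
encoded = bo.encode('shift_jis')
print(encoded)                         # two bytes; the trail byte is 0x7B

# A byte-level scan for '{' cannot tell that trail byte from a real brace.
# The template is "botan: " in katakana followed by one genuine {} field.
template = '\u30dc\u30bf\u30f3: '.encode('shift_jis') + b'{}'
brace_hits = template.count(b'{')
print(brace_hits)                      # one real field, one false hit
```

A parser working on the raw bytes would see an opening brace in the middle of a character, which is why such text only "appears to work most of the time".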
Re: [Python-Dev] PEP 460 reboot
On 1/14/2014 12:03 AM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 6:25 PM, Terry Reedy wrote:
> >     byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
> >     b'\x00\x01\x02abcdef'
> >
> > re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only
> > ascii bias is the one already present in the representation of
> > bytes, and the fact that Python code must have an ascii-compatible
> > encoding.
>
> I don't think it's that easy. Just searching for '{' is enough to
> break in surprising ways

I see your point. The punning problem (between a byte being both itself and a special indicator character) is worse with bytes formats than the similar pun with text, and the potential for mysterious bugs greater. (This is related to why we split 'text' and 'bytes' to begin with.)

With text, we break the pun by doubling the character to escape the special meaning. This works because 1) % and { are relatively rare in text, 2) %% and {{ are grammatically incorrect, and 3) %, {, and especially %% and {{ stand out visually.

With bytes, 1) there is no reason why 37 (%) and 123 ({) should be rare, 2) there is no grammatical rule against the sequences 37, 37 or 123, 123, and 3) hex escapes \x25 and \x7b, which might appear in a bytes format, do not stand out as needing doubling.

My example above breaks if b'\x00' is replaced with b'\x7b'. Even if a doubling and undoubling rule were added, re.split could not be used to split the format bytes.

--
Terry Jan Reedy
Re: [Python-Dev] PEP 460 reboot
Glenn Linderman wrote:
> A mechanism could be defined where "format string" would only contain
> format specifications, and any other text would be considered an error.

Someone already did -- it's called struct.pack(). :-)

--
Greg
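For readers who have not used it, here is what "a format string that contains only format specifications" looks like with struct.pack(). The record layout (a 16-bit length, three raw bytes, a terminator byte) is invented for the example.

```python
import struct

payload = b'abc'
# '>' big-endian with no padding, 'H' unsigned 16-bit integer,
# '3s' three raw bytes, 'B' one unsigned byte.
packet = struct.pack('>H3sB', len(payload), payload, 0)
print(packet)

# The same format string drives the reverse operation.
length, body, terminator = struct.unpack('>H3sB', packet)
print(length, body, terminator)
```

The format contributes no bytes of its own to the result; every output byte comes from an argument, so no assumption about the encoding of literal text is needed.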
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 9:25 PM, Nick Coghlan wrote:
> since this observation makes it clear that there's *no* coherent way to
> offer a pure binary interpolation API - the only general purpose
> combination mechanism for segments of binary data that can avoid making
> assumptions about the encodings of metacharacters is simple
> concatenation.

That's almost true, and I'm glad that you, Guido, and all of us can understand that the currently defined python2 and python3 formatting syntaxes contain an inherent ASCII assumption, just like many internet protocols. The bitter fight is over :)

However, your statement above isn't 100% accurate, so just for the pedantry of it, I'll point out why.

A mechanism could be defined where "format string" would only contain format specifications, and any other text would be considered an error. The format string could have an explicit or a defined encoding; there would be no need to make an assumption about its encoding. And since it would not contain text except for format specifications, it would only be used as a rule-book on how to interpret the parameters, contributing no text of its own to the result.

This wouldn't solve the problem at hand, though, which is to provide a nice migration path from Python 2 to Python 3 for code that uses ASCII-based format strings that do contribute text as well as include parameter data.

Whether such a technique would be more useful than simple concatenation (or complex concatenation such as join) remains to be seen, and possibly discussed, if anyone is interested, but it probably would belong on python-ideas, since it would not address an immediate porting issue.

Assuming an ASCII-in-bytes format string (but with no contributed text to the result) one could write something like

    b"%{koi7}s%{00}v%{big5}d%{00}v%{ShiftJIS}s%{}v%b" / ( cyrillic, len( blob ), japanese, blob )

So the encodings to be applied to each of the input parameters could be explicitly specified.
The %{00}v stuff would be interpolated into the output... expressed in ASCII as hex, two characters per byte. Note that the number uses Chinese digits in the big5 encoding, but I don't know if the Chinese even use their own digits or ASCII ones these days, or what base they use, I guess it was the Babylonians that used base 60 from which our timekeeping and angular measures were derived. The example shows a null byte or two between items in the output. So there _could be_ a coherent way to offer an interpolation mechanism that is pure binary, and allows selection of encoding of str data, if and as needed. One specifier could even be an encoding to apply to any format specifiers that don't include an encoding, so in the typical case of dealing with a single language output, the appropriate encoding could be set at the beginning of the format specification and overridden by particular specifiers if need be. But while there _could be_ such an interpolation mechanism, it isn't compatible with Python 2, and the jury hasn't decided whether such a thing is sufficiently more useful than concatenation to be worth implementing. A different operator might be required, or the whole thing could be a function instead of an operator, with a similar format specification, or one more like the minilanguage used with format in python 3. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 460 reboot
On 14 January 2014 15:03, Guido van Rossum wrote:
> I don't think it's that easy. Just searching for '{' is enough to break in surprising ways unless the format string is encoded in an ASCII superset. I can think of two easy examples to illustrate this (they're similar to the example I posted here before about the essential ASCII-ness of %c).
>
> First, let's consider EBCDIC. The '{' character in ASCII is hex 7B (decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC) and that is the '#' character in EBCDIC. Surprised yet?
>
> Next, let's consider UTF-16. This encoding uses two bytes per character (except for surrogates), so any character whose top half or bottom half happens to be 7B hex will cause an incorrect hit for your regular expression. Ouch.
>
> Of course, nobody in their right mind would use a format string containing UTF-16 or EBCDIC. And that is precisely my point. When you're using a format string, all of the format string (not just the part between { and }) had better use ASCII or an ASCII superset. And this (rightly) constrains the output to an ASCII superset as well.

In case it got lost amongst the various threads, this was the argument that finally convinced me that interpolation *inherently* assumes an ASCII compatible encoding: the assumption of ASCII compatibility is embedded in the design of the formatting syntax for both printf-style formatting and the format methods. That places interpolation support squarely in the same category as all the other bytes methods that inherently assume ASCII, and thus remains consistent with the Python 3 text model.

Originally I was thinking that the ASCII assumption applied only if one of the passed in *values* needed to be implicitly encoded as ASCII, without accounting for the fact that the parser itself assumed ASCII compatibility when searching for formatting metacharacters. Once Guido pointed out that oversight on my part, my objections collapsed, since this observation makes it clear that there's *no* coherent way to offer a pure binary interpolation API - the only general purpose combination mechanism for segments of binary data that can avoid making assumptions about the encodings of metacharacters is simple concatenation.

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
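
[Editorial sketch] The two cases Guido describes can be checked directly. This is only an illustration: it assumes Python's cp037 codec as a representative EBCDIC code page and picks U+017B as one character whose low byte is 7B hex. The variable names are made up for the example.

```python
# The single byte 0x7B is '{' only under ASCII-compatible encodings.
BRACE = b'{'

# EBCDIC (assumption: the cp037 code page): byte 0x7B is '#',
# and '{' itself is stored as a different byte.
ebcdic_hash = '#'.encode('cp037')
ebcdic_brace = '{'.encode('cp037')

# UTF-16: U+017B has 0x7B as its low byte, so a byte-level search
# for '{' reports a hit although the text contains no brace.
utf16_text = '\u017b'.encode('utf-16-le')

print(ebcdic_hash == BRACE)     # '#' in EBCDIC looks like '{' to a bytes search
print(BRACE in ebcdic_brace)    # a real EBCDIC '{' is not found
print(BRACE in utf16_text)      # false hit inside a UTF-16 character
```

A bytes-level parser that searches for b'{' therefore finds the wrong thing in both encodings, which is the point being made above.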
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 9:03 PM, Guido van Rossum wrote:
> Of course, nobody in their right mind would use a format string containing UTF-16 or EBCDIC. And that is precisely my point. When you're using a format string, all of the format string (not just the part between { and }) had better use ASCII or an ASCII superset. And this (rightly) constrains the output to an ASCII superset as well.

+1000
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 6:25 PM, Terry Reedy wrote: > On 1/13/2014 4:32 PM, Guido van Rossum wrote: > >> I will doggedly keep posting to this thread rather than creating more >> threads. > > Please permit to to doggedly keep pointing you toward the possible solution > I posted on the tracker last October. You're talking about http://bugs.python.org/issue3982 right? >> But formatb() feels absurd to me. PEP 460 has neither a precise >> specification or any actual examples, so I can't tell whether the > > Two days ago, I reposted byteformat() here on pydev with a precise text > specification added to the code, and with an expanded test example. I have > just added another example based on your question below. That new example hasn't made it to my inbox yet, and I don't see anything very recent in that issue either. But I don't think it matters. >> intention is that the format string can *only* contain {...} sequences >> or whether it can also contain "regular" characters. Translating to >> formatb(), my question comes down to the legality of the following >> example: >> >>b'Hello, {}'.formatb(name) # Where name is some bytes object >> >> If this is allowed, it reintroduces the ASCII bias (since the >> substring 'Hello' is clearly ASCII). > > Since byteformat() uses re to find {} replacement fields, it > only has such ascii bias as re has, which I believe is not much, if any. As > far as re and byteformat are concerned, everything outside of the {...} > fields is uninterpreted bytes. As far as bytes.join is concerned, both > joiner and joined are uninterpreted bytes. > byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',)) > b'\x00\x01\x02abcdef' > > re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii bias > is the one already present is the representation of bytes, and the fact that > Python code must have an ascii-compatible encoding. I don't think it's that easy. 
Just searching for '{' is enough to break in surprising ways unless the format string is encoded in an ASCII superset. I can think of two easy examples to illustrate this (they're similar to the example I posted here before about the essential ASCII-ness of %c). First, let's consider EBCDIC. The '{' character in ASCII is hex 7B (decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC) and that is the '#' character in EBCDIC. Surprised yet? Next, let's consider UTF-16. This encoding uses two bytes per character (except for surrogates), so any character whose top half or bottom half happens to be 7B hex will cause an incorrect hit for your regular expression. Ouch. Of course, nobody in their right mind would use a format string containing UTF-16 or EBCDIC. And that is precisely my point. When you're using a format string, all of the format string (not just the part between { and }) had better use ASCII or an ASCII superset. And this (rightly) constrains the output to an ASCII superset as well. > The advantage of > byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',)) > over directly writing > b''.join([b'\x00', b'\x01', b'\x02', b'abc', b'def'] > is that one does not have to manually split the presumably constant template > into chunks and interleave them with the presumable variable chunks. Yes. And that's a great feature when the output is a known encoding that's an ASCII superset. But a terrible idea when the encoding is unconstrained. > Here is the example that I used for testing, including non-blank format > specs. > > bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: > {:7.2f}; end" > objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3) > result = byteformat(bformat, objects) > b'bytes: abc; bytearray: def; unicode: ghi; int: 123; float: 12.30; end' No surprises here. And in fact I think this is the desired outcome. > The additional advantage here is the automatic encoding of formatted strings > to bytes. 
As posted, byteformat() uses the str.encode defaults > (encoding='utf-8', errors='strict'). But as I said in the post, these could > become parameters to the function that are passed on to str.encode. As long as that encoding is an ASCII superset this might be useful. > The design reuses re.split, bytes.join, format, and the format > specification. By re-using the format-spec as is, the only new thing to > learn is that blank specs correspond to bytes instead of strings. This is > easier to design, implement, and learn than if the format-spec is limited to > disallow some things (after much bike-shedding over what to eliminate ;-). > > I would appreciate your comment on this proposal. It seems to be a bit weak on the bytes encoding -- I would like to see an explicit format code for those (your code looks a little clever in this area). Others will probably object that it makes it too easy to encode text by default, although I'm not sure it matters, given that the behavior is quite different from Python 2's broken treatment of interpolating Unicode in an 8-bit f
Re: [Python-Dev] PEP 460 reboot
On 2014-01-14 02:25, Terry Reedy wrote:
> On 1/13/2014 4:32 PM, Guido van Rossum wrote:
>> I will doggedly keep posting to this thread rather than creating more threads.
>
> Please permit me to doggedly keep pointing you toward the possible solution I posted on the tracker last October.
>
>> But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the
>
> Two days ago, I reposted byteformat() here on pydev with a precise text specification added to the code, and with an expanded test example. I have just added another example based on your question below.
>
>> intention is that the format string can *only* contain {...} sequences or whether it can also contain "regular" characters. Translating to formatb(), my question comes down to the legality of the following example:
>>
>>     b'Hello, {}'.formatb(name)  # Where name is some bytes object
>>
>> If this is allowed, it reintroduces the ASCII bias (since the substring 'Hello' is clearly ASCII).
>
> Since byteformat() uses re to find {} replacement fields, it only has such ascii bias as re has, which I believe is not much, if any. As far as re and byteformat are concerned, everything outside of the {...} fields is uninterpreted bytes. As far as bytes.join is concerned, both joiner and joined are uninterpreted bytes.
>
>     >>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
>     b'\x00\x01\x02abcdef'
>
[snip]

Couldn't that suffer from false positives, i.e. binary data that happens to match? (Rare, yes, but possible.)
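
[Editorial sketch] The false-positive concern can be shown with a two-byte binary value. The regular expression here is an assumption standing in for whatever pattern byteformat() actually uses; it is not taken from the posted code.

```python
import re
import struct

# Hypothetical replacement-field pattern: '{', anything but braces, '}'.
FIELD = re.compile(rb'\{([^{}]*)\}')

# A little-endian 16-bit integer whose two bytes are 0x7B and 0x7D.
payload = struct.pack('<H', 0x7D7B)

# The binary value is indistinguishable from an empty replacement field.
pieces = FIELD.split(payload)
print(payload, pieces)
```

Any template that embeds such a value as literal bytes would have it consumed as a field, so the caller would need some escaping convention for these two byte values.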
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 5:14 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon wrote:
>> I have been going on the assumption that bytes.format() would change what '{}' meant for itself and would only interpolate bytes. That convenient between Python 2 and 3 since it represents what we want it to (str and bytes under the hood, respectively), so it just falls through. We could also add a 'b' conversion for bytes() explicitly so as to help people not accidentally mix up things in bytes.format() and str.format(). But I was not suggesting adding a specific format spec for bytes but instead making bytes.format() just do the .encode('ascii') automatically to help with compatibility when a format spec was present. If people want fancy formatting for bytes they can always do it themselves before calling bytes.format().
>
> This seems hastily written (e.g. verb missing :-), and I'm not clear on what you are (or were) actually proposing. When exactly would bytes.format() need .encode('ascii')?
>
> I would be happy to wait a few hours or days for you to write it up clearly, rather than responding in a hurry.

I already posted my version of this proposal, with spec and example, in the thread "byteformat() proposal: please critique", and I added more in response to your earlier post.

--
Terry Jan Reedy
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 4:32 PM, Guido van Rossum wrote:
> I will doggedly keep posting to this thread rather than creating more threads.

Please permit me to doggedly keep pointing you toward the possible solution I posted on the tracker last October.

> But formatb() feels absurd to me. PEP 460 has neither a precise specification or any actual examples, so I can't tell whether the

Two days ago, I reposted byteformat() here on pydev with a precise text specification added to the code, and with an expanded test example. I have just added another example based on your question below.

> intention is that the format string can *only* contain {...} sequences or whether it can also contain "regular" characters. Translating to formatb(), my question comes down to the legality of the following example:
>
>     b'Hello, {}'.formatb(name)  # Where name is some bytes object
>
> If this is allowed, it reintroduces the ASCII bias (since the substring 'Hello' is clearly ASCII).

Since byteformat() uses re to find {} replacement fields, it only has such ascii bias as re has, which I believe is not much, if any. As far as re and byteformat are concerned, everything outside of the {...} fields is uninterpreted bytes. As far as bytes.join is concerned, both joiner and joined are uninterpreted bytes.

    >>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
    b'\x00\x01\x02abcdef'

re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii bias is the one already present in the representation of bytes, and the fact that Python code must have an ascii-compatible encoding.

The advantage of
    byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
over directly writing
    b''.join([b'\x00', b'\x01', b'\x02', b'abc', b'def'])
is that one does not have to manually split the presumably constant template into chunks and interleave them with the presumably variable chunks.

Here is the example that I used for testing, including non-blank format specs.

    bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: {:7.2f}; end"
    objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
    result = byteformat(bformat, objects)
    >>> b'bytes: abc; bytearray: def; unicode: ghi; int:   123; float:   12.30; end'

The additional advantage here is the automatic encoding of formatted strings to bytes. As posted, byteformat() uses the str.encode defaults (encoding='utf-8', errors='strict'). But as I said in the post, these could become parameters to the function that are passed on to str.encode.

The design reuses re.split, bytes.join, format, and the format specification. By re-using the format-spec as is, the only new thing to learn is that blank specs correspond to bytes instead of strings. This is easier to design, implement, and learn than if the format-spec is limited to disallow some things (after much bike-shedding over what to eliminate ;-).

I would appreciate your comment on this proposal.

--
Terry Jan Reedy
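
[Editorial sketch] The byteformat() code itself is not included in this message, so the following is only a guess at the shape of the design described above (re.split, format, str.encode, bytes.join). The name byteformat_sketch, the regular expression, and the handling of the leading colon are all assumptions; field names and conversions are not supported.

```python
import re

# Hypothetical field pattern: '{', anything but braces, '}'.
_FIELD = re.compile(rb'\{([^{}]*)\}')


def byteformat_sketch(template, objects, encoding='utf-8', errors='strict'):
    """Interleave literal chunks of `template` with formatted `objects`."""
    # With one capture group, re.split alternates literal chunk / field body.
    chunks = _FIELD.split(template)
    values = iter(objects)
    out = []
    for index, chunk in enumerate(chunks):
        if index % 2 == 0:
            out.append(chunk)  # literal bytes are copied untouched
            continue
        obj = next(values)
        spec = chunk[1:] if chunk.startswith(b':') else chunk
        if not spec:
            # Blank spec: the argument must already be bytes-like.
            if not isinstance(obj, (bytes, bytearray)):
                raise TypeError('blank field requires bytes or bytearray')
            out.append(bytes(obj))
        else:
            # Non-blank spec: format as text, then encode the result.
            text = format(obj, spec.decode('ascii'))
            out.append(text.encode(encoding, errors))
    return b''.join(out)


simple = byteformat_sketch(b'\x00{}\x02{}def', (b'\x01', b'abc'))
bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: {:7.2f}; end"
mixed = byteformat_sketch(bformat, (b'abc', bytearray(b'def'), 'ghi', 123, 12.3))
print(simple)
print(mixed)
```

Passing encoding and errors through to str.encode follows the suggestion in the message that these could become parameters of the function.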
Re: [Python-Dev] PEP 460 reboot
On 2014-01-13 21:51, Guido van Rossum wrote:
> Terminology. Let's use the official terminology rather than making stuff up. The docs at http://docs.python.org/3/library/string.html#formatspec use the following terminology:
>
> Replacement field: {...}; contains field name, conversion, format spec in that order, all optional.
>
> Field name: either a decimal integer (referring to an argument by position) or an identifier (by name), or omitted (uses the next available position).
>
> Conversion: !r, !s, !a; these apply repr(), str(), ascii() to the value, and then the format spec applies to the resulting string.

If all you wanted to do was interpolate bytes then you could define a new conversion !b. This would, however, mean that the format spec would be applied to bytes.

> Format spec: colon, bunch of stuff, type; the type is a letter such as d (decimal) or s (string), and the stuff between the colon and the type is used to specify field width, alignment, sign, padding and such.

Also, {:b} means binary (i.e. numbers in base 2). I'm not sure what this leaves for interpolating bytes if we don't want to use {:s}.

The docs at http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting don't show %b so it could still be used there, but it would be nicer to be consistent.
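
[Editorial sketch] The terminology can be seen in one str.format() call. This uses only existing str behaviour; the variable names are illustrative.

```python
value = 'ab'

# Replacement field with all three parts:
#   field name 0, conversion !r (apply repr()), format spec >8 (right-align, width 8).
padded_repr = '{0!r:>8}'.format(value)

# For integers, the 'b' type in the format spec already means base 2.
binary = '{:b}'.format(5)

print(repr(padded_repr))
print(binary)
```

The second call shows why 'b' is awkward as a bytes code in the format() mini-language: it is already taken for binary integers.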
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 3:13 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon wrote:
>> On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy wrote:
>>> I personally would not add 'bytes % whatever'.
>>
>> Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away.
>
> Well, % has some very strong arguments in its favor still -- for

If I shift from a 'personal' to a 'BDFL' viewpoint, I have to agree.

> example, the sheer amount of code that currently uses it, the fact that it's as close as we get to a cross-language standard, and the

This much I know.

> fact that nobody wants to tackle its use in the logging module (since logger objects are often shared between packages that don't know about each other).

This I did not know.

> Anyway, the % or .format() issue seems completely orthogonal to the issues that get people riled up (which are mostly about whether using either implies some kind of ASCII compatibility).

A possibly important difference between '%s' and '{:s}' is that the 's' is required in the former and optional in the latter. So in byteformat(), b'{:s}' continues to format a string (as encoded bytes) while '{:}' 'formats' a byte without having to invent a new code that does not exist in 2.7. That particular solution to "does 's' mean bytes or string" does not work for % formatting. (And that lack, in turn, is part of what lay behind the inclination expressed above.)

For % formatting, I would be inclined to start with 'what does Mercurial need?' or even 'does anything even really work for hg?'. Hg is part of our development ecosystem, and we have an hg rep who expressed a desire to experiment.

Terry
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote:
> Arbitrary binary data and ASCII compatible binary data are *different things* and the only argument in favour of modelling them with a single type is because Python 2 did it that way.

I would say that ASCII compatible binary data is a *subset* of arbitrary binary data. As such, a type designed for arbitrary binary data is a perfectly good way of representing ASCII compatible binary data.

What are you saying -- that there should be one type for ASCII compatible binary data, and another type for all binary data *except* when it's ASCII compatible? That makes no sense to me.

> The Python 3 text model was built on the notion of "no implicit encoding and decoding"

This is nonsense. There are plenty of implicit encoding and decoding operations in Python 3.

When you open a text file, it gets an encoding. After that, anything you write to it is implicitly encoded using that encoding. There's even a default encoding when you open the file, so you don't even have to be explicit about that.

It's more correct to say that it was built on the notion of using separate types for encoded and decoded data, so that it's *possible* to keep track of the difference. It doesn't mean that there can't be conversions between the two types that are implicit to one degree or another.

--
Greg
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote:
> so the latter would be less of an attractive nuisance when writing code that needs to handle arbitrary binary formats and can't assume ASCII compatibility.

Hang on a moment. What do you mean by code that "handles arbitrary binary formats"? As far as I can see, the proposed features are for code that handles *particular* binary formats. Ones with well-defined fields that are specified to contain ASCII-encoded text.

It's the programmer's responsibility to make sure that the fields he's treating as ASCII really do contain ASCII, just as it's his responsibility to make sure he reads and writes a text file using the correct encoding.

Now, it's possible that if you were working from an incomplete spec and some examples, you might be led to believe that a particular field was ASCII when in fact it was some ASCII superset such as latin1 or utf8. In that case, if you parsed it assuming ASCII, you would get into trouble of some sort with bytes greater than 127.

However, the proposed formatting operations are concerned only with *generating* binary data, not parsing it. Under Guido's proposed semantics, all of the ASCII formatting operations are guaranteed to produce valid ASCII, regardless of what types or values are thrown at them. So as long as the field's true encoding is something ASCII-compatible, you will always generate valid data.

> Because I *want to use* the PEP 460 binary interpolation API, but wouldn't be able to use Guido's more lenient proposal, as it is a bug magnet in the presence of arbitrary binary data.

Where exactly is this "arbitrary binary data" that you keep talking about? The only place that arbitrary bytes comes into the picture is through b"%s" % b"...", and that's defined to just pass the bytes straight through. I don't see how that could attract any bugs that weren't already present in the data being interpolated.

> The LHS may or may not be tainted with assumptions about ASCII compatibility, which means it effectively *is* tainted with such assumptions, which means code that needs to handle arbitrary binary data can't use it and is left without a binary interpolation feature.

If I understand correctly, what concerns you here is that you can't tell by looking at b"%s" % x whether it encodes anything as ASCII without knowing the type of x.

I'm not sure how serious a problem that would be. Most of the time I think it will be fairly obvious from the purpose of the code what the type of x is *intended* to be. If it's not actually that type, then clearly there's a bug somewhere.

Of all such possible bugs, the one most likely to arise due to a confusion in the programmer's mind between text and bytes would be for x to be a string when it was meant to be bytes or vice versa. Due to the still-very-strong separation between text and bytes in Py3, this is unlikely to happen without something else blowing up first. Even if it does happen, it won't result in a data-dependent failure.

If b"%s" % 'hello' were defined to interpolate 'hello'.encode('ascii'), then there *would* be cause for concern. But this is not what Guido proposes -- instead he proposes interpolating ascii('hello') == "'hello'". This is almost certainly *never* what the file spec calls for, so you'll find out about it very soon one way or another.

Effectively this means that b"%s" % x where x is a string is useless, so I'd much prefer it to just raise an exception in that case to make the failure immediately obvious. But either way, you're not going to end up with a latent failure waiting for some non-ASCII data to come along before you notice it.

To summarise, I think the idea of binary format strings being too "tainted" for a program that does not want to use ASCII formatting to rely on is mostly FUD.

--
Greg
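
[Editorial sketch] The difference between the two candidate behaviours for a str argument can be reproduced with plain str methods, without relying on any particular bytes-formatting implementation. The variable names are illustrative.

```python
plain = 'hello'
accented = 'h\xe9llo'

# Option 1: implicit .encode('ascii'). Works for ASCII-only data ...
via_encode = plain.encode('ascii')

# ... but fails only when non-ASCII data eventually shows up (a latent,
# data-dependent failure).
try:
    accented.encode('ascii')
    latent_failure = False
except UnicodeEncodeError:
    latent_failure = True

# Option 2: ascii(), as in Guido's proposal. Never fails, but the quotes
# make the mistake visible on the very first run.
via_ascii = ascii(plain).encode('ascii')
via_ascii_accented = ascii(accented).encode('ascii')

print(via_encode, latent_failure, via_ascii, via_ascii_accented)
```

This is the sense in which the ascii() behaviour cannot hide a bug until unusual data arrives: the output is visibly wrong for every input.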
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:59 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman wrote:
>> On 1/13/2014 12:09 PM, Guido van Rossum wrote:
>>> Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimic a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3.
>>>
>>> If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it.
>>>
>>> If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a.
>>
>> %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>
> Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII).
>
> E.g. I'm sure you can find live code doing something like
>
>     headers.append('Content-Length: %s\r\n' % len(body))

That's portably fixable by switching to %d... or by adding .encode('ascii')
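
[Editorial sketch] One way to read the .encode('ascii') suggestion is to format as text and encode the finished header line. The body value below is made up for the example; nothing here depends on bytes supporting % at all.

```python
body = b'hello world'
headers = []

# Format as text first, then encode the result as ASCII.
headers.append(('Content-Length: %d\r\n' % len(body)).encode('ascii'))

# %s with a number yields the same text, so existing %s code encodes identically.
same_with_s = ('Content-Length: %s\r\n' % len(body)).encode('ascii')

print(headers[0], same_with_s)
```

The cost is that every header line is built as str and encoded afterwards, which is the extra step the %-on-bytes proposals aim to remove.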
Re: [Python-Dev] PEP 460 reboot
On 14 Jan 2014 04:58, "Guido van Rossum" wrote:
>
> Let me try rebooting the reboot.
>
> My interpretation of Nick's argument is that he is asking for a bytes formatting language that doesn't have an implicit ASCII assumption.
>
> To me this feels absurd. The formatting codes (%s, %c) themselves are expressed as ASCII characters. If you include anything else in the format string besides formatting codes (e.g. b'<%s>'), you are giving it as ASCII characters. I don't know what characters the EBCDIC codes 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.

Except we allow string escapes and programmatic creation of format strings, so while ASCII snippets in formatting code are certainly easier to type, they are by no means a mandatory feature of using interpolation operations. I agree

Can you roll your own binary interpolation support with join() and simple concatenation? Yes, but Antoine's proposal provides a clean and reliable approach to flexible binary templating that isn't offered by the more lenient version.

My problem is with telling Python users that if they're working with ASCII compatible data, they get access to a clean interpolation mini-language for templating purposes, but if they aren't, they don't. That's the part I see as potentially breaking the text model: now you have a convenient API on a core type encouraging you to treat your data as ASCII compatible with implicit serialisation of semantic data as ASCII text, even if that may not be appropriate.

If pure binary interpolation is added at the same time (regardless of the exact spelling, so long as it's as easy to access as the ASCII templating), that objection goes away.

That said, the fact that the interpolation mini-languages themselves assume ASCII is the most compelling rationale I have heard so far for treating interpolation as an operation that inherently assumes ASCII compatibility - you can't use arbitrary bytes in your formatting strings without escaping the formatting characters appropriately. While I don't see that as substantially different to needing to escape them in order to retain them in the output of text or ASCII formatting, it's at least a teachable rationale for the absence of a pure binary equivalent.

> If I had some byte strings in an unknown encoding (but the same encoding for all) that I needed to concatenate I would never think of '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)
>
> If I see some code using *any* formatting operation (regardless of whether it's %d, %r, %s or %c) I am going to assume that there is some ASCII-ness, and if there isn't, the code's author has obscured their goal to me.

Right, that's a rationale I can explain to people.

It also occurred to me that it's easier to build pure binary interpolation on top of ASCII interpolation than I previously thought: I can just check all the input values are compatible with memoryview. At that point, attempting to pass in anything that would trigger implicit encoding at the formatting stage will fail.

(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the bytes(int) trap)

> I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +.
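
[Editorial sketch] The memoryview check mentioned above might look like the following. The helper name require_buffers is hypothetical; only memoryview() and bytes() are real APIs here.

```python
def require_buffers(values):
    """Return each value as bytes, rejecting anything that is not a buffer exporter."""
    checked = []
    for value in values:
        # memoryview() raises TypeError for str, int, None and other
        # objects that do not export a buffer.
        checked.append(bytes(memoryview(value)))
    return checked


accepted = require_buffers([b'ab', bytearray(b'cd')])

# The bytes(int) trap: this is three NUL bytes, not b'3'.
trap = bytes(3)

for bad in (3, 'text'):
    try:
        require_buffers([bad])
    except TypeError as exc:
        print('rejected', repr(bad), '-', exc)

print(accepted, trap)
```

Running the values through such a check before formatting would make any implicit encoding step unreachable, which is the property described in the message.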
Plus there genuinely are formats where different parts have different encodings and you rely on metadata or format definitions to know what they are.

I would actually suggest something like Brett's approach for %s, but with memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate it directly, otherwise invoke normal string formatting and then do strict ASCII encoding at the end.

That way people don't have to learn new formatting mini-languages and only have two new behaviours to learn: buffer exporters are interpolated directly, anything else is formatted normally and then implicitly encoded as strict ASCII.

>
> In my head I make the following classification of situations where you work with bytes and/or text.
>
> (A) Pure binary formats (e.g. most IP-level packet formats, media files, .pyc files, tar/zip files, compressed data, etc.). These are handled using the struct module (e.g. tar/zip) and/or custom C extensions (e.g. gzip).
>
> (B) Encoded text. Here you should just decode everything into str objects and parse your text at that level. If you really want to manipulate the data as bytes (e.g. because you have a lot of data to process and very light processing) you may be a
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 5:31 PM, Donald Stufft wrote:
> %s not accepting str is the major thing I’d personally be against.

To be more clear

    b"%s" % "abc" == No
    b"%s" % 123 == Fine

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 5:25 PM, Eric V. Smith wrote:
> On 1/13/2014 4:59 PM, Guido van Rossum wrote:
>> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman wrote:
>>> If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a.
>>>
>>>
>>> %s for strictly interpolating bytes eases porting. Sad name, but good for compatibility. When the blowup happens, due to having a str type passed, the porter adds the appropriate .encode(...) to the parameter, so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>>
>> Lots of code uses %s with numbers too, and probably the occasional None or list (relying on the Python 2 near-guarantee that most objects' str() is their repr() and that repr() nearly guarantees to return only ASCII).
>>
>> E.g. I'm sure you can find live code doing something like
>>
>>     headers.append('Content-Length: %s\r\n' % len(body))
>>
>
> That's why I think we should support %s taking bytes, int, float. And make %b mean the same thing, if you want. But I think we need to keep %s (however limited) for compatibility with Python 2.
>
> Personally, I'd be okay with %s not accepting str (by raising an exception).
>
> I think that would give us a large "compatibility surface" in common with Python 2.

%s not accepting str is the major thing I’d personally be against.

%s taking numeric types and bytes would be fine.

The main thing I’d be worried about is where the RHS may possibly contain something non ASCII that needs encoding (such as the str case).

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 4:59 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman wrote:
>> If somehow (unlikely though it seems) we end up keeping %s (e.g.
>> strictly to ease porting), we could also keep %r as an alias for %a.
>>
>> %s for strictly interpolating bytes eases porting. Sad name, but good for
>> compatibility. When the blowup happens, due to having a str type passed, the
>> porter adds the appropriate .encode(...) to the parameter, so it doesn't
>> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>
> Lots of code uses %s with numbers too, and probably the occasional
> None or list (relying on the Python 2 near-guarantee that most
> objects' str() is their repr() and that repr() nearly guarantees to
> return only ASCII).
>
> E.g. I'm sure you can find live code doing something like
>
>   headers.append('Content-Length: %s\r\n' % len(body))

That's why I think we should support %s taking bytes, int, float. And
make %b mean the same thing, if you want. But I think we need to keep %s
(however limited) for compatibility with Python 2.

Personally, I'd be okay with %s not accepting str (by raising an exception).

I think that would give us a large "compatibility surface" in common
with Python 2.

Eric.
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon wrote:
> I have been going on the assumption that bytes.format() would change what
> '{}' meant for itself and would only interpolate bytes. That convenient
> between Python 2 and 3 since it represents what we want it to (str and bytes
> under the hood, respectively), so it just falls through. We could also add a
> 'b' conversion for bytes() explicitly so as to help people not accidentally
> mix up things in bytes.format() and str.format(). But I was not suggesting
> adding a specific format spec for bytes but instead making bytes.format()
> just do the .encode('ascii') automatically to help with compatibility when a
> format spec was present. If people want fancy formatting for bytes they can
> always do it themselves before calling bytes.format().

This seems hastily written (e.g. verb missing :-), and I'm not clear
on what you are (or were) actually proposing. When exactly would
bytes.format() need .encode('ascii')?

I would be happy to wait a few hours or days for you to write it up
clearly, rather than responding in a hurry.

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 13:56:44 -0800 Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou wrote:
> > On Mon, 13 Jan 2014 13:32:28 -0800
> > Guido van Rossum wrote:
> >>
> >> But formatb() feels absurd to me. PEP 460 has neither a precise
> >> specification or any actual examples, so I can't tell whether the
> >> intention is that the format string can *only* contain {...} sequences
> >> or whether it can also contain "regular" characters. Translating to
> >> formatb(), my question comes down to the legality of the following
> >> example:
> >>
> >> b'Hello, {}'.formatb(name) # Where name is some bytes object
> >
> > Yes, it's allowed. But so is:
> >
> > b'\xff\x00{}\x85{}'.formatb(payload, trailer)
> >
> > The ASCII bias is because of the bytes literal notation.
>
> But it is nevertheless there. Including arbitrary hex bytes in the
> ASCII range should be a liability, unless you have memorized the hex
> codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

That's a good point. I hadn't really thought about that.

> The above example (is it from a real protocol?)

(no, it's cooked up)

> would be just as clear
> or clearer written as
>
> b'\xff\x00' + payload + b'\x85' + trailer
>
> or
>
> b''.join([b'\xff\x00', payload, b'\x85', trailer])
>
> and reasoning about those versions requires no understanding of ASCII.

Fair enough.

Regards

Antoine.
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:36 PM, Ethan Furman wrote:
> On 01/13/2014 01:20 PM, Mark Lawrence wrote:
>
>> On 13/01/2014 21:01, Paul Moore wrote:
>>
>>> I think this should be for 3.5, and should not involve an accelerated
>>> release of 3.5 - we should get it into the 3.5 code early and let
>>> people thrash out the details during the 3.5 release cycle.
>>
>> I disagree, it should be on pypi now so people can start trying it out,
>> or as others have suggested incorporate it into
>> the six module. Surely that'd make the job of getting it into 3.5 far
>> easier?
>
> It's a bit harder to put a core feature on PyPI. I'm not even sure how it
> would be done. Fortunately, once it is in 3.5 trunk the adventurous can
> build their own and try it out that way.

You make it a function that under Python 2 and < 3.5 does what needs to be
done and on 3.5 just directly calls the underlying method. People will
still have to change their code, but the idea is it becomes a refactoring
instead of a change in how the code is structured.
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:59 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman wrote:
>> On 1/13/2014 12:09 PM, Guido van Rossum wrote:
>>
>> Yeah, the %s behavior with a string argument was a messy attempt at
>> compromise. I was hoping to mimic a common use of %s in Python 2,
>> where it can be used with either an 8-bit string or a number as
>> argument, acting like %b in the former case and like %d in the latter
>> case. Not having %s at all in Python 3 means that porting requires
>> more thinking (== more opportunity for mistakes when you're converting
>> in bulk) and there's no easy way to write code that works in Python 2
>> and 3.
>>
>> If we have %b for strictly interpolating bytes, I'm fine with adding
>> %a for calling ascii() on the argument and then interpolating the
>> result after ASCII-encoding it.
>>
>> If somehow (unlikely though it seems) we end up keeping %s (e.g.
>> strictly to ease porting), we could also keep %r as an alias for %a.
>>
>> %s for strictly interpolating bytes eases porting. Sad name, but good for
>> compatibility. When the blowup happens, due to having a str type passed, the
>> porter adds the appropriate .encode(...) to the parameter, so it doesn't
>> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>
> Lots of code uses %s with numbers too, and probably the occasional
> None or list (relying on the Python 2 near-guarantee that most
> objects' str() is their repr() and that repr() nearly guarantees to
> return only ASCII).
>
> E.g. I'm sure you can find live code doing something like
>
>   headers.append('Content-Length: %s\r\n' % len(body))

But if the alternative is spurious quotes then the choice is clear...
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 4:51 PM, Guido van Rossum wrote:
> Terminology. Let's use the official terminology rather than making stuff
> up.
>
> The docs at http://docs.python.org/3/library/string.html#formatspec
> use the following terminology:
>
> Replacement field: {...}; contains field name, conversion, format spec
> in that order, all optional.
>
> Field name: either a decimal integer (referring to an argument by
> position) or an identifier (by name), or omitted (uses the next
> available position).
>
> Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
> value, and then the format spec applies to the resulting string.
>
> Format spec: colon, bunch of stuff, type; the type is a letter such as
> d (decimal) or s (string), and the stuff between the colon and the
> type is used to specify field width, alignment, sign, padding and
> such.
>
> Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
> this leaves for interpolating bytes if we don't want to use {:s}. The
> docs at
> http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
> don't show %b so it could still be used there, but it would be nicer
> to be consistent.

I have been going on the assumption that bytes.format() would change what
'{}' meant for itself and would only interpolate bytes. That convenient
between Python 2 and 3 since it represents what we want it to (str and
bytes under the hood, respectively), so it just falls through. We could
also add a 'b' conversion for bytes() explicitly so as to help people not
accidentally mix up things in bytes.format() and str.format(). But I was
not suggesting adding a specific format spec for bytes but instead making
bytes.format() just do the .encode('ascii') automatically to help with
compatibility when a format spec was present. If people want fancy
formatting for bytes they can always do it themselves before calling
bytes.format().
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman wrote:
> On 1/13/2014 12:09 PM, Guido van Rossum wrote:
>
> Yeah, the %s behavior with a string argument was a messy attempt at
> compromise. I was hoping to mimic a common use of %s in Python 2,
> where it can be used with either an 8-bit string or a number as
> argument, acting like %b in the former case and like %d in the latter
> case. Not having %s at all in Python 3 means that porting requires
> more thinking (== more opportunity for mistakes when you're converting
> in bulk) and there's no easy way to write code that works in Python 2
> and 3.
>
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
>
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.
>
> %s for strictly interpolating bytes eases porting. Sad name, but good for
> compatibility. When the blowup happens, due to having a str type passed, the
> porter adds the appropriate .encode(...) to the parameter, so it doesn't
> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

Lots of code uses %s with numbers too, and probably the occasional
None or list (relying on the Python 2 near-guarantee that most
objects' str() is their repr() and that repr() nearly guarantees to
return only ASCII).

E.g. I'm sure you can find live code doing something like

  headers.append('Content-Length: %s\r\n' % len(body))

--
--Guido van Rossum (python.org/~guido)
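For the Content-Length case specifically, the number can be interpolated with %d rather than %s. The sketch below assumes an interpreter where the % operator is defined on bytes (Python 2, or Python 3.5 and later); the function names are illustrative only. The second helper avoids bytes formatting altogether and so also runs on Python 3 versions without it.

```python
def content_length_header(body):
    """Build the header with %d; needs % support on bytes (2.x or 3.5+)."""
    return b"Content-Length: %d\r\n" % len(body)


def content_length_header_portable(body):
    """Same header without bytes formatting: encode the digits explicitly."""
    return b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"


body = b"hello world"
headers = []
headers.append(content_length_header(body))
print(headers)
```

Because the decimal rendering of an integer is always pure ASCII, neither version needs to guess an encoding for its argument.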
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou wrote:
> On Mon, 13 Jan 2014 13:32:28 -0800
> Guido van Rossum wrote:
>>
>> But formatb() feels absurd to me. PEP 460 has neither a precise
>> specification or any actual examples, so I can't tell whether the
>> intention is that the format string can *only* contain {...} sequences
>> or whether it can also contain "regular" characters. Translating to
>> formatb(), my question comes down to the legality of the following
>> example:
>>
>> b'Hello, {}'.formatb(name) # Where name is some bytes object
>
> Yes, it's allowed. But so is:
>
> b'\xff\x00{}\x85{}'.formatb(payload, trailer)
>
> The ASCII bias is because of the bytes literal notation.

But it is nevertheless there. Including arbitrary hex bytes in the
ASCII range should be a liability, unless you have memorized the hex
codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

The above example (is it from a real protocol?) would be just as clear
or clearer written as

  b'\xff\x00' + payload + b'\x85' + trailer

or

  b''.join([b'\xff\x00', payload, b'\x85', trailer])

and reasoning about those versions requires no understanding of ASCII.

--
--Guido van Rossum (python.org/~guido)
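The hex-code liability mentioned above is easy to reproduce. The following sketch uses made-up payload and trailer values, and stands in for the hypothetical formatb() with %-formatting and the %b code, which assumes Python 3.5 or later.

```python
payload = b"\x01\x02"
trailer = b"\x03"

# Three spellings of the same frame.
by_concat = b"\xff\x00" + payload + b"\x85" + trailer
by_join = b"".join([b"\xff\x00", payload, b"\x85", trailer])
by_percent = b"\xff\x00%b\x85%b" % (payload, trailer)
print(by_concat == by_join == by_percent)

# The liability: some "arbitrary" hex bytes are formatting metacharacters.
print(b"\x25" == b"%", b"\x7b" == b"{")

# A literal 0x25 byte right before a placeholder turns "%b" into "%%b",
# i.e. an escaped percent sign followed by a plain letter b.
try:
    b"\xff\x25%b" % (payload,)
    template_ok = True
except (TypeError, ValueError):
    template_ok = False
print("template with raw 0x25 formats cleanly:", template_ok)

# Concatenation has no metacharacters, so the same byte is harmless there.
print(b"\xff\x25" + payload)
```

With concatenation or join(), every byte in the literal is just data, which is the sense in which those versions need no knowledge of ASCII.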
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:20 PM, Mark Lawrence wrote:
> On 13/01/2014 21:01, Paul Moore wrote:
>>
>> I think this should be for 3.5, and should not involve an accelerated
>> release of 3.5 - we should get it into the 3.5 code early and let
>> people thrash out the details during the 3.5 release cycle.
>
> I disagree, it should be on pypi now so people can start trying it out,
> or as others have suggested incorporate it into the six module. Surely
> that'd make the job of getting it into 3.5 far easier?

It's a bit harder to put a core feature on PyPI. I'm not even sure how it
would be done. Fortunately, once it is in 3.5 trunk the adventurous can
build their own and try it out that way.

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
Nick Coghlan wrote:
> By allowing format characters that *do* assume ASCII, the entire
> construct is rendered unsafe - you have to look inside the format
> string to determine if it is assuming ASCII compatibility or not, thus
> the entire construct must be deemed as assuming ASCII compatibility at
> the level of static semantic analysis.

I don't see how any of the currently proposed formatting operations make
a data-dependent ASCII assumption.

When you write b"%d" % x, you're not assuming that x is ASCII, you're
assuming that it's an *integer*. The %d conversion of an integer is
defined to produce only ASCII characters, and it works on any integer,
so there's no data-dependent assumption there.

Something that *would* involve such an assumption would be if
b"%s" % 'hello' were defined to encode 'hello' as ASCII. But Guido has
proposed not doing that, and instead interpolating ascii('hello'). Since
ascii() is defined to return only ASCII characters, and works on any
string, there is again no data-dependent assumption.

My preference would be for b"%s" % 'hello' to raise an exception, but
that would still be data-independent.

As for having to look inside the format string to know what types are
expected, that's no different from any other formatting operation. All
it means is that static type analysis in Python is hard, but we already
knew that.

> Allowing these ASCII assuming format codes in the core bytes
> interpolation introduces *exactly* the same problem as is present in
> the Python 2 text model: code that *appears* to support arbitrary
> binary data, but is in fact assuming ASCII compatibility.

Can you provide an example of code using Guido's currently approved
formatting semantics that would fail when given arbitrary binary data?
I don't see how it can happen.

--
Greg
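The data-independence argument above can be checked directly: ascii() yields only ASCII characters whatever string it is given, and %d yields only ASCII digits (and a sign) whatever integer it is given. The sample values below are arbitrary, and the last loop assumes an interpreter where % works on bytes (Python 3.5 or later).

```python
samples = ["hello", "h\xe9llo", "\u65e5\u672c\u8a9e", "\U0001f600"]

for text in samples:
    quoted = ascii(text)
    # Non-ASCII code points come back as backslash escapes, never raw.
    print(quoted, all(ord(ch) < 128 for ch in quoted))

for number in (0, -17, 2 ** 64):
    rendered = b"%d" % number
    # Iterating over bytes yields ints in Python 3.
    print(rendered, all(byte < 128 for byte in rendered))
```

Whether the output is *useful* for str arguments is a separate question, which is the part of the thread about spurious quotes.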
Re: [Python-Dev] PEP 460 reboot
Terminology. Let's use the official terminology rather than making stuff up.

The docs at http://docs.python.org/3/library/string.html#formatspec
use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec
in that order, all optional.

Field name: either a decimal integer (referring to an argument by
position) or an identifier (by name), or omitted (uses the next
available position).

Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
value, and then the format spec applies to the resulting string.

Format spec: colon, bunch of stuff, type; the type is a letter such as
d (decimal) or s (string), and the stuff between the colon and the
type is used to specify field width, alignment, sign, padding and
such.

Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
this leaves for interpolating bytes if we don't want to use {:s}. The
docs at
http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
don't show %b so it could still be used there, but it would be nicer
to be consistent.

--
--Guido van Rossum (python.org/~guido)
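The terms above map onto what string.Formatter().parse() returns for an ordinary str template, which may help when reading the rest of the thread. The template below is an invented example; each tuple is (literal text, field name, format spec, conversion).

```python
import string

template = "id={0!r:>8} flags={flags:b}"

# Split the template into its replacement fields.
for literal, field_name, format_spec, conversion in string.Formatter().parse(template):
    print(repr(literal), field_name, format_spec, conversion)

# {0!r:>8}  -> field name 0, conversion r, format spec >8 (right-align, width 8)
# {flags:b} -> field name flags, no conversion, format spec b (base 2)
print(template.format("ab", flags=5))
```

Note how {:b} already means "binary number" for str.format(), which is the naming clash the message points out.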
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 01:08 PM, Glenn Linderman wrote:
>
> +1 - what Ethan said. A real death, instead of death by inappropriately
> transformed data, is fine by me, if b"%s" % str(...) doesn't have the
> appropriate .encode(...) call. But I could live with either.

You mean instead of death by a thousand quotes?

*ducks and runs*

--
~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 13:32:28 -0800 Guido van Rossum wrote:
>
> But formatb() feels absurd to me. PEP 460 has neither a precise
> specification or any actual examples, so I can't tell whether the
> intention is that the format string can *only* contain {...} sequences
> or whether it can also contain "regular" characters. Translating to
> formatb(), my question comes down to the legality of the following
> example:
>
> b'Hello, {}'.formatb(name) # Where name is some bytes object

Yes, it's allowed. But so is:

  b'\xff\x00{}\x85{}'.formatb(payload, trailer)

The ASCII bias is because of the bytes literal notation.

Regards

Antoine.
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 12:09 PM, Guido van Rossum wrote:
> Yeah, the %s behavior with a string argument was a messy attempt at
> compromise. I was hoping to mimic a common use of %s in Python 2,
> where it can be used with either an 8-bit string or a number as
> argument, acting like %b in the former case and like %d in the latter
> case. Not having %s at all in Python 3 means that porting requires
> more thinking (== more opportunity for mistakes when you're converting
> in bulk) and there's no easy way to write code that works in Python 2
> and 3.
>
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
>
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.

%s for strictly interpolating bytes eases porting. Sad name, but good for
compatibility. When the blowup happens, due to having a str type passed,
the porter adds the appropriate .encode(...) to the parameter, so it
doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
Re: [Python-Dev] PEP 460 reboot
I will doggedly keep posting to this thread rather than creating more threads.

In another thread, Nick has said he's okay with my proposal (not sure
if that includes %s or not, but it now seems of lesser importance) as
long as we simultaneously introduce formatb() and formatb_map() (the
latter is just a minor variation of the former, so I won't mention it
further).

But formatb() feels absurd to me. PEP 460 has neither a precise
specification nor any actual examples, so I can't tell whether the
intention is that the format string can *only* contain {...} sequences
or whether it can also contain "regular" characters. Translating to
formatb(), my question comes down to the legality of the following
example:

  b'Hello, {}'.formatb(name) # Where name is some bytes object

If this is allowed, it reintroduces the ASCII bias (since the
substring 'Hello' is clearly ASCII).

If this isn't allowed, it feels like a perversion of the notion of a
"formatting language", and I really don't see the attraction over
using a combination of concatenation and the struct module, perhaps
augmented with some use of bytes([i]) as an alternative to %c or {!c}
(if that is what is meant by PEP 460 with 'c modifier' -- I can't find
the word 'modifier' in the docs for format()).

Note that I honestly don't understand which of these PEP 460 means.
Either way, PEP 460's motivation seems kind of subjective and
esthetic:

"""
While there are reasonably efficient ways to accumulate binary data
(such as using a bytearray object, the bytes.join method or even
io.BytesIO), none of them leads to the kind of readable and intuitive
code that is produced by a %-formatted or {}-formatted template and a
formatting operation.
"""

I would buy this if a binary format string could contain embedded text
(like 'Hello' in my example above), but then the argument about
avoiding ASCII bias seems to fall apart so I am at a loss about what
Nick actually wants, and even about what PEP 460 actually specifies.
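The alternative named above (concatenation plus the struct module, with bytes([i]) for single byte values) can be sketched as follows. The frame layout here, one kind byte followed by a two-byte big-endian length and then the payload, is invented purely for illustration and does not come from PEP 460 or any real protocol.

```python
import struct


def build_frame(kind, payload):
    """Assemble a frame from binary pieces without any format template.

    kind is an int in range(256); bytes([kind]) turns it into one byte,
    which is the role %c or a 'c modifier' would otherwise play.
    """
    header = bytes([kind]) + struct.pack(">H", len(payload))
    return header + payload


# 0x25 happens to be the ASCII code for '%', but nothing here cares.
print(build_frame(0x25, b"hi"))
```

Nothing in this construction depends on the pieces being ASCII, which is the property the pure-binary camp in the thread is asking for.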
--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
Glenn Linderman wrote:
> Quotes in the stream are a great debug hint, without blowing up.

But do you really want those quotes turning up in a *binary* stream,
where they're somewhere between awkward and near-impossible to spot by
eyeballing, and may only be discovered when something else -- likely a
different program, possibly being run by a different person -- tries to
read the data back, and blows up because the binary format is corrupted?

I'd much rather it blew up at the writing stage, myself. Corrupted binary
data is *much* harder to debug than corrupted text, because binary formats
typically have little to no margin for error before they become complete
garbage.

--
Greg
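To make the difference concrete, here is a small sketch with an invented record format (a two-byte big-endian length followed by that many bytes). The strict writer raises at the point of the mistake; the quiet writer imitates the "interpolate ascii(value)" behaviour and hands the problem to whoever reads the data later. All function names are hypothetical.

```python
import struct


def write_record(name):
    """Strict writer: refuse anything that is not already bytes."""
    if not isinstance(name, (bytes, bytearray)):
        raise TypeError("name must be bytes, not " + type(name).__name__)
    return struct.pack(">H", len(name)) + bytes(name)


def write_record_quietly(name):
    """Lenient writer: the length is for name, the body is its quoted repr."""
    return struct.pack(">H", len(name)) + ascii(name).encode("ascii")


def read_record(data):
    """Reader, possibly a different program run by a different person."""
    (length,) = struct.unpack(">H", data[:2])
    if len(data) != 2 + length:
        raise ValueError("corrupt record: length field does not match body")
    return data[2:]


print(read_record(write_record(b"abc")))
try:
    write_record("abc")                        # fails where the bug is
except TypeError as exc:
    print("writer:", exc)
try:
    read_record(write_record_quietly("abc"))   # fails somewhere else, later
except ValueError as exc:
    print("reader:", exc)
```

In the quiet case the two extra quote bytes shift everything after the length field, so the error surfaces far from the line that caused it.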
Re: [Python-Dev] PEP 460 reboot
On 13/01/2014 21:01, Paul Moore wrote:
>
> I think this should be for 3.5, and should not involve an accelerated
> release of 3.5 - we should get it into the 3.5 code early and let
> people thrash out the details during the 3.5 release cycle.

I disagree, it should be on pypi now so people can start trying it out,
or as others have suggested incorporate it into the six module. Surely
that'd make the job of getting it into 3.5 far easier?

> Paul.
>
> PS For all the heated arguments and occasional frayed tempers, this
> has been an impressively civil debate. I think that's one of the best
> things about python-dev, that discussions like these never degenerate
> into flamewars. Kudos to all concerned!

+1

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 9:38 AM, Ethan Furman wrote:
> On 01/13/2014 09:31 AM, Antoine Pitrou wrote:
>> On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote:
>>>
>>> You mean crash all the time? I'd be fine with that for both the str
>>> case and the bytes case. But it's probably too late to change the str
>>> case, and the bytes case should mirror what str does.
>>
>> Let me add something else: str and bytes don't have to be symmetrical.
>> In Python 2, str and unicode were symmetrical, they allowed exactly the
>> same operations and were composable. In Python 3, str and bytes are
>> different beasts; they have different operations *and* different
>> semantics (for example, bytes interoperates with bytearray and
>> memoryview, while str doesn't).
>
> This makes sense to me.
>
> So I guess I'm fine with either the quoted ascii repr or the always
> blowing up method, with leaning towards the blowing up method.

+1 - what Ethan said. A real death, instead of death by inappropriately
transformed data, is fine by me, if b"%s" % str(...) doesn't have the
appropriate .encode(...) call. But I could live with either.
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 10:40 AM, Brett Cannon wrote:
> This even gives people in-place ASCII encoding for strings by always
> using '{:s}' with text which they can do when they port their code to
> run under both Python 2 and 3. So you should be able to do
> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
> If you want more explicit encoding to latin-1 then you need to do it
> explicitly and not rely on the mini-language to do tricks for you.

> My preference is not to have any, but if Guido is going to say PBP here
> then I want absolute consistency across the board in how bytes.format()
> tweaks things.

> As for %s for the % operator calling ascii(), I think that will be a
> porting nightmare of finding out why your bytes suddenly stopped being
> formatted properly and then having to crawl through all of your code
> for that one use of %s which is getting bytes in. By raising a
> TypeError you will very easily detect where your screw-up occurred
> thanks to the traceback; doing otherwise feels too much like implicit
> type conversion and ask any JavaScript developer how that can be a bad
> thing.

So quote 3 is necessarily a violation of quote 1. But if quote 2 can
allow for one exception to its absolute consistency... that is probably
the best solution overall...
Re: [Python-Dev] PEP 460 reboot and a bitter fight
On 1/13/2014 5:06 AM, Nick Coghlan wrote:
> I figured out tonight that it's only positioning ASCII interpolation
> as an *alternative* to adding binary interpolation that I have a
> problem with. It isn't, because you lose the structural assurance that
> you haven't inadvertently introduced an assumption of ASCII
> compatibility when you didn't need to.
>
> However, interpolation support is a convenient enough interface that I
> can see a version that *only* supports ASCII compatible interpolation
> being an attractive nuisance that becomes a source of hard to detect
> and fix data corruption bugs (just like the str type in Python 2).
>
> If we add both, my objections go away: people like me can use the
> Python 3 only formatb and formatb_map methods and be confident we
> haven't inadvertently introduced any assumptions regarding ASCII
> compatibility, while folks that know they're dealing with an ASCII
> compatible format can use the ASCII assuming versions that are
> designed to be source compatible with Python 2.
>
> If someone incorrectly uses format() or format_map() when they should
> be using the pure binary versions, that's a trivial bug fix (adding
> the necessary "b", and perhaps some explicit encoding calls) rather
> than a major restructuring of the code. If they use mod-formatting,
> that's a slightly bigger fix, but still just switching to a different
> spelling of the formatting operation.
>
> Both use cases (binary only and ASCII compatible) get covered cleanly,
> and nobody has to lose out.
>
> Cheers,
> Nick.

As part of that, what about an alternate spelling of % to allow
binary-only interpolation operations using the handy syntax of % ?
Doesn't seem like / is defined for bytes or str on the LHS.
Re: [Python-Dev] PEP 460 reboot
On 13 January 2014 18:58, Guido van Rossum wrote:
> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

For the record, Guido's reboot posting and rationale has convinced me,
and I am essentially in favour of his proposal. Nick's remaining
objection seems to me to have some validity if the format string is a
user-supplied variable, but this type of usage is vanishingly small in
my experience, and shouldn't dictate the whole design.

I don't like b'%s' % 'x' behaviour, and would prefer one of the
alternatives. I'm not entirely clear about the details of the
alternative proposals, so I won't try to pick one.

I think this should be for 3.5, and should not involve an accelerated
release of 3.5 - we should get it into the 3.5 code early and let
people thrash out the details during the 3.5 release cycle.

Paul.

PS For all the heated arguments and occasional frayed tempers, this
has been an impressively civil debate. I think that's one of the best
things about python-dev, that discussions like these never degenerate
into flamewars. Kudos to all concerned!
Re: [Python-Dev] PEP 460 reboot
Guido van Rossum wrote:
> On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote:
>> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>>>
>>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>>> enclosed in single quotes)
>>
>> I'm not sure about the quotes. Would anyone ever actually want those
>> in the byte stream?
>
> Perhaps not, but it's a hint that you should probably think about an
> encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
> it as payback time. :-)

If it's never useful, wouldn't it be better to raise an exception in this
case? That way, someone porting code from py2 that does this without
appropriate modification will find out about the problem immediately,
rather than have spurious quotes inserted into their binary data, which
-- being binary data -- will likely go unnoticed until something else
tries to read the data.

I don't think the rule against operations that work on all-but-one-type
really applies here, because the mistake it's intended to catch is not an
obscure corner case. If your program's logic includes interpolating
strings into bytes objects, then you're going to be testing that.

--
Greg
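The two behaviours being weighed here can be compared side by side. The first assertion below is the existing str-side behaviour the quoted message refers to. The rest is run on a modern interpreter and reflects what Python 3.5 and later ended up doing (%a gives the quoted form, %s with a str argument raises), which was not yet decided at the time of this message; treat it as an illustration of the two options rather than of anything settled in the thread.

```python
# Existing behaviour on the str side: bytes interpolated into text show
# up with their b'' wrapper (run without the -b/-bb interpreter flags).
text_side = "%s" % b"x"
print(text_side)

# The "quoted" option, spelled %a on interpreters that have bytes %-formatting.
quoted = b"%a" % "x"
print(quoted, len(quoted))

# The "raise" option: %s with a str argument fails immediately.
try:
    b"%s" % "x"
    outcome = "formatted"
except TypeError:
    outcome = "TypeError"
print(outcome)
```

The three-byte result of the quoted option is exactly the b"'x'" value discussed above.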
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:49 AM, Mark Shannon wrote: So why not replace '%s' with '%a' for the ascii case and with '%b' for directly inserting bytes. Because %a and %b don't exist in Python 2.7? I thought this was about 3.5, not 2.7 ;) '%s' can't work in 3.5, as we must differentiate between strings which need to be encoded and bytes which don't. It's about migrating code to reach a point where it can work on both 2.7 and 3.5.
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 12:02 PM, Brett Cannon wrote:
> Personally, neither would I; just focus on bytes.format() and let % operator
> on strings slowly go away.

Hey, now, some of us like %! ;)

-- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 03:09 PM, Guido van Rossum wrote:
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
>
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.

Wouldn't %s as an alias for %b simplify porting from Python 2?
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 3:11 PM, Yury Selivanov wrote:
> On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:
>> I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar"
>>
>> Instead of "%b" could "%j" mean "I should have used + or join() here
>> but was too lazy" and work on str too?
>
> Isn’t this just error prone? Since it’s a new format character, many,
> probably, would write %s by mistake. And, besides, there was no %j
> in python2.

Merely a flesh wound.
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon wrote:
> On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy wrote:
>> I personally would not add 'bytes % whatever'.
>
> Personally, neither would I; just focus on bytes.format() and let % operator
> on strings slowly go away.

Well, % has some very strong arguments in its favor still -- for example, the sheer amount of code that currently uses it, the fact that it's as close as we get to a cross-language standard, and the fact that nobody wants to tackle its use in the logging module (since logger objects are often shared between packages that don't know about each other).

Anyway, the % or .format() issue seems completely orthogonal to the issues that get people riled up (which are mostly about whether using either implies some kind of ASCII compatibility).

-- --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 460 reboot
On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:
> I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar"
>
> Instead of "%b" could "%j" mean "I should have used + or join() here
> but was too lazy" and work on str too?

Isn’t this just error prone? Since it’s a new format character, many, probably, would write %s by mistake. And, besides, there was no %j in python2.

- Yury
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 11:57 AM, Barry Warsaw wrote:
> On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:
>>On Jan 13, 2014, at 1:58 PM, Guido van Rossum wrote:
>>> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
>>> clear, and if the noise about that sub-issue is preventing folks from
>>> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
>>> use %b which would require its argument to be bytes. Those bytes
>>> should still probably be ASCII-ish, but there's no way to test that.
>>> That's fine with me and should be fine to Nick as well -- PEP 460
>>> doesn't check that your encodings match (how could it? :-), nor does
>>> plain string concatenation using +.
>>
>>I think disallowing %s is the right thing to do, but I definitely think
>>numbers and %b should be allowed.
>
> I guess I agree. The behavior of b'%s' % 'x' returning b"'x'" is almost
> always useless at best. (I would have thought maybe %a for ascii() but don't
> care that strongly.)

Yeah, the %s behavior with a string argument was a messy attempt at compromise. I was hoping to mimic a common use of %s in Python 2, where it can be used with either an 8-bit string or a number as argument, acting like %b in the former case and like %d in the latter case. Not having %s at all in Python 3 means that porting requires more thinking (== more opportunity for mistakes when you're converting in bulk) and there's no easy way to write code that works in Python 2 and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding %a for calling ascii() on the argument and then interpolating the result after ASCII-encoding it.

If somehow (unlikely though it seems) we end up keeping %s (e.g. strictly to ease porting), we could also keep %r as an alias for %a.

-- --Guido van Rossum (python.org/~guido)
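As a rough sketch of what the two codes described above would do to a single argument, assuming nothing beyond the wording in this message: `percent_b` and `percent_a` are hypothetical names used only for illustration, and padding or width modifiers are ignored.

```python
def percent_b(value) -> bytes:
    """Sketch of the proposed %b: strictly interpolate bytes-like input."""
    if not isinstance(value, (bytes, bytearray, memoryview)):
        raise TypeError(
            "%b requires a bytes-like object, not " + type(value).__name__
        )
    return bytes(value)


def percent_a(value) -> bytes:
    """Sketch of the proposed %a: call ascii(), then ASCII-encode the result."""
    return ascii(value).encode("ascii")
```

Under this reading, `percent_a('x')` yields the three bytes `'x'` including the quotes, which is the output that %s was originally going to produce for a string argument, while `percent_b('x')` refuses the str outright.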
Re: [Python-Dev] PEP 460 reboot
I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'foo'bar" Instead of "%b" could "%j" mean "I should have used + or join() here but was too lazy" and work on str too? On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy wrote: > On 1/13/2014 1:40 PM, Brett Cannon wrote: > >> > So bytes formatting really needn't (and shouldn't, IMO) mirror str >> > formatting. > > > This was my presumption in writing byteformat(). > > >> I think one of the things about Guido's proposal that bugs me is that it >> breaks the mental model of the .format() method from str in terms of how >> the mini-language works. For str.format() you have the conversion and >> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the >> conversion by calling the appropriate built-in, e.g. 'r' calls repr(). >> The format spec semantically gets passed with the object to format() >> which calls the object's __format__() method: ``format(number, 'd')``. >> >> Now Guido's suggestion has two parts that affect the mini-language for >> .format(). One is that for bytes.format() the default conversion is >> bytes() instead of str(), which is fine (probably want to add 'b' as a >> conversion value as well to be consistent). But the other bit is that >> the format spec goes from semantically meaning ``format(thing, >> format_spec)`` to ``format(thing, format_spec).encode('ascii', >> 'strict')`` for at least numbers. That implicitness bugs me as I have >> always thought of format specs just leading to a call to format(). I >> think I can live with it, though, as long as it is **consistently** >> applied across the board for bytes.format(); every use of a format spec >> leads to calling ``format(thing, format_spec).encode('ascii', >> 'strict')`` no matter what type 'thing' would be and it is clearly >> documented that this is done to ease porting and handle the common case >> then I can live with it. 
> > > This is how my byteformat function works, except that when no format_spec is > given, byte and bytearrary objects are left unchanged rather than being > decoded and encoded again. > > >> This even gives people in-place ASCII encoding for strings by always >> using '{:s}' with text which they can do when they port their code to >> run under both Python 2 and 3. So you should be able to do >> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. >> If you want more explicit encoding to latin-1 then you need to do it >> explicitly and not rely on the mini-language to do tricks for you. >> >> IOW I want to treat the format mini-language as a language and thus not >> have any special-casing or massive shifts in meaning between >> str.format() and bytes.format() so my mental model doesn't have to >> contort based on whether it's str or bytes. My preference is not have >> any, but if Guido is going say PBP here then I want absolute consistency >> across the board in how bytes.format() tweaks things. >> >> As for %s for the % operator calling ascii(), I think that will be a >> porting nightmare of finding out why your bytes suddenly stopped being >> formatted properly and then having to crawl through all of your code for >> that one use of %s which is getting bytes in. By raising a TypeError you >> will very easily detect where your screw-up occurred thanks to the >> traceback; do so otherwise feels too much like implicit type conversion >> and ask any JavaScript developer how that can be a bad thing. > > > I personally would not add 'bytes % whatever'. 
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy wrote: > On 1/13/2014 1:40 PM, Brett Cannon wrote: > > > So bytes formatting really needn't (and shouldn't, IMO) mirror str >> > formatting. >> > > This was my presumption in writing byteformat(). > > > I think one of the things about Guido's proposal that bugs me is that it >> breaks the mental model of the .format() method from str in terms of how >> the mini-language works. For str.format() you have the conversion and >> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the >> conversion by calling the appropriate built-in, e.g. 'r' calls repr(). >> The format spec semantically gets passed with the object to format() >> which calls the object's __format__() method: ``format(number, 'd')``. >> >> Now Guido's suggestion has two parts that affect the mini-language for >> .format(). One is that for bytes.format() the default conversion is >> bytes() instead of str(), which is fine (probably want to add 'b' as a >> conversion value as well to be consistent). But the other bit is that >> the format spec goes from semantically meaning ``format(thing, >> format_spec)`` to ``format(thing, format_spec).encode('ascii', >> 'strict')`` for at least numbers. That implicitness bugs me as I have >> always thought of format specs just leading to a call to format(). I >> think I can live with it, though, as long as it is **consistently** >> applied across the board for bytes.format(); every use of a format spec >> leads to calling ``format(thing, format_spec).encode('ascii', >> 'strict')`` no matter what type 'thing' would be and it is clearly >> documented that this is done to ease porting and handle the common case >> then I can live with it. >> > > This is how my byteformat function works, except that when no format_spec > is given, byte and bytearrary objects are left unchanged rather than being > decoded and encoded again. Right, which is what the default conversion covers. 
And as your code shows this can be made available today without having to wait for Python 3.5 and so can go up on PyPI and be used **today**. > > > This even gives people in-place ASCII encoding for strings by always >> using '{:s}' with text which they can do when they port their code to >> run under both Python 2 and 3. So you should be able to do >> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. >> If you want more explicit encoding to latin-1 then you need to do it >> explicitly and not rely on the mini-language to do tricks for you. >> >> IOW I want to treat the format mini-language as a language and thus not >> have any special-casing or massive shifts in meaning between >> str.format() and bytes.format() so my mental model doesn't have to >> contort based on whether it's str or bytes. My preference is not have >> any, but if Guido is going say PBP here then I want absolute consistency >> across the board in how bytes.format() tweaks things. >> >> As for %s for the % operator calling ascii(), I think that will be a >> porting nightmare of finding out why your bytes suddenly stopped being >> formatted properly and then having to crawl through all of your code for >> that one use of %s which is getting bytes in. By raising a TypeError you >> will very easily detect where your screw-up occurred thanks to the >> traceback; do so otherwise feels too much like implicit type conversion >> and ask any JavaScript developer how that can be a bad thing. >> > > I personally would not add 'bytes % whatever'. Personally, neither would I; just focus on bytes.format() and let % operator on strings slowly go away. 
-Brett
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:
>On Jan 13, 2014, at 1:58 PM, Guido van Rossum wrote:
>
>> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
>> clear, and if the noise about that sub-issue is preventing folks from
>> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
>> use %b which would require its argument to be bytes. Those bytes
>> should still probably be ASCII-ish, but there's no way to test that.
>> That's fine with me and should be fine to Nick as well -- PEP 460
>> doesn't check that your encodings match (how could it? :-), nor does
>> plain string concatenation using +.
>
>I think disallowing %s is the right thing to do, but I definitely think
>numbers and %b should be allowed.

I guess I agree. The behavior of b'%s' % 'x' returning b"'x'" is almost always useless at best. (I would have thought maybe %a for ascii() but don't care that strongly.)

-Barry
Re: [Python-Dev] PEP 460 reboot
On 1/13/2014 1:40 PM, Brett Cannon wrote:
>> So bytes formatting really needn't (and shouldn't, IMO) mirror str
>> formatting.

This was my presumption in writing byteformat().

> I think one of the things about Guido's proposal that bugs me is that it
> breaks the mental model of the .format() method from str in terms of how
> the mini-language works. For str.format() you have the conversion and
> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the
> conversion by calling the appropriate built-in, e.g. 'r' calls repr().
> The format spec semantically gets passed with the object to format()
> which calls the object's __format__() method: ``format(number, 'd')``.
>
> Now Guido's suggestion has two parts that affect the mini-language for
> .format(). One is that for bytes.format() the default conversion is
> bytes() instead of str(), which is fine (probably want to add 'b' as a
> conversion value as well to be consistent). But the other bit is that
> the format spec goes from semantically meaning ``format(thing,
> format_spec)`` to ``format(thing, format_spec).encode('ascii',
> 'strict')`` for at least numbers. That implicitness bugs me as I have
> always thought of format specs just leading to a call to format(). I
> think I can live with it, though, as long as it is **consistently**
> applied across the board for bytes.format(): every use of a format spec
> leads to calling ``format(thing, format_spec).encode('ascii',
> 'strict')`` no matter what type 'thing' would be, and it is clearly
> documented that this is done to ease porting and handle the common case.

This is how my byteformat function works, except that when no format_spec is given, bytes and bytearray objects are left unchanged rather than being decoded and encoded again.

> This even gives people in-place ASCII encoding for strings by always
> using '{:s}' with text which they can do when they port their code to
> run under both Python 2 and 3. So you should be able to do
> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
> If you want more explicit encoding to latin-1 then you need to do it
> explicitly and not rely on the mini-language to do tricks for you.
>
> IOW I want to treat the format mini-language as a language and thus not
> have any special-casing or massive shifts in meaning between
> str.format() and bytes.format() so my mental model doesn't have to
> contort based on whether it's str or bytes. My preference is not to have
> any, but if Guido is going to say PBP here then I want absolute consistency
> across the board in how bytes.format() tweaks things.
>
> As for %s for the % operator calling ascii(), I think that will be a
> porting nightmare of finding out why your bytes suddenly stopped being
> formatted properly and then having to crawl through all of your code for
> that one use of %s which is getting bytes in. By raising a TypeError you
> will very easily detect where your screw-up occurred thanks to the
> traceback; doing otherwise feels too much like implicit type conversion
> and ask any JavaScript developer how that can be a bad thing.

I personally would not add 'bytes % whatever'.

-- Terry Jan Reedy
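The per-field rule discussed in this message can be written down as a small function. This is a sketch under stated assumptions, not the byteformat() code referred to above (which is not shown in this thread): `format_to_ascii` is a hypothetical name, and it only covers the handling of one field, not the parsing of a whole format string.

```python
def format_to_ascii(thing, format_spec: str = "") -> bytes:
    """Hypothetical per-field rule: format(), then strict ASCII encoding.

    When no format spec is given, bytes and bytearray objects are passed
    through unchanged instead of being decoded and re-encoded.
    """
    if not format_spec and isinstance(thing, (bytes, bytearray)):
        return bytes(thing)
    return format(thing, format_spec).encode("ascii", "strict")


# The Content-Type example from the message, spelled with the helper:
header = b"Content-Type: " + format_to_ascii("image/jpeg", "s")
```

Because the encode step is strict, non-ASCII text given with a spec fails with UnicodeEncodeError rather than being silently written out in some other encoding.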
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 1:58 PM, Guido van Rossum wrote:
> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

I think disallowing %s is the right thing to do, but I definitely think numbers and %b should be allowed.

- Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Re: [Python-Dev] PEP 460 reboot
Let me try rebooting the reboot.

My interpretation of Nick's argument is that he is asking for a bytes formatting language that doesn't have an implicit ASCII assumption. To me this feels absurd. The formatting codes (%s, %c) themselves are expressed as ASCII characters. If you include anything else in the format string besides formatting codes (e.g. b'<%s>'), you are giving it as ASCII characters. I don't know what characters the EBCDIC codes 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.

If I had some byte strings in an unknown encoding (but the same encoding for all) that I needed to concatenate I would never think of '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.) If I see some code using *any* formatting operation (regardless of whether it's %d, %r, %s or %c) I am going to assume that there is some ASCII-ness, and if there isn't, the code's author has obscured their goal to me.

I hear the objections against b'%s' % 'x' returning b"'x'" loud and clear, and if the noise about that sub-issue is preventing folks from seeing the absurdity in PEP 460, we can talk about a compromise, e.g. use %b which would require its argument to be bytes. Those bytes should still probably be ASCII-ish, but there's no way to test that. That's fine with me and should be fine to Nick as well -- PEP 460 doesn't check that your encodings match (how could it? :-), nor does plain string concatenation using +.

In my head I make the following classification of situations where you work with bytes and/or text.

(A) Pure binary formats (e.g. most IP-level packet formats, media files, .pyc files, tar/zip files, compressed data, etc.). These are handled using the struct module (e.g. tar/zip) and/or custom C extensions (e.g. gzip).

(B) Encoded text. Here you should just decode everything into str objects and parse your text at that level. If you really want to manipulate the data as bytes (e.g. because you have a lot of data to process and very light processing) you may be able to do it, but unless it's a verbatim copy, you are probably going to make assumptions about the encoding. You are also probably going to mess up for some encodings (e.g. leave BOM turds in the middle of a file).

(C) Loosely text-based protocols and formats that have an ASCII assumption in the spec. Most classic Internet protocols (FTP, SMTP, HTTP, IRC, etc.) fall in this category; I expect there are also plenty of file formats using similar conventions (e.g. mailbox files). These protocols and formats often require text-ish manipulations, e.g. for case-insensitive headers or commands, or to split things at whitespace. This is where I find uses for the current ASCII-assuming bytes operations (e.g. b.lower(), b.split(), but also int(b)) and where the lack of number formatting (especially %d and %x) is most painful. I see no benefit in forcing the programmer writing such protocol-handling code to use more cumbersome ways of converting between numbers and bytes, nor in forcing them to insert an encoding/decoding layer -- these protocols often switch between text and binary data at line boundaries, so the most basic part of parsing (splitting the input into lines) must still happen in the realm of bytes.

IMO PEP 460 and the mindset that goes with it don't apply to any of these three cases. Also, IMO requiring a new type to handle (C) seems to add too much complexity, and adds to porting efforts.

I may have felt differently in the past, but ATM I feel that if newer versions of Python 3 make porting of Python 2 code easier, through minor compromises, that's a *good* thing. (Example: adding u"..." literals to 3.3.)

-- --Guido van Rossum (python.org/~guido)
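To make case (C) concrete, here is a small sketch of protocol-style code written without any bytes formatting, i.e. the "more cumbersome ways" mentioned above. The function names and the sample header block are invented for illustration; only the bytes methods and int(b) come from the message.

```python
def status_line(code: int, reason: bytes) -> bytes:
    """Build an HTTP-style status line; the number takes a detour via str."""
    return b"HTTP/1.1 " + str(code).encode("ascii") + b" " + reason + b"\r\n"


def chunk_header(size: int) -> bytes:
    """Hex-encode a chunk size, the job a bytes %x would do directly."""
    return format(size, "x").encode("ascii") + b"\r\n"


def parse_headers(raw: bytes) -> dict:
    """Split a header block into lines and fields while staying in bytes."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if not line:
            continue
        name, _, value = line.partition(b":")
        # Header names are case-insensitive, so normalise with bytes.lower().
        headers[name.strip().lower()] = value.strip()
    return headers


parsed = parse_headers(b"Content-Length: 42\r\nX-Flag: YES\r\n")
length = int(parsed[b"content-length"])  # int() accepts ASCII digits in bytes
```

Parsing already reads naturally thanks to the ASCII-assuming bytes methods; it is the output side, where every number needs `str(...).encode("ascii")`, that number formatting codes would shorten.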
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 09:12 AM, Nick Coghlan wrote:
> On 14 January 2014 01:54, Ethan Furman wrote:
>> Forgive me for being dense, but I don't understand your objection. With
>> Guido's proposal, '%s' % bytes_data, bytes_data is passed through
>> unchanged. Did you mean something else by "binary data"?
>
> I mean it will work, but it will mean you've introduced an implicit
> assumption of ASCII compatibility into the structure your program

Okay, I'm still trying to understand. Apparently we both mean the same thing by binary data / bytes, so the difference must be the %s, yes? And the concern is that because you have used %s as the format code, if somebody accidentally put, say, "stupid bug" on the RHS you would end up with b"'stupid bug'" instead of an exception, which you get if you had used %b instead.

Am I following?

-- ~Ethan~
Re: [Python-Dev] PEP 460 reboot
On Jan 13, 2014, at 1:45 PM, Daniel Holth wrote: > On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray > wrote: >> On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou >> wrote: >>> On Sun, 12 Jan 2014 18:11:47 -0800 >>> Guido van Rossum wrote: On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote: > On 01/12/2014 04:47 PM, Guido van Rossum wrote: >> %s seems the trickiest: I think with a bytes argument it should just >> insert those bytes (and the padding modifiers should work too), and >> for other types it should probably work like %a, so that it works as >> expected for numeric values, and with a string argument it will return >> the ascii()-variant of its repr(). Examples: >> >> b'%s' % 42 == b'42' >> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' >> enclosed in single quotes) > > I'm not sure about the quotes. Would anyone ever actually want those in > the > byte stream? Perhaps not, but it's a hint that you should probably think about an encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of it as payback time. :-) >>> >>> What is the use case for embedding a quoted ASCII-encoded representation >>> in a byte stream? >> >> There is no use case in the sense you are asking, just like there is no >> real use case for '%s' % b'x' producing "b'x'". But the real use case >> is exactly the same: to let you know your code is screwed up without >> actually blowing up with a encoding Exception. >> >> For the record, I like Guido's logic and proposal. I don't understand >> Nick's objection, since I don't see the difference between the situation >> here where a string gets interpolated into bytes as 'xxx' and the >> corresponding situation where bytes gets interpolated into a string >> as b'xxx'. Why struggle to keep bytes interpolation "pure" if string >> interpolation isn't? >> >> Guido's proposal makes the language more symmetric, and thus more >> consistent and less surprising. Exactly the hallmarks of Python's design >> sense, IMO. 
(Big surprise, right? :) >> >> Of course, this point of view *is* based on the idea that when you are >> doing interpolation using %/.format, you are in fact primarily concerned >> with ASCII compatible byte streams. This is a Practicality sort of >> argument. It is, after all, by far the most common use case when >> doing interpolation[*]. >> >> If you wanted to do a purist version of this symmetry, you'd have bytes(x) >> calling __bytes__ if it was defined and falling back to calling a >> __brepr__ otherwise. >> >> But what would __brepr__ implement? The variety of format codes in >> the struct module argues that there is no "one obvious" binary >> repr for most types. (Those that have one would implement __bytes__). >> And what would be the __brepr__ of an arbitrary 'object'? >> >> Faced with the impracticality of defining __brepr__ usefully in any "pure >> bytes" form, it seems sensible to admit that the most useful __brepr__ >> is the ascii() encoding of the __repr__. Which naturally produces 'xxx' >> as the __brepr__ of a string. >> >> This does cause things to get a little un-pretty when you are operating >> at the python prompt: >> > b'%s' % object >>b'""' >> >> But then again that is most likely really not what you mean to do, so >> it becomes a big red flag...just like b'xxx' is a small red flag when >> you accidentally interpolate unencoded bytes into a string. >> >> --David >> >> PS: When I first read Guido's remark that the result of interpolating a >> string should be 'xxx', I went Wah? I had to reason my way through to >> it as above, but to him it was just the natural answer. Guido isn't >> always right, but this kind of automatic language design consistency >> is one reason he's the BDFL. >> >> [*] I still think that you mostly want to design your library so that >> you are handling the text parts as text and the bytes parts as bytes, >> and encoding/gluing them as appropriate at the IO boundary. 
But if Guido >> says his real code would benefit by being able to interpolate ASCII into >> bytes at certain points, I'll believe him. > > If you think corrupted data is easier or more pleasant to track down > than encoding exceptions then I think you are strange. It makes > porting really difficult while you are still trying to figure out > where the bytes/str boundaries are. I am now deeply suspicious of all > % formatting.

For the record, I think %d, %f and similar codes, where the RHS is guaranteed to produce only ASCII-compatible “characters”, are fine, and it’s perfectly acceptable to have an implicit ASCII encode for them. The %s code
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray wrote: > On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou > wrote: >> On Sun, 12 Jan 2014 18:11:47 -0800 >> Guido van Rossum wrote: >> > On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote: >> > > On 01/12/2014 04:47 PM, Guido van Rossum wrote: >> > >> %s seems the trickiest: I think with a bytes argument it should just >> > >> insert those bytes (and the padding modifiers should work too), and >> > >> for other types it should probably work like %a, so that it works as >> > >> expected for numeric values, and with a string argument it will return >> > >> the ascii()-variant of its repr(). Examples: >> > >> >> > >> b'%s' % 42 == b'42' >> > >> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' >> > >> enclosed in single quotes) >> > > >> > > I'm not sure about the quotes. Would anyone ever actually want those in >> > > the >> > > byte stream? >> > >> > Perhaps not, but it's a hint that you should probably think about an >> > encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of >> > it as payback time. :-) >> >> What is the use case for embedding a quoted ASCII-encoded representation >> in a byte stream? > > There is no use case in the sense you are asking, just like there is no > real use case for '%s' % b'x' producing "b'x'". But the real use case > is exactly the same: to let you know your code is screwed up without > actually blowing up with a encoding Exception. > > For the record, I like Guido's logic and proposal. I don't understand > Nick's objection, since I don't see the difference between the situation > here where a string gets interpolated into bytes as 'xxx' and the > corresponding situation where bytes gets interpolated into a string > as b'xxx'. Why struggle to keep bytes interpolation "pure" if string > interpolation isn't? > > Guido's proposal makes the language more symmetric, and thus more > consistent and less surprising. 
Exactly the hallmarks of Python's design > sense, IMO. (Big surprise, right? :) > > Of course, this point of view *is* based on the idea that when you are > doing interpolation using %/.format, you are in fact primarily concerned > with ASCII compatible byte streams. This is a Practicality sort of > argument. It is, after all, by far the most common use case when > doing interpolation[*]. > > If you wanted to do a purist version of this symmetry, you'd have bytes(x) > calling __bytes__ if it was defined and falling back to calling a > __brepr__ otherwise. > > But what would __brepr__ implement? The variety of format codes in > the struct module argues that there is no "one obvious" binary > repr for most types. (Those that have one would implement __bytes__). > And what would be the __brepr__ of an arbitrary 'object'? > > Faced with the impracticality of defining __brepr__ usefully in any "pure > bytes" form, it seems sensible to admit that the most useful __brepr__ > is the ascii() encoding of the __repr__. Which naturally produces 'xxx' > as the __brepr__ of a string. > > This does cause things to get a little un-pretty when you are operating > at the python prompt: > > >>> b'%s' % object > b'""' > > But then again that is most likely really not what you mean to do, so > it becomes a big red flag...just like b'xxx' is a small red flag when > you accidentally interpolate unencoded bytes into a string. > > --David > > PS: When I first read Guido's remark that the result of interpolating a > string should be 'xxx', I went Wah? I had to reason my way through to > it as above, but to him it was just the natural answer. Guido isn't > always right, but this kind of automatic language design consistency > is one reason he's the BDFL. > > [*] I still think that you mostly want to design your library so that > you are handling the text parts as text and the bytes parts as bytes, > and encoding/gluing them as appropriate at the IO boundary. 
But if Guido > says his real code would benefit by being able to interpolate ASCII into > bytes at certain points, I'll believe him. If you think corrupted data is easier or more pleasant to track down than encoding exceptions then I think you are strange. It makes porting really difficult while you are still trying to figure out where the bytes/str boundaries are. I am now deeply suspicious of all % formatting.
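To make the disagreement concrete, here is a minimal sketch of the %s rule quoted above. `proposed_percent_s` is a hypothetical helper written only for illustration; it is not an existing Python API. It models the quoted description (bytes pass through unchanged, everything else goes through ascii()) and nothing more.

```python
def proposed_percent_s(value):
    """Hypothetical model of the proposed b'%s' rule (an assumption, not a real API)."""
    if isinstance(value, (bytes, bytearray)):
        # A bytes argument is inserted as-is.
        return bytes(value)
    # Anything else behaves like %a: the ascii() of the value, as ASCII bytes.
    return ascii(value).encode('ascii')

number_result = proposed_percent_s(42)     # digits, as expected for numbers
text_result = proposed_percent_s('x')      # quotes included: the "hint"
bytes_result = proposed_percent_s(b'x')    # unchanged

# The existing str-side behaviour that the proposal mirrors:
str_side = '%s' % b'x'
```

Under this model a str argument never raises; it shows up in the output wrapped in quotes, which is exactly the silent-but-wrong result the reply above objects to.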
Re: [Python-Dev] PEP 460 reboot
On Mon, Jan 13, 2014 at 12:31 PM, Antoine Pitrou wrote: > On Mon, 13 Jan 2014 08:36:05 -0800 > Ethan Furman wrote: > > > On 01/13/2014 08:09 AM, Antoine Pitrou wrote: > > > On Mon, 13 Jan 2014 07:59:10 -0800 > > > Guido van Rossum wrote: > > >> On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou > wrote: > > >>> What is the use case for embedding a quoted ASCII-encoded > representation > > >>> in a byte stream? > > >> > > >> It doesn't crash but produces undesired output (always, not only when > > >> the data is non-ASCII) that gives the developer a hint to think about > > >> encoding to bytes. > > > > > > But why is it better to give a hint by producing undesired output > (which > > > may actually go unnoticed for some time and produce issues down the > > > road), rather than simply by raising TypeError? > > > > You mean crash all the time? I'd be fine with that for both the str case > > and the bytes case. But's probably too late > > to change the str case, and the bytes case should mirror what str does. > > Let me add something else: str and bytes don't have to be symmetrical. > In Python 2, str and unicode were symmetrical, they allowed exactly the > same operations and were composable. > In Python 3, str and bytes are different beasts; they have different > operations *and* different semantics (for example, bytes interoperates > with bytearray and memoryview, while str doesn't). > This is also why the int type doesn't have a __bytes__ method (ignoring what passing an integer to bytes() does): it's universally defined what str(10) should return, but who knows what you want when you ask for the bytes of 10 (e.g. base-2, ASCII, UTF-16, etc.). > > So bytes formatting really needn't (and shouldn't, IMO) mirror str > formatting. > I think one of the things about Guido's proposal that bugs me is that it breaks the mental model of the .format() method from str in terms of how the mini-language works. For str.format() you have the conversion and the format spec (e.g.
"{!r}" and "{:d}", respectively). You apply the conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The format spec semantically gets passed with the object to format(), which calls the object's __format__() method: ``format(number, 'd')``. Now Guido's suggestion has two parts that affect the mini-language for .format(). One is that for bytes.format() the default conversion is bytes() instead of str(), which is fine (we would probably want to add 'b' as a conversion value as well, to be consistent). But the other bit is that the format spec goes from semantically meaning ``format(thing, format_spec)`` to ``format(thing, format_spec).encode('ascii', 'strict')`` for at least numbers. That implicitness bugs me, as I have always thought of format specs as just leading to a call to format(). I think I can live with it, though, as long as it is **consistently** applied across the board for bytes.format(): every use of a format spec leads to calling ``format(thing, format_spec).encode('ascii', 'strict')`` no matter what type 'thing' is, and it is clearly documented that this is done to ease porting and handle the common case. This even gives people in-place ASCII encoding for strings by always using '{:s}' with text, which they can do when they port their code to run under both Python 2 and 3. So you should be able to do ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII. If you want a different encoding, such as latin-1, then you need to do it explicitly and not rely on the mini-language to do tricks for you. IOW I want to treat the format mini-language as a language and thus not have any special-casing or massive shifts in meaning between str.format() and bytes.format(), so my mental model doesn't have to contort based on whether it's str or bytes. My preference is not to have any, but if Guido is going to say PBP here then I want absolute consistency across the board in how bytes.format() tweaks things.
As for %s for the % operator calling ascii(), I think that will be a porting nightmare: you have to find out why your bytes suddenly stopped being formatted properly and then crawl through all of your code for the one use of %s which is getting bytes in. By raising a TypeError you will very easily detect where your screw-up occurred thanks to the traceback; doing otherwise feels too much like implicit type conversion, and you can ask any JavaScript developer how that can be a bad thing. -Brett > > (the only reason I used "%s" in PEP 460 is to allow a migration path > from 2.x bytes-formatting to 3.x bytes-formatting; in a really "pure" > proposal it would have been called something else) > > Regards > > Antoine.
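Brett's reading of the format-spec change in this message, ``format(thing, format_spec).encode('ascii', 'strict')`` applied uniformly, can be modelled in a few lines. This is only a sketch: `bytes_format_field` is a hypothetical name, and it stands in for what a single replacement field of the discussed bytes.format() would do under that rule.

```python
def bytes_format_field(thing, format_spec=''):
    """Hypothetical model of one replacement field under the 'consistent' rule."""
    # Same call str.format() makes, followed by a strict ASCII encode.
    return format(thing, format_spec).encode('ascii', 'strict')

number_field = bytes_format_field(42, 'd')
float_field = bytes_format_field(3.14159, '.2f')
header = b'Content-Type: ' + bytes_format_field('image/jpeg', 's')

# Non-ASCII text fails loudly instead of being written out in some guessed encoding.
try:
    bytes_format_field('caf\u00e9', 's')
    non_ascii_failed = False
except UnicodeEncodeError:
    non_ascii_failed = True
```

The point of the sketch is that one rule covers numbers and text alike, so the meaning of a format spec does not depend on the argument type.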
Re: [Python-Dev] PEP 460 reboot
On 13.01.2014 18:38, Ethan Furman wrote: > On 01/13/2014 09:31 AM, Antoine Pitrou wrote: >> On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote: >>> >>> You mean crash all the time? I'd be fine with that for both the str >>> case and the bytes case. But's probably too late to change the str case, >>> and the bytes case should mirror what str does. >> >> Let me add something else: str and bytes don't have to be symmetrical. In >> Python 2, str and unicode were symmetrical, they allowed exactly the same >> operations and were composable. In Python 3, str and bytes are different >> beasts; they have different operations *and* different semantics (for >> example, bytes interoperates with bytearray and memoryview, while str >> doesn't). > > This makes sense to me. > > So I guess I'm fine with either the quoted ascii repr or the always blowing > up method, leaning towards the blowing up method. +1. Georg
Re: [Python-Dev] PEP 460 reboot
On 01/13/2014 09:31 AM, Antoine Pitrou wrote: > On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote: >> You mean crash all the time? I'd be fine with that for both the str case and the bytes case. But's probably too late to change the str case, and the bytes case should mirror what str does. > > Let me add something else: str and bytes don't have to be symmetrical. In Python 2, str and unicode were symmetrical, they allowed exactly the same operations and were composable. In Python 3, str and bytes are different beasts; they have different operations *and* different semantics (for example, bytes interoperates with bytearray and memoryview, while str doesn't). This makes sense to me. So I guess I'm fine with either the quoted ascii repr or the always blowing up method, leaning towards the blowing up method. -- ~Ethan~
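The asymmetry Antoine points to (bytes interoperates with bytearray and memoryview, while str does not mix with bytes) can be checked directly in Python 3. The variable names below are illustrative only.

```python
payload = b'ab'

# bytes combines with other bytes-like objects without any conversion step.
with_bytearray = payload + bytearray(b'cd')
with_memoryview = payload + memoryview(b'ef')
joined = b''.join([b'a', bytearray(b'b'), memoryview(b'c')])

# str does not combine with bytes at all: this raises instead of guessing.
try:
    'ab' + b'cd'
    mixing_raised = False
except TypeError:
    mixing_raised = True
```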
Re: [Python-Dev] PEP 460 reboot
On January 13, 2014 at 12:45:40 PM, R. David Murray (rdmur...@bitdance.com) wrote: [snip] > There is no use case in the sense you are asking, just like there > is no > real use case for '%s' % b'x' producing "b'x'". But the real use > case > is exactly the same: to let you know your code is screwed up without > actually blowing up with a encoding Exception. Blowing up with an encoding exception is the *only* sane method of making you aware that something is wrong. It's much better than continuing to produce broken output until someone notices. What's the point of writing a piece of software that works incorrectly without crashing? > For the record, I like Guido's logic and proposal. I don't understand > Nick's objection, since I don't see the difference between the > situation > here where a string gets interpolated into bytes as 'xxx' and > the > corresponding situation where bytes gets interpolated into > a string > as b'xxx'. Why struggle to keep bytes interpolation "pure" if > string > interpolation isn't? Isn't the whole point of this discussion to make Python 2 people who want to migrate to Python 3 happier? What's the point for them of having ported Python 2 code that produces "Status: b'42'" for "b'Status: %d' % 42"? And if you want to call str() on 42 and then encode the output in latin-1/ascii, then you're just turning Python 3 into Python 2. - Yury
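The "broken output until someone notices" failure mode Yury describes already exists on the str side in Python 3, and it is easy to reproduce. The sketch below contrasts it with a fail-loudly variant; `strict_interpolate` is a hypothetical helper invented for this illustration, not part of Python.

```python
# Silent variant: bytes interpolated into text keep their b'...' wrapper.
silent = 'Status: %s' % b'42'

def strict_interpolate(template, value):
    """Hypothetical helper: refuse to interpolate bytes into a str template."""
    if isinstance(value, (bytes, bytearray)):
        raise TypeError('cannot interpolate %s into str; decode it first'
                        % type(value).__name__)
    return template % value

# Loud variant: the mistake surfaces at the call site, with a traceback.
try:
    strict_interpolate('Status: %s', b'42')
    raised = False
except TypeError:
    raised = True

ok = strict_interpolate('Status: %s', b'42'.decode('ascii'))
```

The silent result is a perfectly valid string, which is why it can travel a long way before anyone sees that it is wrong.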
Re: [Python-Dev] PEP 460 reboot
On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou wrote: > On Sun, 12 Jan 2014 18:11:47 -0800 > Guido van Rossum wrote: > > On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote: > > > On 01/12/2014 04:47 PM, Guido van Rossum wrote: > > >> %s seems the trickiest: I think with a bytes argument it should just > > >> insert those bytes (and the padding modifiers should work too), and > > >> for other types it should probably work like %a, so that it works as > > >> expected for numeric values, and with a string argument it will return > > >> the ascii()-variant of its repr(). Examples: > > >> > > >> b'%s' % 42 == b'42' > > >> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x' > > >> enclosed in single quotes) > > > > > > I'm not sure about the quotes. Would anyone ever actually want those in > > > the > > > byte stream? > > > > Perhaps not, but it's a hint that you should probably think about an > > encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of > > it as payback time. :-) > > What is the use case for embedding a quoted ASCII-encoded representation > in a byte stream? There is no use case in the sense you are asking, just like there is no real use case for '%s' % b'x' producing "b'x'". But the real use case is exactly the same: to let you know your code is screwed up without actually blowing up with a encoding Exception. For the record, I like Guido's logic and proposal. I don't understand Nick's objection, since I don't see the difference between the situation here where a string gets interpolated into bytes as 'xxx' and the corresponding situation where bytes gets interpolated into a string as b'xxx'. Why struggle to keep bytes interpolation "pure" if string interpolation isn't? Guido's proposal makes the language more symmetric, and thus more consistent and less surprising. Exactly the hallmarks of Python's design sense, IMO. (Big surprise, right? 
:) Of course, this point of view *is* based on the idea that when you are doing interpolation using %/.format, you are in fact primarily concerned with ASCII compatible byte streams. This is a Practicality sort of argument. It is, after all, by far the most common use case when doing interpolation[*]. If you wanted to do a purist version of this symmetry, you'd have bytes(x) calling __bytes__ if it was defined and falling back to calling a __brepr__ otherwise. But what would __brepr__ implement? The variety of format codes in the struct module argues that there is no "one obvious" binary repr for most types. (Those that have one would implement __bytes__). And what would be the __brepr__ of an arbitrary 'object'? Faced with the impracticality of defining __brepr__ usefully in any "pure bytes" form, it seems sensible to admit that the most useful __brepr__ is the ascii() encoding of the __repr__. Which naturally produces 'xxx' as the __brepr__ of a string. This does cause things to get a little un-pretty when you are operating at the python prompt: >>> b'%s' % object b'""' But then again that is most likely really not what you mean to do, so it becomes a big red flag...just like b'xxx' is a small red flag when you accidentally interpolate unencoded bytes into a string. --David PS: When I first read Guido's remark that the result of interpolating a string should be 'xxx', I went Wah? I had to reason my way through to it as above, but to him it was just the natural answer. Guido isn't always right, but this kind of automatic language design consistency is one reason he's the BDFL. [*] I still think that you mostly want to design your library so that you are handling the text parts as text and the bytes parts as bytes, and encoding/gluing them as appropriate at the IO boundary. But if Guido says his real code would benefit by being able to interpolate ASCII into bytes at certain points, I'll believe him. 
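The fallback David sketches in this message (use __bytes__ when a type defines it, otherwise fall back to the ASCII form of the repr) can be written out as follows. Everything named here is hypothetical: `brepr` and `to_bytes` are illustrative helpers, `Packet` is a made-up class, and __brepr__ itself was only ever a thought experiment in this thread. The struct calls are real and show why there is no single obvious binary form for an integer.

```python
import struct

def brepr(obj):
    """Hypothetical __brepr__ stand-in: the ascii() text of obj, as ASCII bytes."""
    return ascii(obj).encode('ascii')

def to_bytes(obj):
    """Hypothetical 'purist' conversion: __bytes__ if defined, else brepr()."""
    if isinstance(obj, bytes):
        return obj
    method = getattr(type(obj), '__bytes__', None)
    if method is not None:
        return method(obj)
    return brepr(obj)

class Packet:
    """Made-up type that does have one obvious binary form."""
    def __init__(self, number):
        self.number = number
    def __bytes__(self):
        return struct.pack('>H', self.number)  # 16-bit big-endian

packet_bytes = to_bytes(Packet(10))
string_bytes = to_bytes('xxx')       # quotes included, as argued in the message
object_bytes = to_bytes(object)      # the "big red flag" case

# One integer, several equally defensible binary encodings:
little_endian_u32 = struct.pack('<I', 10)
big_endian_u16 = struct.pack('>H', 10)
ascii_digits = str(10).encode('ascii')
```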