Re: [Python-Dev] PEP 460 reboot

I am exhausted from all these discussions. I just recommend not
touching those docs.

On Tue, Jan 14, 2014 at 8:08 PM, Jim Jewett  wrote:
> On Tue, Jan 14, 2014 at 3:06 PM, Guido van Rossum  wrote:
>> Personally I wouldn't add any words suggesting or referring to the
>> option of creating another class for this purpose. You wouldn't
>> recommend subclassing dict for constraining the types of keys or
>> values, would you?
>
> Yes, and it is so clear that I suspect I'm missing some context for
> your question.
>
> Do I recommend that each individual application should create new
> concrete classes instead of just using the builtins?  No.
>
> When trying to understand (learn about) the text/binary distinction, I
> do recommend pretending that they are represented by separate classes.
>  Limits on the values in a bytearray are NOT the primary reason for
> this; the primary reason is that operations like the literal
> representation or the capitalize method are arbitrary nonsense unless
> the data happens to be representing ASCII.
>
> sound_sample.capitalize()  -- syntactically valid, but semantic garbage
> header.capitalize() -- OK, which implies that data is an instance
> of something more specific than bytes.
>
> Would I recommend subclassing dict if I wanted to constrain the key
> types?  Yes -- though MutableMapping (fewer gates to guard) or the
> upcoming TransformDict would probably be better still.
>
> The existing dict implementation itself effectively uses (hidden,
> quasi-)subclasses to restrict types of keys strictly for efficiency.
> (lookdict* variants)
>
> -jJ
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> https://mail.python.org/mailman/options/python-dev/guido%40python.org



-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Jim Jewett

On Tue, Jan 14, 2014 at 3:06 PM, Guido van Rossum  wrote:
> Personally I wouldn't add any words suggesting or referring to the
> option of creating another class for this purpose. You wouldn't
> recommend subclassing dict for constraining the types of keys or
> values, would you?

Yes, and it is so clear that I suspect I'm missing some context for
your question.

Do I recommend that each individual application should create new
concrete classes instead of just using the builtins?  No.

When trying to understand (learn about) the text/binary distinction, I
do recommend pretending that they are represented by separate classes.
 Limits on the values in a bytearray are NOT the primary reason for
this; the primary reason is that operations like the literal
representation or the capitalize method are arbitrary nonsense unless
the data happens to be representing ASCII.

sound_sample.capitalize()  -- syntactically valid, but semantic garbage
header.capitalize() -- OK, which implies that data is an instance
of something more specific than bytes.
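
A minimal sketch of the point above, using made-up values (sound_sample
and header here are illustrative byte strings, not data from any real
format). bytes.capitalize() treats each byte as an ASCII character, so it
quietly rewrites any sample byte that happens to fall in the ASCII letter
range:

```python
# Hypothetical data: four raw sample bytes and an ASCII protocol header.
sound_sample = bytes([0x61, 0x42, 0xFF, 0x00])
header = b"content-length"

# capitalize() upper-cases the first byte and lower-cases the rest,
# but only for bytes in the ASCII letter range; other bytes pass through.
mangled = sound_sample.capitalize()

print(list(sound_sample))   # [97, 66, 255, 0]
print(list(mangled))        # [65, 98, 255, 0]
print(header.capitalize())  # b'Content-length'
```

The call succeeds in both cases; only in the second does the result mean
anything.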

Would I recommend subclassing dict if I wanted to constrain the key
types?  Yes -- though MutableMapping (fewer gates to guard) or the
upcoming TransformDict would probably be better still.
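
A rough sketch of the MutableMapping route, assuming the goal is "keys
must be str". The class name StrKeyDict is hypothetical. The "fewer gates"
remark shows up in that only __setitem__ needs a check: update() and
setdefault() are inherited mixin methods that go through it.

```python
from collections.abc import MutableMapping


class StrKeyDict(MutableMapping):
    """Mapping that only accepts str keys (illustrative, not a stdlib class)."""

    def __init__(self, *args, **kwargs):
        self._data = {}
        self.update(*args, **kwargs)   # inherited; routes through __setitem__

    def __setitem__(self, key, value):
        # The single gate: every insertion path ends up here.
        if not isinstance(key, str):
            raise TypeError("keys must be str, not %s" % type(key).__name__)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


d = StrKeyDict(a=1)
d.update({"b": 2})
print(dict(d))  # {'a': 1, 'b': 2}
```

A direct dict subclass would need the same check repeated in __init__,
update() and setdefault() as well, because the builtin versions of those
methods do not call an overridden __setitem__.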

The existing dict implementation itself effectively uses (hidden,
quasi-)subclasses to restrict types of keys strictly for efficiency.
(lookdict* variants)

-jJ

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Cameron Simpson

On 14Jan2014 11:43, Jim Jewett  wrote:
> Greg Ewing replied:
> >> ... ASCII compatible binary data is a
> >> *subset* of arbitrary binary data.
> 
> I wrote: [...]
> >(2)  It *may* be worth creating a virtual
> > split in the documentation. [...]
> 
> Ethan likes the idea, but points out that the term
> "Virtual" is confusing here. [...]
> (A)  What word should I use instead of "Virtual"?
> Imaginary?  Pretend?

I'd title it in terms of a common use case, not a "virtual class".
You even phrase the opening sentence as a use case already.

> (B)  Would it be good/bad/at least make the docs
> easier to create an actual class (or alias)?
> (C)  Same question for a pair of classes provided
> only in the documentation, like example code.

I don't think so. People might use it:-(

[...]
> >  A Bytes object could represent anything, [...]

Tiny nit: shouldn't that be "bytes", not "Bytes"?

> >  appropriate as the underlying storage for a sound sample
> >  or image file.
> >
> >  Virtual subclass ASCIIStructuredBytes
> >  

Possible alternate title:

Common use case: bytes containing text sequences, especially ASCII

Cheers,
-- 
Cameron Simpson 

I think... Therefore I ride.  I ride... Therefore I am.
- Mark Pope 

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Glenn Linderman


On 1/14/2014 10:11 AM, Jim J. Jewett wrote:

Virtual subclass ASCIIStructuredBytes


You would first have to define what you meant by a virtual subclass, and 
that somewhere would have to be linked every place you use the term, 
because it is a new term.


Why not just call the sections of the documentation where 
ASCII-supporting features of bytes are discussed "Special ASCII 
support". Calling it that will make it clear that if you are not using 
ASCII, you need to be careful of using the feature... or contrariwise, 
that if you are using the feature, you need to be using ASCII.


While some ASCII supersets may also be usable with the features, I don't 
think that should be emphasized in any way, unless there is specific 
support for particular ASCII supersets. Using ASCII supersets should be 
"buyer beware".


The whole b"%s" interpolation feature would, appropriately, be described 
in such a section.

Re: [Python-Dev] PEP 460 reboot


Guido van Rossum wrote:

Quite a few people have spoken out in favor of loud
failures rather than silent "wrong" output. But I think that in the
specific context of formatting output, there is a long and IMO good
tradition of producing (slightly) wrong output in favor of more strict
behavior. Consider for example what to do when a number doesn't fit in
the given width. Would you rather raise an exception, truncate the
value, or mess up the formatting?


That depends on the context. If the output is simply a text
file whose lines can grow to accommodate the extra width,
messing up the formatting is probably okay.

If it's going into a printed report with a strictly limited
width for each column, and anything that doesn't fit is
going to get graphically clipped away, with no visual
indication that this has happened, it's NOT okay.

If it's going into a text file with defined columns for
each field, which will be read by something that assumes
certain things are in certain columns, it's NOT okay.

If it's going into a binary file as a field consisting
of a length byte followed by some chars, messing up the
formatting is DEFINITELY NOT okay.

This latter kind of situation is the one we're talking
about. If you do something like

   b"%c%s" % (len(data), data)

and data is a str, then the length byte will be correct,
but the data will be (at least) 3 bytes too long. Whatever
reads the file then gets out of step at that point, and
all hell breaks loose.

You do *not* get a nice, easy-to-debug symptom from this
kind of thing. You get "Something is wrong somewhere in
this 50 megabyte jpg file, good luck on finding out what
and why".
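
A minimal sketch of this failure mode, under stated assumptions: the
record layout (one length byte, then the payload) and the helper names
pack_field and read_fields are made up for illustration, and the
"lenient" formatter is simulated with repr(), which adds two quote bytes.
The exact surplus under any real proposal may differ.

```python
def pack_field(data: bytes) -> bytes:
    """Length byte followed by the payload (payload limited to 255 bytes)."""
    if len(data) > 255:
        raise ValueError("payload too long for a one-byte length prefix")
    return bytes([len(data)]) + data


def read_fields(buf: bytes) -> list:
    """Walk the buffer, trusting each length byte."""
    fields, pos = [], 0
    while pos < len(buf):
        n = buf[pos]
        fields.append(buf[pos + 1:pos + 1 + n])
        pos += 1 + n
    return fields


good = pack_field(b"abc") + pack_field(b"de")

# Simulated lenient formatter: correct length byte, but a quoted payload.
text = "abc"
bad = bytes([len(text)]) + repr(text).encode("ascii") + pack_field(b"de")

print(read_fields(good))  # [b'abc', b'de']
print(read_fields(bad))   # [b"'ab", b"'\x02de"]
```

No exception is raised for the bad buffer; the reader simply goes out of
step after the first field and returns garbage from there on.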

--
Greg

Re: [Python-Dev] PEP 460 reboot


Ethan Furman wrote:

On 01/14/2014 10:11 AM, Jim J. Jewett wrote:



But in terms of explaining the text model, that
separation is important enough that




 (2)  It *may* be worth creating a virtual
  split in the documentation.



I think (2) is a great idea.


I don't think it's such a great idea to belabour this
point.

The notion of an ASCIIStructuredBytes type seems to
assume that you have *either* ascii-encoded text *or*
some other kind of data. But many of the use cases
for all of this involve constructing a single object,
parts of which are one and parts of which are another.
It's hard to think of that in terms of virtual
classes unless you're willing to imagine that different
parts of the same object are of different types,
which, for a primitive object like bytes, doesn't
make sense in the context of the Python object
model.

By all means point out that the ascii features of
bytes are intended for use on data that happens to
be ascii, and shouldn't be used otherwise. But I
think that talking about "virtual classes" just
risks confusing people, particularly when we
have ABCs, which are also a kind of virtual class
represented by real class objects.

--
Greg

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Glenn Linderman


On 1/14/2014 4:46 AM, Nick Coghlan wrote:

The one remaining way I could potentially see a formatb method working
is along the lines of what Glenn (I think) suggested: just like struct
definitions, the formatb specifier would have to consist *solely* of
substitution fields. However, that's getting awfully close to being
just an alternate spelling for the struct module or bytes.join at that
point, which hardly makes for a compelling case to add two new methods
to a builtin type.


Yes, after someone drew the parallel between my "format specifier only" 
pedantry, and struct.pack (which I hadn't used), I agree that they are 
almost just different spellings for the same things.


The two differences I could see is that struct.pack doesn't support 
variable length items, and struct.pack doesn't support "interpolation", 
which is the whole beauty of the % type syntax... the ability to have a 
template, and interpolate values.


My pedantry DID allow for template work, but they had to be specified in 
HEX the way I specified it yesterday.


Let me repeat that syntax:

b"%{hex-codes}v"

That was mostly so the format string could be ASCII, yet represent any 
byte. That is somewhat clunky, when actually wanting to represent 
characters.  At the next level of abstraction, one could define a 
"format builder" that would take Unicode specifications, and "compile" 
them into the binary interpolation strings, but if doing that, you could 
just as well compile them into functions using struct.pack formats, with 
the parameters interspersed with the "template" data, except for 
struct.pack's inability to deal with variable length data.


So struct is attempting to emulate C structs, and variable length data 
is extremely awkward in C structs also, so I guess it provides a good 
emulation :)


So if I were to look for features to add to Python3 to support template 
interpolation for users of non-ASCII encodings, which could, of course, 
also be used by users of ASCII-based encodings, I guess I would recommend:


1) extend struct to handle variable length data items
2) provide a sample format compiler function that would translate a 
Unicode format description into a function that would use struct.pack, 
and pre-encode (according to the format specification) the template 
parts into parameters for struct.pack).
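
For what it is worth, variable-length items can already be approximated
with the existing struct module by building the format string at call
time. The sketch below assumes a made-up record layout (two-byte id, one
length byte, then the payload); pack_record and unpack_record are
hypothetical helpers, not part of struct.

```python
import struct


def pack_record(record_id: int, payload: bytes) -> bytes:
    # The payload length is baked into the format string for this call.
    fmt = "!HB%ds" % len(payload)
    return struct.pack(fmt, record_id, len(payload), payload)


def unpack_record(buf: bytes):
    # Read the fixed-size prefix first, then use the length it carries.
    record_id, n = struct.unpack_from("!HB", buf, 0)
    (payload,) = struct.unpack_from("!%ds" % n, buf, struct.calcsize("!HB"))
    return record_id, payload


wire = pack_record(7, b"hello")
print(wire)                 # b'\x00\x07\x05hello'
print(unpack_record(wire))  # (7, b'hello')
```

This keeps the template and the values apart, but the format has to be
rebuilt for every record, which is the awkwardness described above.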

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Terry Reedy

Let me answer you both since the issues are related.

On 1/14/2014 7:46 AM, Nick Coghlan wrote:

Guido van Rossum writes:
  > And that is precisely my point. When you're using a format string,

Bytes interpolation uses a bytes format, or a byte string if you will, 
but it should not be thought of as a character or text string. Certain 
bytes (123 and 125) delimit a replacement field. The bytes in between 
define, in my version, a format-spec after being ascii-decoded to text 
for input to 3.x format(). The decoding and subsequent encoding would 
not be needed if 2.7 format(ob, byte-spec) were available.

  > all of the format string (not just the part between { and }) had
  > better use ASCII or an ASCII superset.

I am not even sure what you mean here. The bytes outside of 123 and 125 
are simply copied to the output string. There is no encoding or 
interpretation involved.

It is true that the uninterpreted bytes had best not contain a byte pattern 
mistakenly recognized as a replacement field. I plan to refine the 
relational expression byte pattern used in byteformat to sharply reduce 
the possibility of such errors. When such errors happen anyway, an 
exception should be raised, and I plan to expand the error message to 
give more diagnostic information.

And this (rightly) constrains the output to an ASCII superset as well.

What does this mean? I suspect I disagree. The bytes interpolated into 
the output bytes can be any bytes.

Except that if you interpolate something like Shift JIS,

Bytes interpolation interpolates bytes, not encodings. A 
self-identifying byte stream starts with bytes in a known encoding that 
specifies the encoding of the rest of the stream. Neither part need be 
encoded text. (Would that something like were standard for encoded text 
streams, as well as for serialized images.)

>> [snip]

Right, that's the danger I was worried about, but the problem is that
there's at least *some* minimum level of ASCII compatibility that
needs to be assumed in order to define an interpolation format at all
(this is the point I originally missed).

I would put this slightly differently. To process bytes, we may define 
certain bytes as metabytes with a special meaning. We may choose the 
bytes that happen to be the ascii encoding of certain characters. But 
once the special numbers are chosen, they are numbers, not characters.

The problem of metabytes having both a normal and special meaning is 
similar to the problem of metacharacters having both a normal and 
special meaning.

For printf-style formatting,
it's % along with the various formatting characters and other syntax
(like digits, parentheses, variable names and "."), with the format
method it's braces, brackets, colons, variable names, etc.

It is the bytes corresponding to these characters. This is true also of 
the metabytes in an re module bytes pattern.
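
A small illustration with the re module (the data bytes are arbitrary
values chosen for the example). The pattern is written with ASCII
characters for convenience, but it matches byte values, and the bytes
between the delimiters need not be text in any encoding:

```python
import re

# Arbitrary binary data; 123 and 125 are the byte values of "{" and "}".
data = bytes([1, 2, 123, 200, 255, 125, 9])

# DOTALL so that "." also matches byte value 10.
pattern = re.compile(rb"\{(.*?)\}", re.DOTALL)
match = pattern.search(data)

print(list(b"{}"))           # [123, 125]
print(list(match.group(1)))  # [200, 255]
```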

The mini-language parser has to assume an encoding

> in order to interpret the format string,

This is where I disagree with you and Guido. Bytes processing is done 
with numbers 0 <= n <= 255, not characters. The fact that ascii 
characters can, for convenience, be used in bytes literals to indicate 
the corresponding ascii codes does not change this. A bytes parser looks 
for certain special numbers. Other numbers need not be given any 
interpretation and need not represent encoded characters.

> and that's *all* done assuming an ASCII compatible format string

Since any bytes can be regarded as an ascii-compatible latin-1 
encoded string, that seems like a vacuous assumption. In any case, I do 
not see any particular assumption in the following, other than the 
choice of replacement field delimiters.

>>> list(byteformat(bytes([1,2,10, 123, 125, 200]),
   (bytes([50, 100, 150]),)))
[1, 2, 10, 50, 100, 150, 200]
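
For readers without the byteformat function at hand, the behaviour shown
above can be approximated as follows. This is only a sketch under stated
assumptions: byteformat_sketch is a hypothetical stand-in, it supports
only empty replacement fields (byte 123 immediately followed by 125), and
it is not the implementation discussed in this thread.

```python
def byteformat_sketch(template: bytes, args: tuple) -> bytes:
    """Replace each 123..125 field with the next bytes argument."""
    out = bytearray()
    remaining = iter(args)
    i = 0
    while i < len(template):
        if template[i] == 123:                      # opening delimiter
            end = template.find(bytes([125]), i + 1)
            if end == -1:
                raise ValueError("unterminated replacement field at %d" % i)
            if end != i + 1:
                raise ValueError("format specs are not supported in this sketch")
            try:
                out += next(remaining)
            except StopIteration:
                raise ValueError("not enough arguments") from None
            i = end + 1
        else:
            out.append(template[i])                 # copied through untouched
            i += 1
    return bytes(out)


result = byteformat_sketch(bytes([1, 2, 10, 123, 125, 200]),
                           (bytes([50, 100, 150]),))
print(list(result))  # [1, 2, 10, 50, 100, 150, 200]
```

Nothing in the loop decodes anything; the only bytes given meaning are
the two delimiter values.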

> (which must make life interesting if you try to use an

ASCII incompatible coding cookie for your source code - I'm actually
not sure what the full implications of that *are* for bytes literals
in Python 3).

An interesting and important question. The Python 2 manual says that the 
coding cookie applies only to comments and strings. To me, this 
suggests that any encoding can be used. I am not sure how and when the 
encoding is applied. It suggests that the sequence of bytes resulting 
from a string literal is not determined by the sequence of characters 
comprising the string literal, but also depends on the coding cookie.

The Python 3 manual says that the coding cookie applies to the whole 
source file. To me, this says that the subset of unicode chars included 
in the encoding *must* include the ascii characters. It also suggests to 
me that the encoding must also be ascii-compatible, in order to read the 
encoding in the ascii-text coding cookie (unless there is a fallback to 
the system encoding).

In any case, a 3.x source file is decoded to unicode. When the sequence 
of unicode chars comprising a bytes literal is interpreted, the 
re

Re: [Python-Dev] PEP 460 reboot


On 01/14/2014 01:17 PM, Mark Lawrence wrote:

On 14/01/2014 20:54, Guido van Rossum wrote:

On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:


In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
% data_value[:15] everywhere.


Wow. I thought there would be some combination using %.15s but I can't
get that to work. :-(



I believe you wanted this.


>>> a='01234567890123456'
>>> len(a)
17
>>> b = '%15.15s' % a
>>> b;len(b)
'012345678901234'
15


Cool!

--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread MRAB

On 2014-01-14 20:54, Guido van Rossum wrote:

On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:

On 01/14/2014 10:52 AM, Guido van Rossum wrote:

Which reminds me. Quite a few people have spoken out in favor of loud
failures rather than silent "wrong" output. But I think that in the
specific context of formatting output, there is a long and IMO good
tradition of producing (slightly) wrong output in favor of more strict
behavior. Consider for example what to do when a number doesn't fit in
the given width. Would you rather raise an exception, truncate the
value, or mess up the formatting?

One more data point to consider:  When the binary format has strict rules on
how much space a data-point is allowed, then failure is the only appropriate
option.

Yes, that's how the struct module works.

In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
% data_value[:15] everywhere.

Wow. I thought there would be some combination using %.15s but I can't
get that to work. :-(

I'm not sure what you mean here:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> '%.15s' % string.letters
'abcdefghijklmno'
>>> len(_)
15

I'm not suggesting we change how that portion works, as it would then be, I
think, too different from both Py2 behavior as well as current str behavior,
but likewise adding in single quotes would be of no help to me.  Loud failure
so I can easily see where I forgot the .encode() would be much more helpful.

If we go with a more restricted version this makes sense indeed. The
single quotes seemed unavoidable when I was trying (like several other
proposals) to have a format code that works for all types. I think
we're rightly giving up on that now.

(I should review PEP 461, but I don't have time yet.)


Re: [Python-Dev] PEP 460 reboot


On 01/14/2014 01:15 PM, Eric V. Smith wrote:

On 1/14/2014 3:54 PM, Guido van Rossum wrote:

On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:

In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
% data_value[:15] everywhere.


Wow. I thought there would be some combination using %.15s but I can't
get that to work. :-(



>>> '%.15s' % 'abcdefghij1234567'
'abcdefghij12345'
>>> '{:.15}'.format('abcdefghij1234567')
'abcdefghij12345'




Or, depending on what you're after:


>>> '%15.15s' % 'abcde'
'          abcde'
>>> '%15.15s' % 'abcdefghij1234567'
'abcdefghij12345'


Huh.  Wish I'd known about that way back when!  ;)

--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Barry Warsaw

On Jan 14, 2014, at 10:52 AM, Guido van Rossum wrote:

>Which reminds me. Quite a few people have spoken out in favor of loud
>failures rather than silent "wrong" output. But I think that in the
>specific context of formatting output, there is a long and IMO good
>tradition of producing (slightly) wrong output in favor of more strict
>behavior.

In the email package we now have a tradition of allowing either behavior.

http://docs.python.org/3.4/library/email.policy.html#email.policy.Policy.raise_on_defect

Perhaps not appropriate for the PEP 460 related cases, but I think the policy
mechanism works great for email parsing, where sometimes you definitely want
to fail early (e.g. you are composing new messages out of literal strings) and
other times where you are willing to put up with some best-effort
representation in exchange for no exceptions being raised (e.g. you are
parsing messages being fed to you from your mail server).

-Barry

Re: [Python-Dev] PEP 460 reboot


Nick Coghlan wrote:

The
mini-language parser has to assume an encoding in order to interpret
the format string, and that's *all* done assuming an ASCII compatible
format string (which must make life interesting if you try to use an
ASCII incompatible coding cookie for your source code


I don't think it's all *that* interesting. As long as you're
able to type the relevant characters on your keyboard and
have them displayed in a recognisable way in your editor,
then what looks like b"Content-Length: %d" in your source
will end up encoded as ascii in the bytes object, whatever
the encoding of the source file.

If the source file uses an encoding that can't even represent
the formatting characters, then you're in trouble -- but
you'd have a hard time writing Python code at all in such
an environment!


It's certainly a decision that has its downsides, with the potential
impact on users of ASCII incompatible encodings (mostly in Asia) being
the main one,


I don't think it will have much impact on them, other
than maybe they will find fewer use cases for it. But the
main intended use cases are for things like http headers
which have protocol-mandated ascii-ish bits, and those
bits are still just as ascii-ish in China as they are
anywhere else.

--
Greg


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Mark Lawrence

On 14/01/2014 20:54, Guido van Rossum wrote:

On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:

In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
% data_value[:15] everywhere.

Wow. I thought there would be some combination using %.15s but I can't
get that to work. :-(

I believe you wanted this.

>>> a='01234567890123456'
>>> len(a)
17
>>> b = '%15.15s' % a
>>> b;len(b)
'012345678901234'
15

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Eric V. Smith

On 1/14/2014 3:54 PM, Guido van Rossum wrote:
> On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:
>> In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
>> % data_value[:15] everywhere.
> 
> Wow. I thought there would be some combination using %.15s but I can't
> get that to work. :-(

>>> '%.15s' % 'abcdefghij1234567'
'abcdefghij12345'
>>> '{:.15}'.format('abcdefghij1234567')
'abcdefghij12345'
>>>

Or, depending on what you're after:

>>> '%15.15s' % 'abcde'
'          abcde'
>>> '%15.15s' % 'abcdefghij1234567'
'abcdefghij12345'
>>>


> (I should review PEP 461, but I don't have time yet.)

Same here.



Re: [Python-Dev] PEP 460 reboot

On Tue, Jan 14, 2014 at 12:13 PM, Ethan Furman  wrote:
> On 01/14/2014 10:52 AM, Guido van Rossum wrote:
>>
>> Which reminds me. Quite a few people have spoken out in favor of loud
>> failures rather than silent "wrong" output. But I think that in the
>> specific context of formatting output, there is a long and IMO good
>> tradition of producing (slightly) wrong output in favor of more strict
>> behavior. Consider for example what to do when a number doesn't fit in
>> the given width. Would you rather raise an exception, truncate the
>> value, or mess up the formatting?
>
> One more data point to consider:  When the binary format has strict rules on
> how much space a data-point is allowed, then failure is the only appropriate
> option.

Yes, that's how the struct module works.

> In Py2, because '%15s' can actually take 17 characters, I have to use '%15s'
> % data_value[:15] everywhere.

Wow. I thought there would be some combination using %.15s but I can't
get that to work. :-(

> I'm not suggesting we change how that portion works, as it would then be, I
> think, too different from both Py2 behavior as well as current str behavior,
> but likewise adding in single quotes would be of no help to me.  Loud failure
> so I can easily see where I forgot the .encode() would be much more helpful.

If we go with a more restricted version this makes sense indeed. The
single quotes seemed unavoidable when I was trying (like several other
proposals) to have a format code that works for all types. I think
we're rightly giving up on that now.

(I should review PEP 461, but I don't have time yet.)

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot


On 01/14/2014 10:52 AM, Guido van Rossum wrote:


Which reminds me. Quite a few people have spoken out in favor of loud
failures rather than silent "wrong" output. But I think that in the
specific context of formatting output, there is a long and IMO good
tradition of producing (slightly) wrong output in favor of more strict
behavior. Consider for example what to do when a number doesn't fit in
the given width. Would you rather raise an exception, truncate the
value, or mess up the formatting?


One more data point to consider:  When the binary format has strict rules on how much space a data-point is allowed, 
then failure is the only appropriate option.


In Py2, because '%15s' can actually take 17 characters, I have to use '%15s' % 
data_value[:15] everywhere.

I'm not suggesting we change how that portion works, as it would then be, I think, too different from both Py2 behavior 
as well as current str behavior, but likewise adding in single quotes would be of no help to me.  Loud failure so I can 
easily see where I forgot the .encode() would be much more helpful.
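
A short sketch of the three behaviours being compared, assuming a
15-byte field. fixed_field is a hypothetical helper written for this
example, not an existing API; it takes the "loud failure" option instead
of overflowing or truncating.

```python
def fixed_field(value: bytes, width: int) -> bytes:
    """Right-justify value in exactly `width` bytes, or refuse."""
    if len(value) > width:
        raise ValueError("value needs %d bytes, field holds %d"
                         % (len(value), width))
    return value.rjust(width)


long_value = "abcdefghij1234567"     # 17 characters

print(len("%15s" % long_value))      # 17: a minimum width only, field overflows
print("%.15s" % long_value)          # abcdefghij12345: silently truncated
print(fixed_field(b"abcde", 15))     # b'          abcde'
```

Calling fixed_field(long_value.encode("ascii"), 15) raises ValueError,
which points straight at the oversized value.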


--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

On Tue, Jan 14, 2014 at 12:04 PM, Eric V. Smith  wrote:
> On 01/14/2014 01:52 PM, Guido van Rossum wrote:
>
>> But the way to arrive at this behavior without duplicating a whole lot
>> of code seems to be to call the existing text-based __format__ API and
>> convert the result to bytes -- for numbers this should be safe (their
>> formatting produces just ASCII digits and a selected few other ASCII
>> characters) but leads to an undesirable outcome for other types -- not
>> just str but also e.g. lists or dicts containing str instances, since
>> those call __repr__ on the contained items, and repr() may produce
>> non-ASCII bytes.
>
> That's why I suggested restricting the types supported. If we could live
> with just a subset of known types, then we could hard-code the
> conversions to bytes. How many types with custom __format__'s are really
> getting written to byte strings in 2.x? For that matter, are any lists,
> sets, or dicts (or anything else using object.__format__'s conversion
> using str()) really getting written to bytes? Do we need to support
> these cases?
>
> In my mind, this comes down to: are we trying to add this just to make
> porting easier? In my mind, we wouldn't even be adding this feature at all
> except for ease of porting 2.x code. So we should focus on what features
> are used in the code we're trying to port. I don't think our focus is on
> 2.x code that's using u''.format(), it's 2.x code that's been reviewed
> and is still using b''.format() because it's building up bytes for a
> wire protocol. And that code is not likely to need to format objects
> with arbitrary __format__ methods, or even str (in the 3.x sense). It's
> only likely to use numbers and bytes (or str in the 2.x sense).

Yes, these are exactly the right questions to ask.
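
To make the "restricted set of types" idea concrete, here is a rough
sketch with hard-coded conversions. The function name to_wire and the
exact set of accepted types are assumptions made for illustration, not
anything that has been agreed on.

```python
def to_wire(value) -> bytes:
    """Convert a value to bytes using only hard-coded, known-safe rules."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    if isinstance(value, (int, float)):
        # Number formatting yields ASCII digits, signs, "." and "e" only.
        return str(value).encode("ascii")
    # str, list, dict and anything with a custom __format__ are refused.
    raise TypeError("cannot interpolate %s into bytes" % type(value).__name__)


print(b"Content-Length: " + to_wire(42))  # b'Content-Length: 42'
print(to_wire(b"\xff\x00"))               # b'\xff\x00'
```

Passing a str raises TypeError, so a forgotten .encode() shows up at the
call site rather than in the output.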

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

Personally I wouldn't add any words suggesting or referring to the
option of creating another class for this purpose. You wouldn't
recommend subclassing dict for constraining the types of keys or
values, would you?

On Tue, Jan 14, 2014 at 11:43 AM, Jim J. Jewett  wrote:
>
>
>
> Greg Ewing replied:
>
>>> ... ASCII compatible binary data is a
>>> *subset* of arbitrary binary data.
>
> I wrote:
>
>> But in terms of explaining the text model, that
>> separation is important enough that
>
>>(2)  It *may* be worth creating a virtual
>> split in the documentation.
>
> (rough sketch below)
>
> Ethan likes the idea, but points out that the term
> "Virtual" is confusing here.
>
> Alas, I'm not sure what the correct term is.  In
> addition to "Go for it!" / "Don't waste your time",
> I'm looking for advice on:
>
> (A)  What word should I use instead of "Virtual"?
> Imaginary?  Pretend?
>
> (B)  Would it be good/bad/at least make the docs
> easier to create an actual class (or alias)?
>
> (C)  Same question for a pair of classes provided
> only in the documentation, like example code.
>
> (D)  What about an abstract class, or several?
>
> e.g., replacing the XXX TODO of collections.abc.ByteString
> with separate abstract classes for ByteSequence, String,
> ByteString, and ASCIIByteString?
>
> (ByteString already includes any bytes or bytearray instance,
> so backward compatibility means the String suffix isn't
> sufficient for an opt-in-by-instances class.)
>
>
> >> I'm willing to work on (2) if there is general consensus
>> that it would be a good idea.  As a rough sketch, I
>> would change places like
>>
>>  http://docs.python.org/3/library/stdtypes.html#typebytes
>>
>> from:
>>
>>  Bytes objects are immutable sequences of single bytes.
>>  Since many major binary protocols are based on the ASCII
>>  text encoding, bytes objects offer several methods that
>>  are only valid when working with ASCII compatible data
>>  and are closely related to string objects in a variety
>>  of other ways.
>>
>> to something more like:
>>
>>  Bytes objects are immutable sequences of single bytes.
>>
>>  A Bytes object could represent anything, and is
>>  appropriate as the underlying storage for a sound sample
>>  or image file.
>>
>>  Virtual subclass ASCIIStructuredBytes
>>  
>>
>>  One particularly common use of bytes is to represent
>>  the contents of a file, or of a network message.  In
>>  these cases, the bytes will often represent Text
>>  *in a specific encoding* and that encoding will usually
>>  be a superset of ASCII.  Rather than create and support
>>  an ASCIIStructuredBytes subclass, Python simply added
>>  support for these use cases straight to Bytes objects,
>>  and assumes that this support simply won't be used when
> >>  it does not make sense. For example, bytes literals
>>  *could* be used to construct a sound sample, but the
>>  literals will be far easier to read when they are used
>>  to represent (encoded) ASCII text, such as "OPEN".
>
>
> -jJ
>
> --
>
> If there are still threading problems with my replies, please
> email me with details, so that I can try to resolve them.  -jJ
>



-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Eric V. Smith

On 01/14/2014 01:52 PM, Guido van Rossum wrote:

> But the way to arrive at this behavior without duplicating a whole lot
> of code seems to be to call the existing text-based __format__ API and
> convert the result to bytes -- for numbers this should be safe (their
> formatting produces just ASCII digits and a selected few other ASCII
> characters) but leads to an undesirable outcome for other types -- not
> just str but also e.g. lists or dicts containing str instances, since
> those call __repr__ on the contained items, and repr() may produce
> non-ASCII bytes.

That's why I suggested restricting the types supported. If we could live
with just a subset of known types, then we could hard-code the
conversions to bytes. How many types with custom __format__'s are really
getting written to byte strings in 2.x? For that matter, are any lists,
sets, or dicts (or anything else using object.__format__'s conversion
using str()) really getting written to bytes? Do we need to support
these cases?

In my mind, this comes down to: are we trying to add this just to make
porting easier? In my mind, we wouldn't even be adding this feature at all
except for ease of porting 2.x code. So we should focus on what features
are used in the code we're trying to port. I don't think our focus is on
2.x code that's using u''.format(), it's 2.x code that's been reviewed
and is still using b''.format() because it's building up bytes for a
wire protocol. And that code is not likely to need to format objects
with arbitrary __format__ methods, or even str (in the 3.x sense). It's
only likely to use numbers and bytes (or str in the 2.x sense).
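
To make the "restricted set of types" idea concrete, here is a minimal sketch. The helper name to_wire_bytes and the exact set of accepted types are assumptions for illustration only, not something proposed in the PEP: bytes-like objects pass through, ints and floats go through format() plus an ASCII encode, and everything else is rejected by type.

```python
def to_wire_bytes(value):
    """Hypothetical helper: convert a small, fixed set of types to bytes."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Bytes-like input is copied through unchanged.
        return bytes(value)
    if isinstance(value, (int, float)):
        # Number formatting yields only ASCII characters, so the
        # encode step does not depend on the particular value.
        return format(value).encode('ascii')
    # str, list, dict, and custom __format__ types are rejected by type,
    # not by content.
    raise TypeError('cannot format %s as bytes' % type(value).__name__)


print(b'Status: ' + to_wire_bytes(200))                # b'Status: 200'
print(b'Connection: ' + to_wire_bytes(b'keep-alive'))  # b'Connection: keep-alive'
```

With this shape, passing a str fails every time, regardless of what the string contains, which is the property the hard-coded approach is after.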

Eric.


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Jim J. Jewett




Greg Ewing replied:

>> ... ASCII compatible binary data is a
>> *subset* of arbitrary binary data.

I wrote:

> But in terms of explaining the text model, that
> separation is important enough that

>(2)  It *may* be worth creating a virtual
> split in the documentation.

(rough sketch below)

Ethan likes the idea, but points out that the term
"Virtual" is confusing here.

Alas, I'm not sure what the correct term is.  In
addition to "Go for it!" / "Don't waste your time",
I'm looking for advice on:

(A)  What word should I use instead of "Virtual"?
Imaginary?  Pretend?

(B)  Would it be good/bad/at least make the docs
easier to create an actual class (or alias)?

(C)  Same question for a pair of classes provided
only in the documentation, like example code.

(D)  What about an abstract class, or several?

e.g., replacing the XXX TODO of collections.abc.ByteString
with separate abstract classes for ByteSequence, String,
ByteString, and ASCIIByteString?

(ByteString already includes any bytes or bytearray instance,
so backward compatibility means the String suffix isn't
sufficient for an opt-in-by-instances class.)


> I'm willing to work on (2) if there is general consensus
> that it would be a good idea.  As a rough sketch, I
> would change places like
>
>  http://docs.python.org/3/library/stdtypes.html#typebytes
>
> from:
>
>  Bytes objects are immutable sequences of single bytes.
>  Since many major binary protocols are based on the ASCII
>  text encoding, bytes objects offer several methods that
>  are only valid when working with ASCII compatible data
>  and are closely related to string objects in a variety
>  of other ways.
>
> to something more like:
>
>  Bytes objects are immutable sequences of single bytes.
>
>  A Bytes object could represent anything, and is
>  appropriate as the underlying storage for a sound sample
>  or image file.
>
>  Virtual subclass ASCIIStructuredBytes
>  
>
>  One particularly common use of bytes is to represent
>  the contents of a file, or of a network message.  In
>  these cases, the bytes will often represent Text
>  *in a specific encoding* and that encoding will usually
>  be a superset of ASCII.  Rather than create and support
>  an ASCIIStructuredBytes subclass, Python simply added
>  support for these use cases straight to Bytes objects,
>  and assumes that this support simply won't be used when
>  it does not make sense. For example, bytes literals
>  *could* be used to construct a sound sample, but the
>  literals will be far easier to read when they are used
>  to represent (encoded) ASCII text, such as "OPEN".


-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Daniel Holth

On Tue, Jan 14, 2014 at 1:52 PM, Guido van Rossum  wrote:
> On Tue, Jan 14, 2014 at 9:45 AM, Chris Barker  wrote:
>> On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov 
>> wrote:
>>>
>>>  - Try str(), and do ".encode('ascii', 'strict')" on the result.
>>
>>
>> please no -- that's the source of a lot of pain in py2 now.
>>
>> having a failure as a result of the value, rather than the type, of an
>> object just makes for hard-to-test bugs. Everything will be hunky dory for
>> development and testing, then in deployment some idiot ( ;-) ) will pass in
>> some non-ascii compatible string and you get a failure. And the person who
>> gets the failure doesn't understand why, or they wouldn't have passed in
>> non-ascii values in the first place...
>>
>> Ease of porting is nice, but let's not make it easy to port bug-prone code.
>
> Right. This is a big red flag to me as well.
>
> I think there is some inherent conflict between the extensible design
> of str.format() and the practical needs of people who are actually
> going to use formatting operations (either % or .format()) with bytes.
>
> The *practical* needs are mostly limited to supporting basic number
> formatting (decimal, hex, padding) and interpolation of anything that
> supports the buffer interface. It would also be nice if you didn't
> have to specify the type at all in the format string, i.e. {} should
> do the right thing for numbers and (all sorts of) bytes.
>
> But the way to arrive at this behavior without duplicating a whole lot
> of code seems to be to call the existing text-based __format__ API and
> convert the result to bytes -- for numbers this should be safe (their
> formatting produces just ASCII digits and a selected few other ASCII
> characters) but leads to an undesirable outcome for other types -- not
> just str but also e.g. lists or dicts containing str instances, since
> those call __repr__ on the contained items, and repr() may produce
> non-ASCII bytes.
>
> This is why my earlier proposal used ascii(), which is a "nerfed"(*)
> version of repr(). This does the right thing for numbers as well as
> for many other types (e.g. None, bool) and does something unpleasant
> for text strings that is perhaps better than the alternative.
>
> Which reminds me. Quite a few people have spoken out in favor of loud
> failures rather than silent "wrong" output. But I think that in the
> specific context of formatting output, there is a long and IMO good
> tradition of producing (slightly) wrong output in favor of more strict
> behavior. Consider for example what to do when a number doesn't fit in
> the given width. Would you rather raise an exception, truncate the
> value, or mess up the formatting? All languages newer than Fortran
> that I've used have chosen the latter, and I still agree it's a good
> idea. Similar with infinities, NaN, or None. (Yes, it's embarrassing
> to have a website displaying 'null'. But isn't a 500 even *more*
> embarrassing?)
>
> This doesn't mean I'm insensitive to the argument in favor of loud and
> early failure. It's just that I can see both sides of the coin, and
> I'm still deciding which argument is more important.
>
> (*) Gamer slang for a weapon made less dangerous. :-)

I think loud and early failure is important for porting while you
might still be trying to pound out the previously blurry encode/decode
boundaries. In this code str and bytes will be wrong everywhere. Some
APIs might return either str or bytes based on the input. Let it fail,
find the boundaries, and fix it until it does something useful without
failing. And it kind of depends on the context whether it is worse to
display weird ephemeral output or write the same weird output to long
term storage.

I'm not sure what to think about content-dependent failures on
protocols that are supposed to be ASCII-only-without-repr-noise.

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Antoine Pitrou

On Tue, 14 Jan 2014 10:52:05 -0800
Guido van Rossum  wrote:
> Would you rather raise an exception, truncate the
> value, or mess up the formatting? All languages newer than Fortran
> that I've used have chosen the latter, and I still agree it's a good
> idea.

Well that's useful when printing out human-readable stuff on stdout,
much less when you're emitting binary data that's supposed to conform
to a well-defined protocol. I expect bytes formatting to be used for
the latter, not the former.

(which also means, actually, that I don't think the fancy formatting
features - alignment, etc. - are useful at all for bytes; but it's
probably ok having them for consistency)

> Similar with infinities, NaN, or None. (Yes, it's embarrassing
> to have a website displaying 'null'. But isn't a 500 even *more*
> embarrassing?)

When it comes to type mismatch, though, an error is raised:

>>> "%d" % object()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: %d format: a number is required, not object

(instead of outputting e.g. repr(id(x)))

Regards

Antoine.


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Terry Reedy


On 1/14/2014 1:11 PM, Jim J. Jewett wrote:


But in terms of explaining the text model, that
separation is important enough that

 (1)  We should be reluctant to strengthen the
  "it's really just ASCII" messages.
 (2)  It *may* be worth creating a virtual
  split in the documentation.

I'm willing to work on (2) if there is general consensus
that it would be a good idea.  As a rough sketch, I
would change places like

 http://docs.python.org/3/library/stdtypes.html#typebytes

from:

 Bytes objects are immutable sequences of single bytes.
 Since many major binary protocols are based on the ASCII
 text encoding, bytes objects offer several methods that
 are only valid when working with ASCII compatible data
 and are closely related to string objects in a variety
 of other ways.

to something more like:

 Bytes objects are immutable sequences of single bytes.

 A Bytes object could represent anything, and is
 appropriate as the underlying storage for a sound sample
 or image file.

 Virtual subclass ASCIIStructuredBytes
 

 One particularly common use of bytes is to represent
 the contents of a file, or of a network message.  In
 these cases, the bytes will often represent Text
 *in a specific encoding* and that encoding will usually
 be a superset of ASCII.  Rather than create and support
 an ASCIIStructuredBytes subclass, Python simply added
 support for these use cases straight to Bytes objects,
 and assumes that this support simply won't be used when
 it does not make sense. For example, bytes literals
 *could* be used to construct a sound sample, but the
 literals will be far easier to read when they are used
 to represent (encoded) ASCII text, such as "OPEN".


I rather like this. Consider opening a tracker issue.

--
Terry Jan Reedy


Re: [Python-Dev] PEP 460 reboot

On Tue, Jan 14, 2014 at 9:45 AM, Chris Barker  wrote:
> On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov 
> wrote:
>>
>>  - Try str(), and do ".encode('ascii', 'strict')" on the result.
>
>
> please no -- that's the source of a lot of pain in py2 now.
>
> having a failure as a result of the value, rather than the type, of an
> object just makes for hard-to-test bugs. Everything will be hunky dory for
> development and testing, then in deployment some idiot ( ;-) ) will pass in
> some non-ascii compatible string and you get a failure. And the person who
> gets the failure doesn't understand why, or they wouldn't have passed in
> non-ascii values in the first place...
>
> Ease of porting is nice, but let's not make it easy to port bug-prone code.

Right. This is a big red flag to me as well.

I think there is some inherent conflict between the extensible design
of str.format() and the practical needs of people who are actually
going to use formatting operations (either % or .format()) with bytes.

The *practical* needs are mostly limited to supporting basic number
formatting (decimal, hex, padding) and interpolation of anything that
supports the buffer interface. It would also be nice if you didn't
have to specify the type at all in the format string, i.e. {} should
do the right thing for numbers and (all sorts of) bytes.

But the way to arrive at this behavior without duplicating a whole lot
of code seems to be to call the existing text-based __format__ API and
convert the result to bytes -- for numbers this should be safe (their
formatting produces just ASCII digits and a selected few other ASCII
characters) but leads to an undesirable outcome for other types -- not
just str but also e.g. lists or dicts containing str instances, since
those call __repr__ on the contained items, and repr() may produce
non-ASCII bytes.

This is why my earlier proposal used ascii(), which is a "nerfed"(*)
version of repr(). This does the right thing for numbers as well as
for many other types (e.g. None, bool) and does something unpleasant
for text strings that is perhaps better than the alternative.
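
A small sketch of the difference between repr() and ascii() described above. The helper name nerfed is made up for this example; it only wraps the built-in ascii() and an ASCII encode.

```python
def nerfed(value):
    """Hypothetical helper: ascii() the value, then encode the result."""
    # ascii() escapes every non-ASCII character, so this encode
    # does not fail because of the value's content.
    return ascii(value).encode('ascii')


print(nerfed(200))          # b'200'
print(nerfed(None))         # b'None'
print(nerfed('caf\u00e9'))  # b"'caf\\xe9'"

try:
    repr('caf\u00e9').encode('ascii')
except UnicodeEncodeError:
    print('repr() output is not safely ASCII-encodable')
```

The text-string result is the "unpleasant" part: the output carries quotes and backslash escapes rather than the text itself, but it never raises on content.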

Which reminds me. Quite a few people have spoken out in favor of loud
failures rather than silent "wrong" output. But I think that in the
specific context of formatting output, there is a long and IMO good
tradition of producing (slightly) wrong output in favor of more strict
behavior. Consider for example what to do when a number doesn't fit in
the given width. Would you rather raise an exception, truncate the
value, or mess up the formatting? All languages newer than Fortran
that I've used have chosen the latter, and I still agree it's a good
idea. Similar with infinities, NaN, or None. (Yes, it's embarrassing
to have a website displaying 'null'. But isn't a 500 even *more*
embarrassing?)
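
For reference, this is how existing text formatting already behaves in the cases mentioned. It is only an illustration with str formatting, not a statement about what bytes formatting would do.

```python
# A value wider than the requested field is printed in full rather than
# truncated or rejected.
too_wide = '%3d' % 12345
also_wide = '{:3d}'.format(12345)
print(repr(too_wide))    # '12345'
print(repr(also_wide))   # '12345'

# None and infinities are rendered as text instead of raising.
print(repr('%s' % None))             # 'None'
print(repr('%5.1f' % float('inf')))  # '  inf'
```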

This doesn't mean I'm insensitive to the argument in favor of loud and
early failure. It's just that I can see both sides of the coin, and
I'm still deciding which argument is more important.

(*) Gamer slang for a weapon made less dangerous. :-)

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

On 01/14/2014 10:11 AM, Jim J. Jewett wrote:

But in terms of explaining the text model, that
separation is important enough that

(2) It *may* be worth creating a virtual
split in the documentation.

I think (2) is a great idea.

I'm willing to work on (2) if there is general consensus
that it would be a good idea. As a rough sketch, I
would change places like

http://docs.python.org/3/library/stdtypes.html#typebytes

from:

Bytes objects are immutable sequences of single bytes.
Since many major binary protocols are based on the ASCII
text encoding, bytes objects offer several methods that
are only valid when working with ASCII compatible data
and are closely related to string objects in a variety
of other ways.

to something more like:

Bytes objects are immutable sequences of single bytes.

A Bytes object could represent anything, and is
appropriate as the underlying storage for a sound sample
or image file.

Virtual subclass ASCIIStructuredBytes

One particularly common use of bytes is to represent
the contents of a file, or of a network message. In
these cases, the bytes will often represent Text
*in a specific encoding* and that encoding will usually
be a superset of ASCII. Rather than create and support
an ASCIIStructuredBytes subclass, Python simply added
support for these use cases straight to Bytes objects,
and assumes that this support simply won't be used when
it does not make sense. For example, bytes literals
*could* be used to construct a sound sample, but the
literals will be far easier to read when they are used
to represent (encoded) ASCII text, such as "OPEN".

I find the Virtual subclass in the title to be confusing, but otherwise it's great. We should have that even if we do
add formatting to bytes, as that message is even more important then.

--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Chris Barker

On Tue, Jan 14, 2014 at 9:29 AM, Yury Selivanov wrote:

>  - Try str(), and do ".encode('ascii', 'strict')" on the result.
>

please no -- that's the source of a lot of pain in py2 now.

having a failure as a result of the value, rather than the type, of an
object just makes for hard-to-test bugs. Everything will be hunky dory for
development and testing, then in deployment some idiot ( ;-) ) will pass in
some non-ascii compatible string and you get a failure. And the person who
gets the failure doesn't understand why, or they wouldn't have passed in
non-ascii values in the first place...

Ease of porting is nice, but let's not make it easy to port bug-prone code.
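
A minimal sketch of the failure mode described above, assuming the fallback is literally str() followed by a strict ASCII encode. The function name implicit_ascii is hypothetical.

```python
def implicit_ascii(value):
    """Hypothetical stand-in for the proposed str() + ASCII-encode fallback."""
    return str(value).encode('ascii', 'strict')


print(implicit_ascii('world'))   # b'world'

try:
    implicit_ascii('\u0394')     # same type as above, different value
except UnicodeEncodeError as exc:
    print('fails only for this value:', exc.reason)
```

Both calls pass a str; only the second one fails. A test suite that happens to use ASCII-only inputs would not notice.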

-Chris

>
> This way *most* of the use cases of python2 will be covered without
> touching the code. So:
>
>  - b'Hello {}'.format('world')
>    will be the same as b'Hello ' + str('world').encode('ascii', 'strict')
>
>  - b'Hello {}'.format('\u0394') will throw UnicodeEncodeError
>
>  - b'Status: {}'.format(200)
>    will be the same as b'Status: ' + str(200).encode('ascii', 'strict')
>
>  - b'Hello %s' % ('world',) - the same as the first example
>
>  - b'Connection: {}'.format(b'keep-alive') - works
>
>  - b'Hello %s' % (b'\xce\x94',) - will fail, not ASCII subset we accept
>
> I think it’s OK to check the buffers for ASCII-subset only. Yes, it
> will have some sort of sub-optimal performance, but then, it’s quite
> rare when string formatting is used to concatenate huge buffers.
>
> 2. new operators {!b} and %b. These will just use '__bytes__' and
> Py_buffer.
>
> --
> Yury Selivanov
>
> On January 14, 2014 at 11:31:51 AM, Brett Cannon (br...@python.org) wrote:
> >
> > On Mon, Jan 13, 2014 at 5:14 PM, Guido van Rossum
> > wrote:
> >
> > > On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon
> > wrote:
> > > > I have been going on the assumption that bytes.format() would
> > change what
> > > > '{}' meant for itself and would only interpolate bytes. That
> > convenient
> > > > between Python 2 and 3 since it represents what we want it to
> > (str and
> > > bytes
> > > > under the hood, respectively), so it just falls through. We
> > could also
> > > add a
> > > > 'b' conversion for bytes() explicitly so as to help people
> > not
> > > accidentally
> > > > mix up things in bytes.format() and str.format(). But I was
> > not
> > > suggesting
> > > > adding a specific format spec for bytes but instead making
> > bytes.format()
> > > > just do the .encode('ascii') automatically to help with compatibility
> > > when a
> > > > format spec was present. If people want fancy formatting for
> > bytes they
> > > can
> > > > always do it themselves before calling bytes.format().
> > >
> > > This seems hastily written (e.g. verb missing :-), and I'm not
> > clear
> > > on what you are (or were) actually proposing. When exactly would
> > > bytes.format() need .encode('ascii')?
> > >
> > > I would be happy to wait a few hours or days for you to write it
> > up
> > > clearly, rather than responding in a hurry.
> >
> >
> > Sorry about that. Busy day at work + trying to stay on top of this
> > entire
> > conversation was a bit tough. Let me try to lay out what I'm suggesting
> > for
> > bytes.format() in terms of how it changes
> > http://docs.python.org/3/library/string.html#format-string-syntax
> > for bytes.
> >
> > 1. New conversion operator of 'b' that operates as PEP 460 specifies
> > (i.e.
> > tries to get a buffer, else calls __bytes__). The default conversion
> > changes from 's' to 'b'.
> > 2. Use of the conversion field adds an added step of calling
> > str.encode('ascii', 'strict') on the result returned from
> > calling
> > __format__().
> >
> > That's it. So point 1 means that the following would work in Python
> > 3.5::
> >
> > b'Hello, {}, how are you?'.format(b'Guido')
> > b'Hello, {!b}, how are you?'.format(b'Guido')
> >
> > It would produce an error if you used a text argument for 'Guido'
> > since str
> > doesn't define __bytes__ or a buffer. That gives the EIBTI group
> > their
> > bytes.format() where nothing magical happens.
> >
> > For point 2, let's say you have the following in Python 2::
> >
> > 'I have {} bottles of beer on the wall'.format(10)
> >
> > Under my proposal, how would you change it to get the same result
> > in Python
> > 2 and 3?::
> >
> > b'I have {:d} bottles of beer on the wall'.format(10)
> >
> > In Python 2 you're just being more explicit about the format,
> > otherwise
> > it's the same semantics as today. In Python 3, though, this would
> > translate
> > into (under the hood)::
> >
> > b'I have {} bottles of beer on the wall'.format(format(10,
> > 'd').encode('ascii', 'strict'))
> >
> > This leads to the same bytes value in Python 2 (since it's just
> > a string)
> > and in Python 3 (as everything accepted by bytes.format() is
> > either bytes
> > already or converted to ASCII bytes by encoding). While
> > Python 2 users
> > would need to make sure they used a format spec to get the same result
> >

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Jim J. Jewett

Nick Coghlan wrote:
>> Arbitrary binary data and ASCII  compatible binary data are *different 
>> things* and the only argument in favour of modelling them with a single 
>> type is because Python 2 did it that way.

Greg Ewing replied:

> I would say that ASCII compatible binary data is a
> *subset* of arbitrary binary data. As such, a type
> designed for arbitrary binary data is a perfectly good
> way of representing ASCII compatible binary data.

But not when you care about the ASCII-compatible part;
then you should use a subclass.

Obviously, it is too late to separate bytes from
AsciiStructuredBytes.  PBP *may* even mean that just
using the "subclass" for everything (and just
ignoring the ASCII-specific methods when they aren't
appropriate) was always the right implementation choice.
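
As a quick illustration of why the ASCII-specific methods only make sense for some bytes values (the sample values below are made up for the example):

```python
header = b'open'
sound_sample = bytes([0x61, 0x41, 0xFF])  # arbitrary sample values, not text

# Meaningful: the bytes really are ASCII text.
print(header.capitalize())        # b'Open'

# Syntactically valid, but it rewrites two of the three sample values
# as if they were letters and leaves the non-ASCII byte alone.
print(sound_sample.capitalize())  # b'Aa\xff'
```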

But in terms of explaining the text model, that
separation is important enough that

(1)  We should be reluctant to strengthen the
 "it's really just ASCII" messages.
(2)  It *may* be worth creating a virtual
 split in the documentation.

I'm willing to work on (2) if there is general consensus
that it would be a good idea.  As a rough sketch, I
would change places like

http://docs.python.org/3/library/stdtypes.html#typebytes

from:

Bytes objects are immutable sequences of single bytes.
Since many major binary protocols are based on the ASCII
text encoding, bytes objects offer several methods that
are only valid when working with ASCII compatible data
and are closely related to string objects in a variety
of other ways.

to something more like:

Bytes objects are immutable sequences of single bytes.

A Bytes object could represent anything, and is
appropriate as the underlying storage for a sound sample
or image file.

Virtual subclass ASCIIStructuredBytes

One particularly common use of bytes is to represent
the contents of a file, or of a network message.  In
these cases, the bytes will often represent Text
*in a specific encoding* and that encoding will usually
be a superset of ASCII.  Rather than create and support
an ASCIIStructuredBytes subclass, Python simply added
support for these use cases straight to Bytes objects,
and assumes that this support simply won't be used when
it does not make sense. For example, bytes literals
*could* be used to construct a sound sample, but the
literals will be far easier to read when they are used
to represent (encoded) ASCII text, such as "OPEN". 

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Yury Selivanov

On January 14, 2014 at 12:47:35 PM, Brett Cannon (br...@python.org) wrote:
>  
> On Tue, Jan 14, 2014 at 12:29 PM, Yury Selivanov wrote:  
>  
> > Brett,
> >
> >
> > I like your proposal. There is one idea I have that could,
> > perhaps, improve it:
> >
> >
> > 1. "%s" and "{}" will continue to work for bytes and bytearray in
> > the following fashion:
> >
> > - check if __bytes__/Py_buffer supported.
> > - if it is, check that the bytes are strictly in the printable  
> > ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc).
> > Throw an error if the check fails. If not - concatenate.
> > - Try str(), and do ".encode('ascii', 'strict')" on the result.
>  
>  
> >
> > This way *most* of the use cases of python2 will be covered without  
> > touching the code. So:
> >
>  
> See, I'm fine with having people update their format strings to specify a
> format spec; it's minor and isn't totally useless as it expresses what they
> mean more explicitly (e.g. "I want this to be an int, I want this to be a
> float, and I want this to be an ASCII string" using d, f, and s,
> respectively). I want people to have to make a conscious decision to fall
> back on an ASCII encoding. What you are suggesting is for people to have to
> make a conscious decision **not** to encode to ASCII implicitly, which is
> what I'm trying to avoid with this proposal. My goal is to make it easy to
> work with ASCII, but as an explicit choice, not by default.


I understand.  But OTOH, this whole discussion started because of 
the lack of convenience to work with bytes in py3, plus it’s hard
to maintain *same* codebase.  Updating the code to include new
‘%b’ operators won’t help them.

My proposal is based on the assumption, that most of the string
formatting people usually use in python2 on ‘str’ (not ‘unicode’)
is used for ascii. That’s the implicit convenience of using
bytes that everybody is looking for in py3. It allows having
single codebase, and provides the necessary safety.
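
A rough sketch of the three steps quoted above, for one interpolated value. The function name yury_field is hypothetical, "printable ASCII" is assumed here to mean bytes 0x20 through 0x7e, and the __bytes__ lookup is left out to keep the example short.

```python
def yury_field(value):
    """Hypothetical sketch: bytes-like values are checked, others are str()'d."""
    try:
        data = bytes(memoryview(value))   # supports the buffer interface?
    except TypeError:
        # Not bytes-like: fall back to str() and a strict ASCII encode.
        return str(value).encode('ascii', 'strict')
    # Assumption: "printable ASCII subset" means 0x20..0x7e.
    if any(b < 0x20 or b > 0x7E for b in data):
        raise ValueError('bytes outside the printable ASCII subset')
    return data


print(b'Hello ' + yury_field('world'))             # b'Hello world'
print(b'Status: ' + yury_field(200))               # b'Status: 200'
print(b'Connection: ' + yury_field(b'keep-alive')) # b'Connection: keep-alive'
```

Under these assumptions both yury_field('\u0394') and yury_field(b'\xce\x94') raise, which matches the examples in the quoted proposal.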

Anyways, my 2 cents.

Thank you,
Yury

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Brett Cannon

On Tue, Jan 14, 2014 at 12:29 PM, Yury Selivanov wrote:

> Brett,
>
>
> I like your proposal.  There is one idea I have that could,
> perhaps, improve it:
>
>
> 1. "%s" and "{}" will continue to work for bytes and bytearray in
> the following fashion:
>
>  - check if __bytes__/Py_buffer supported.
>  - if it is, check that the bytes are strictly in the printable
>ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc).
>Throw an error if the check fails. If not - concatenate.
>  - Try str(), and do ".encode('ascii', 'strict')" on the result.


>
> This way *most* of the use cases of python2 will be covered without
> touching the code. So:
>

See, I'm fine with having people update their format strings to specify a
format spec; it's minor and isn't totally useless as it expresses what they
mean more explicitly (e.g. "I want this to be an int, I want this to be a
float, and I want this to be an ASCII string" using d, f, and s,
respectively). I want people to have to make a conscious decision to fall
back on an ASCII encoding. What you are suggesting is for people to have to
make a conscious decision **not** to encode to ASCII implicitly, which is
what I'm trying to avoid with this proposal. My goal is to make it easy to
work with ASCII, but as an explicit choice, not by default.
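
A minimal sketch of that explicit-choice rule for a single field, based on the two points from my earlier message in this thread: with no format spec the value must be bytes-like or define __bytes__, and with a spec the result of format() is encoded to ASCII. The function name brett_field is hypothetical.

```python
def brett_field(value, spec=''):
    """Hypothetical sketch: ASCII encoding happens only when a spec is given."""
    if spec:
        # Explicit opt-in: format as text, then encode strictly to ASCII.
        return format(value, spec).encode('ascii', 'strict')
    try:
        return bytes(memoryview(value))   # buffer interface first
    except TypeError:
        pass
    to_bytes = getattr(type(value), '__bytes__', None)
    if to_bytes is None:
        raise TypeError('%s is not bytes-like' % type(value).__name__)
    return to_bytes(value)


print(b'I have ' + brett_field(10, 'd') + b' bottles')  # b'I have 10 bottles'
print(b'Hello, ' + brett_field(b'Guido'))               # b'Hello, Guido'
```

Passing the text string 'Guido' with no spec raises TypeError here, since str offers neither a buffer nor __bytes__.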

-Brett


>  - b'Hello {}'.format('world')
>    will be the same as b'Hello ' + str('world').encode('ascii', 'strict')
>
>  - b'Hello {}'.format('\u0394') will throw UnicodeEncodeError
>
>  - b'Status: {}'.format(200)
>    will be the same as b'Status: ' + str(200).encode('ascii', 'strict')
>
>  - b'Hello %s' % ('world',) - the same as the first example
>
>  - b'Connection: {}'.format(b'keep-alive') - works
>
>  - b'Hello %s' % (b'\xce\x94',) - will fail, not ASCII subset we accept
>
> I think it’s OK to check the buffers for ASCII-subset only. Yes, it
> will have some sort of sub-optimal performance, but then, it’s quite
> rare when string formatting is used to concatenate huge buffers.


> 2. new operators {!b} and %b. This ones will just use ‘__bytes__’ and
> Py_buffer.
>
> --
> Yury Selivanov
>
> On January 14, 2014 at 11:31:51 AM, Brett Cannon (br...@python.org) wrote:
> >
> > On Mon, Jan 13, 2014 at 5:14 PM, Guido van Rossum
> > wrote:
> >
> > > On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon
> > wrote:
> > > > I have been going on the assumption that bytes.format() would
> > change what
> > > > '{}' meant for itself and would only interpolate bytes. That
> > convenient
> > > > between Python 2 and 3 since it represents what we want it to
> > (str and
> > > bytes
> > > > under the hood, respectively), so it just falls through. We
> > could also
> > > add a
> > > > 'b' conversion for bytes() explicitly so as to help people
> > not
> > > accidentally
> > > > mix up things in bytes.format() and str.format(). But I was
> > not
> > > suggesting
> > > > adding a specific format spec for bytes but instead making
> > bytes.format()
> > > > just do the .encode('ascii') automatically to help with compatibility
> > > when a
> > > > format spec was present. If people want fancy formatting for
> > bytes they
> > > can
> > > > always do it themselves before calling bytes.format().
> > >
> > > This seems hastily written (e.g. verb missing :-), and I'm not
> > clear
> > > on what you are (or were) actually proposing. When exactly would
> > > bytes.format() need .encode('ascii')?
> > >
> > > I would be happy to wait a few hours or days for you to to write it
> > up
> > > clearly, rather than responding in a hurry.
> >
> >
> > Sorry about that. Busy day at work + trying to stay on top of this
> > entire
> > conversation was a bit tough. Let me try to lay out what I'm suggesting
> > for
> > bytes.format() in terms of how it changes
> > http://docs.python.org/3/library/string.html#format-string-syntax
> > for bytes.
> >
> > 1. New conversion operator of 'b' that operates as PEP 460 specifies
> > (i.e.
> > tries to get a buffer, else calls __bytes__). The default conversion
> > changes from 's' to 'b'.
> > 2. Use of the conversion field adds an added step of calling
> > str.encode('ascii', 'strict') on the result returned from
> > calling
> > __format__().
> >
> > That's it. So point 1 means that the following would work in Python
> > 3.5::
> >
> > b'Hello, {}, how are you?'.format(b'Guido')
> > b'Hello, {!b}, how are you?'.format(b'Guido')
> >
> > It would produce an error if you used a text argument for 'Guido'
> > since str
> > doesn't define __bytes__ or a buffer. That gives the EIBTI group
> > their
> > bytes.format() where nothing magical happens.
> >
> > For point 2, let's say you have the following in Python 2::
> >
> > 'I have {} bottles of beer on the wall'.format(10)
> >
> > Under my proposal, how would you change it to get the same result
> > in Python
> > 2 and 3?::
> >
> > b'I have {:d} bottles of beer on the wall'.format(10)
> >
> > In Python 2 you're just being more explicit about the format,
> > otherwise
> > it's the same sem

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Yury Selivanov

Brett,


I like your proposal.  There is one idea I have that could,
perhaps, improve it:


1. "%s" and "{}" will continue to work for bytes and bytearray in
the following fashion:

 - check if __bytes__/Py_buffer is supported.
 - if it is, check that the bytes are strictly in the printable
   ASCII-subset (a-z, A-Z, 0-9 + special symbols like ! etc).
   Throw an error if the check fails. Otherwise, concatenate.
 - if it is not, try str(), and do ".encode('ascii', 'strict')" on the result.


This way *most* of the use cases of python2 will be covered without
touching the code. So:

 - b'Hello {}'.format('world')
   will be the same as b'Hello ' + str('world').encode('ascii', 'strict')

 - b'Hello {}'.format('\u0394') will throw UnicodeEncodeError

 - b'Status: {}'.format(200)
   will be the same as b'Status: ' + str(200).encode('ascii', 'strict')

 - b'Hello %s' % ('world',) - the same as the first example

 - b'Connection: {}'.format(b'keep-alive') - works

 - b'Hello %s' % (b'\xce\x94',) - will fail: not in the ASCII subset we accept

I think it’s OK to check the buffers for ASCII-subset only. Yes, it
will have some sort of sub-optimal performance, but then, it’s quite
rare when string formatting is used to concatenate huge buffers.

2. new operators {!b} and %b. These will just use '__bytes__' and
Py_buffer.
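
The rules in point 1 can be sketched roughly as follows. This is only an
illustration of the proposed behavior, not anything that exists in CPython:
the helper name to_ascii_bytes and the exact definition of "printable ASCII"
(letters, digits, punctuation, and space) are assumptions made for the example.

```python
import string

# Assumption: "printable ASCII" means letters, digits, punctuation and space.
_ALLOWED = frozenset(
    (string.ascii_letters + string.digits + string.punctuation + " ").encode("ascii")
)


def to_ascii_bytes(value):
    """Hypothetical helper: convert one %s / {} argument to ASCII-only bytes."""
    try:
        data = bytes(memoryview(value))      # Py_buffer path
    except TypeError:
        if hasattr(type(value), "__bytes__"):
            data = bytes(value)              # __bytes__ path
        else:
            # Fallback: str() followed by a strict ASCII encode.
            return str(value).encode("ascii", "strict")
    if not _ALLOWED.issuperset(data):
        raise ValueError("bytes argument is outside the accepted ASCII subset")
    return data


print(b"Hello " + to_ascii_bytes("world"))
print(b"Status: " + to_ascii_bytes(200))
print(b"Connection: " + to_ascii_bytes(b"keep-alive"))
```

Under these assumptions a str such as '\u0394' fails in the strict encode,
and a bytes argument such as b'\xce\x94' fails the subset check.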

--  
Yury Selivanov


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Brett Cannon

On Mon, Jan 13, 2014 at 5:14 PM, Guido van Rossum  wrote:

> On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon  wrote:
> > I have been going on the assumption that bytes.format() would change what
> > '{}' meant for itself and would only interpolate bytes. That convenient
> > between Python 2 and 3 since it represents what we want it to (str and
> bytes
> > under the hood, respectively), so it just falls through. We could also
> add a
> > 'b' conversion for bytes() explicitly so as to help people not
> accidentally
> > mix up things in bytes.format() and str.format(). But I was not
> suggesting
> > adding a specific format spec for bytes but instead making bytes.format()
> > just do the .encode('ascii') automatically to help with compatibility
> when a
> > format spec was present. If people want fancy formatting for bytes they
> can
> > always do it themselves before calling bytes.format().
>
> This seems hastily written (e.g. verb missing :-), and I'm not clear
> on what you are (or were) actually proposing. When exactly would
> bytes.format() need .encode('ascii')?
>
> I would be happy to wait a few hours or days for you to write it up
> clearly, rather than responding in a hurry.

Sorry about that. Busy day at work + trying to stay on top of this entire
conversation was a bit tough. Let me try to lay out what I'm suggesting for
bytes.format() in terms of how it changes
http://docs.python.org/3/library/string.html#format-string-syntax for bytes.

1. New conversion operator of 'b' that operates as PEP 460 specifies (i.e.
tries to get a buffer, else calls __bytes__). The default conversion
changes from 's' to 'b'.
2. Use of a format spec adds a step of calling
str.encode('ascii', 'strict') on the result returned from calling
__format__().

That's it. So point 1 means that the following would work in Python 3.5::

  b'Hello, {}, how are you?'.format(b'Guido')
  b'Hello, {!b}, how are you?'.format(b'Guido')

It would produce an error if you used a text argument for 'Guido' since str
doesn't define __bytes__ or a buffer. That gives the EIBTI group their
bytes.format() where nothing magical happens.

For point 2, let's say you have the following in Python 2::

  'I have {} bottles of beer on the wall'.format(10)

Under my proposal, how would you change it to get the same result in Python
2 and 3?::

  b'I have {:d} bottles of beer on the wall'.format(10)

In Python 2 you're just being more explicit about the format, otherwise
it's the same semantics as today. In Python 3, though, this would translate
into (under the hood)::

  b'I have {} bottles of beer on the wall'.format(
      format(10, 'd').encode('ascii', 'strict'))

This leads to the same bytes value in Python 2 (since it's just a string)
and in Python 3 (as everything accepted by bytes.format() is either bytes
already or converted to ASCII bytes by encoding). While Python 2 users
would need to make sure they used a format spec to get the same result in
both Python 2 and 3 for ASCII bytes, it's a minor change which also makes
the format more explicit so it's not an inherently bad thing. And for those
that don't want to utilize the automatic ASCII encoding they can just not
use a format spec in the format string and just pass in bytes directly
(i.e. call __format__() themselves and then call str.encode() on their
own). So PBP people get to have a simple way to use bytes.format() in
Python 2 and 3 when dealing with things that can be represented as ASCII
(just as the bytes methods allow for currently).
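
As a rough illustration of the two points above, the per-field rule could be
sketched like this. The function name field_to_bytes is hypothetical, and the
code only mimics the proposed semantics in plain Python; it is not how
bytes.format() is or would be implemented.

```python
def field_to_bytes(value, format_spec=""):
    """Hypothetical per-field rule for the bytes.format() proposal above."""
    if format_spec:
        # Point 2: a format spec means __format__() plus a strict ASCII encode.
        return format(value, format_spec).encode("ascii", "strict")
    # Point 1: default 'b' conversion, i.e. buffer first, then __bytes__.
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    if hasattr(type(value), "__bytes__"):
        return bytes(value)
    raise TypeError(
        "%s supports neither the buffer protocol nor __bytes__"
        % type(value).__name__
    )


print(b"I have " + field_to_bytes(10, "d") + b" bottles of beer on the wall")
print(b"Hello, " + field_to_bytes(b"Guido") + b", how are you?")
```

With this rule a text argument and no format spec raises TypeError, which is
the "nothing magical happens" case.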

I think this covers your desire to have numbers and anything else that can
be represented as ASCII be supported for easy porting while covering my
desire that any automatic encoding is clearly explicit in the format string
and in no way special-cased for only some types (the introduction of a 'c'
converter from PEP 460 is also fine with me).

How you would want to translate this proposal to the % operator I'm not
sure, since it has been quite a while since I last seriously used it and so
I don't think I'm in a good position to propose a shift for it.

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Nick Coghlan

On 14 January 2014 19:54, Stephen J. Turnbull  wrote:
> Guido van Rossum writes:
>  > And that is precisely my point. When you're using a format string,
>  > all of the format string (not just the part between { and }) had
>  > better use ASCII or an ASCII superset. And this (rightly)
>  > constrains the output to an ASCII superset as well.
>
> Except that if you interpolate something like Shift JIS, much of the
> ASCII really isn't ASCII.  That's a general issue, of course, if you
> do something that requires iterated format strings, but it's far more
> likely to appear to work most of the time with those encodings.
>
> Of course you can say "if it hurts, don't do that", but 

Right, that's the danger I was worried about, but the problem is that
there's at least *some* minimum level of ASCII compatibility that
needs to be assumed in order to define an interpolation format at all
(this is the point I originally missed). For printf-style formatting,
it's % along with the various formatting characters and other syntax
(like digits, parentheses, variable names and "."); with the format
method it's braces, brackets, colons, variable names, etc. The
mini-language parser has to assume an encoding in order to interpret
the format string, and that's *all* done assuming an ASCII compatible
format string (which must make life interesting if you try to use an
ASCII incompatible coding cookie for your source code - I'm actually
not sure what the full implications of that *are* for bytes literals
in Python 3).

The one remaining way I could potentially see a formatb method working
is along the lines of what Glenn (I think) suggested: just like struct
definitions, the formatb specifier would have to consist *solely* of
substitution fields. However, that's getting awfully close to being
just an alternate spelling for the struct module or bytes.join at that
point, which hardly makes for a compelling case to add two new methods
to a builtin type.

Given that one of the concepts with the Python 3 transition was to
take certain problematic constructs (like ASCII compatible
interpolation directly to binary without a separate encoding step)
away and decide whether or not we were happy to live without them, I
think this one has proven to have sufficient staying power to finally
bring it back in Python 3.5 (especially given the gain in lowering the
barrier to porting Python 2 code that makes heavy use of interpolation
to ASCII compatible binary formats).

It's certainly a decision that has its downsides, with the potential
impact on users of ASCII incompatible encodings (mostly in Asia) being
the main one, but I think the increased convenience in working with
ASCII compatible binary protocols and file formats is worth the cost.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Stephen J. Turnbull

Guido van Rossum writes:

 > Of course, nobody in their right mind would use a format string
 > containing UTF-16 or EBCDIC.

How about Shift JIS and Big 5 (traditionally "mandated by Microsoft"
in their respective regions, with Shift JIS still overwhelmingly
popular) and GB* ("GB18030 is not just a good idea, It's The Law")?
Are the Japanese and Chinese crazy by definition?  This is where I get
the willies -- not that you think anybody is crazy by definition, but
because I personally have to live with people who use crazy encodings
for interoperability reasons, in fact about half the text I process
daily for work is in those encodings.

Anyway, the thought makes me shiver.  GB2312 text may be encoded as
EUC-CN, in which case it is ASCII-compatible, so no problem.  I'm not
sure if that's the encoding typically denoted by "GB2312" in email,
though, and in any case it's irrelevant as most emails claiming
"charset=GB2312" I receive nowadays include characters from the
extension repertoires of GBK or GB18030.  Shift JIS, Big 5, and GBK
manage to avoid non-ASCII-compatible use of all characters significant
in Python %-formatting (yay!), but .format is right out because {} are
used.  GB18030 in principle uses far more of the code space, including
all of the syntactically significant punctuation, but in practice I
don't know how many of those characters are actually assigned, let
alone used.

 > And that is precisely my point. When you're using a format string,
 > all of the format string (not just the part between { and }) had
 > better use ASCII or an ASCII superset. And this (rightly)
 > constrains the output to an ASCII superset as well.

Except that if you interpolate something like Shift JIS, much of the
ASCII really isn't ASCII.  That's a general issue, of course, if you
do something that requires iterated format strings, but it's far more
likely to appear to work most of the time with those encodings.

Of course you can say "if it hurts, don't do that", but 


Re: [Python-Dev] PEP 460 reboot

2014-01-14 Thread Terry Reedy


On 1/14/2014 12:03 AM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 6:25 PM, Terry Reedy  wrote:
>> >>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
>> b'\x00\x01\x02abcdef'
>>
>> re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii bias
>> is the one already present in the representation of bytes, and the fact that
>> Python code must have an ascii-compatible encoding.
>
> I don't think it's that easy. Just searching for '{' is enough to
> break in surprising ways


I see your point. The punning problem (between a byte being both itself 
and a special indicator character) is worse with bytes formats than the 
similar pun with text, and the potential for mysterious bugs greater. 
(This is related to why we split 'text' and 'bytes' to begin with.)


With text, we break the pun by doubling the character to escape the 
special meaning. This works because, 1) % and { are relatively rare in 
text, 2) %% and {{ are grammatically incorrect, 3) %, {, and especially 
%% and {{ stand out visually.


With bytes, 1) there is no reason why 37 (%) and 123 ({) should be rare, 
2) there is no grammatical rule against the sequences 37, 37 or 123, 
123, and 3) hex escapes \x25 and \x7b, which might appear in a bytes 
format, do not stand out as needing doubling.


My example above breaks if b'\x00' is replaced with b'\x7b'. Even if a 
doubling and undoubling rule were added, re.split could not be used to 
split the format bytes.
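
To make the pun concrete, here is a minimal sketch. The FIELD pattern is a
hypothetical stand-in chosen for the example; it is not claimed to be the
pattern byteformat() uses.

```python
import re

# Hex escapes are just another spelling of the same byte values.
assert b"\x25" == b"%" and b"\x7b" == b"{" and b"\x7d" == b"}"
assert list(b"%{") == [37, 123]

# Hypothetical field pattern (an assumption, not the pattern byteformat() uses).
FIELD = re.compile(rb"\{[^{}]*\}")

# Three bytes of arbitrary binary data: 123, 1, 125.
payload = bytes([123, 1, 125])
print(FIELD.findall(payload))   # the payload itself looks like a field
```

Nothing in the payload was meant as a replacement field, yet the pattern
matches it, because the byte values alone cannot say which role they play.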


--
Terry Jan Reedy


Re: [Python-Dev] PEP 460 reboot


Glenn Linderman wrote:
> A mechanism could be defined where "format string" would only contain
> format specifications, and any other text would be considered an error.


Someone already did -- it's called struct.pack(). :-)
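
For reference, a minimal struct.pack() call, where the format string consists
solely of field codes and contributes no bytes of its own. The field layout
here (a big-endian length followed by the data) is just an example chosen for
illustration.

```python
import struct

blob = b"abc"
# ">I3s": big-endian unsigned 32-bit length followed by a 3-byte string.
packed = struct.pack(">I3s", len(blob), blob)
print(packed)

# The same format string reverses the operation.
length, data = struct.unpack(">I3s", packed)
print(length, data)
```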

--
Greg

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 9:25 PM, Nick Coghlan wrote:
> since this observation makes it clear that there's *no* coherent way
> to offer a pure binary interpolation API - the only general purpose
> combination mechanism for segments of binary data that can avoid
> making assumptions about the encodings of metacharacters is simple
> concatenation.

That's almost true, and I'm glad that you, Guido, and all of us can 
understand that the currently defined python2 and python3 formatting 
syntaxes contain an inherent ASCII assumption, just like many internet 
protocols. The bitter fight is over :)


However, your statement above isn't 100% accurate, so just for the 
pedantry of it, I'll point out why. A mechanism could be defined where 
"format string" would only contain format specifications, and any other 
text would be considered an error. The format string could have an 
explicit or a defined encoding, there would be no need to make an 
assumption about its encoding. And since it would not contain text 
except for format specifications, it would only be used as a rule-book 
on how to interpret the parameters, contributing no text of its own to 
the result.


This wouldn't solve the problem at hand, though, which is to provide a 
nice migration path from Python 2 to Python 3 for code that uses 
ASCII-based format strings that do contribute text as well as include 
parameter data.


Whether such a technique would be more useful than simple concatenation 
(or complex concatenation such as join) remains to be seen, and possibly 
discussed, if anyone is interested, but it probably would belong on 
python-ideas, since it would not address an immediate porting issue.


Assuming an ASCII-in-bytes format string (but with no contributed text 
to the result) one could write something like


b"%{koi7}s%{00}v%{big5}d%{00}v%{ShiftJIS}s%{}v%b" / (
    cyrillic, len(blob), japanese, blob)


So the encodings to be applied to each of the input parameters could be 
explicitly specified.


The %{00}v stuff would be interpolated into the output... expressed in 
ASCII as hex, two characters per byte.  Note that the number uses 
Chinese digits in the big5 encoding, but I don't know if the Chinese 
even use their own digits or ASCII ones these days, or what base they 
use, I guess it was the Babylonians that used base 60 from which our 
timekeeping and angular measures were derived. The example shows a null 
byte or two between items in the output.
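
The same effect can be approximated today with a plain function instead of a
new operator. This is a sketch under assumptions: pack_fields is a
hypothetical name, each field is paired with an explicit codec name (or None
for raw bytes), and koi8_r stands in for koi7, which is not a codec Python
provides.

```python
def pack_fields(fields, separator=b"\x00"):
    """Hypothetical helper: encode each (value, encoding) pair and join them.

    An encoding of None means the value is already bytes-like and is
    passed through untouched.
    """
    chunks = []
    for value, encoding in fields:
        if encoding is None:
            chunks.append(bytes(memoryview(value)))
        else:
            chunks.append(str(value).encode(encoding, "strict"))
    return separator.join(chunks)


blob = b"\x01\x02\x03"
result = pack_fields([
    ("\u043c\u0438\u0440", "koi8_r"),    # Cyrillic text
    (len(blob), "big5"),                 # number rendered as digits
    ("\u65e5\u672c", "shift_jis"),       # Japanese text
    (blob, None),                        # raw bytes
])
print(result)
```

No part of the "specification" contributes bytes to the output except the
separator, so no encoding has to be assumed for a template.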


So there _could be_ a coherent way to offer an interpolation mechanism 
that is pure binary, and allows selection of encoding of str data, if 
and as needed.  One specifier could even be an encoding to apply to any 
format specifiers that don't include an encoding, so in the typical case 
of dealing with a single language output, the appropriate encoding could 
be set at the beginning of the format specification and overridden by 
particular specifiers if need be. But while there _could be_ such an 
interpolation mechanism, it isn't compatible with Python 2, and the jury 
hasn't decided whether such a thing is sufficiently more useful than 
concatenation to be worth implementing.  A different operator might be 
required, or the whole thing could be a function instead of an operator, 
with a similar format specification, or one more like the minilanguage 
used with format in python 3.

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan

On 14 January 2014 15:03, Guido van Rossum  wrote:
> I don't think it's that easy. Just searching for '{' is enough to
> break in surprising ways unless the format string is encoded in an
> ASCII superset. I can think of two easy examples to illustrate this
> (they're similar to the example I posted here before about the
> essential ASCII-ness of %c).
>
> First, let's consider EBCDIC. The '{' character in ASCII is hex 7B
> (decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC)
> and that is the '#' character in EBCDIC. Surprised yet?
>
> Next, let's consider UTF-16. This encoding uses two bytes per
> character (except for surrogates), so any character whose top half or
> bottom half happens to be 7B hex will cause an incorrect hit for your
> regular expression. Ouch.
>
> Of course, nobody in their right mind would use a format string
> containing UTF-16 or EBCDIC. And that is precisely my point. When
> you're using a format string, all of the format string (not just the
> part between { and }) had better use ASCII or an ASCII superset. And
> this (rightly) constrains the output to an ASCII superset as well.

In case it got lost amongst the various threads, this was the argument
that finally convinced me that interpolation *inherently* assumes an
ASCII compatible encoding: the assumption of ASCII compatibility is
embedded in the design of the formatting syntax for both printf-style
formatting and the format methods. That places interpolation support
squarely in the same category as all the other bytes methods that
inherently assume ASCII, and thus remains consistent with the Python 3
text model.

Originally I was thinking that the ASCII assumption applied only if
one of the passed in *values* needed to be implicitly encoded as
ASCII, without accounting for the fact that the parser itself assumed
ASCII compatibility when searching for formatting metacharacters. Once
Guido pointed out that oversight on my part, my objections collapsed,
since this observation makes it clear that there's *no* coherent way
to offer a pure binary interpolation API - the only general purpose
combination mechanism for segments of binary data that can avoid
making assumptions about the encodings of metacharacters is simple
concatenation.
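
A short sketch of that contrast. It assumes Python 3.5 or later, where bytes
objects support the % operator; the sample strings are arbitrary.

```python
# Concatenation never inspects its operands, so segments in unrelated
# encodings (or no encoding at all) combine without surprises.
segments = [
    "h\u00e9llo".encode("utf-16-le"),
    b"\x00\xff\x7b",
    "\u043c\u0438\u0440".encode("koi8_r"),
]
combined = b"".join(segments)

# Interpolation has to scan the template for ASCII metacharacters.
# In a UTF-16 template, the byte after '%' is 0x00 rather than 's'.
template = "Name: %s".encode("utf-16-le")
try:
    template % (b"x",)
except ValueError as exc:
    interpolation_error = exc
else:
    interpolation_error = None
print(interpolation_error)
```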

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 9:03 PM, Guido van Rossum wrote:
> Of course, nobody in their right mind would use a format string
> containing UTF-16 or EBCDIC. And that is precisely my point. When
> you're using a format string, all of the format string (not just the
> part between { and }) had better use ASCII or an ASCII superset. And
> this (rightly) constrains the output to an ASCII superset as well.

+1000

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 6:25 PM, Terry Reedy  wrote:
> On 1/13/2014 4:32 PM, Guido van Rossum wrote:
>
>> I will doggedly keep posting to this thread rather than creating more
>> threads.
>
> Please permit me to doggedly keep pointing you toward the possible solution
> I posted on the tracker last October.

You're talking about http://bugs.python.org/issue3982 right?

>> But formatb() feels absurd to me. PEP 460 has neither a precise
>> specification nor any actual examples, so I can't tell whether the
>
> Two days ago, I reposted byteformat() here on pydev with a precise text
> specification added to the code, and with an expanded test example. I have
> just added another example based on your question below.

That new example hasn't made it to my inbox yet, and I don't see
anything very recent in that issue either. But I don't think it
matters.

>> intention is that the format string can *only* contain {...} sequences
>> or whether it can also contain "regular" characters. Translating to
>> formatb(), my question comes down to the legality of the following
>> example:
>>
>>    b'Hello, {}'.formatb(name)  # Where name is some bytes object
>>
>> If this is allowed, it reintroduces the ASCII bias (since the
>> substring 'Hello' is clearly ASCII).
>
> Since byteformat() uses re to find {} replacement fields, it
> only has such ascii bias as re has, which I believe is not much, if any. As
> far as re and byteformat are concerned, everything outside of the {...}
> fields is uninterpreted bytes. As far as bytes.join is concerned, both
> joiner and joined are uninterpreted bytes.
>
> >>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
> b'\x00\x01\x02abcdef'
>
> re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii bias
> is the one already present in the representation of bytes, and the fact that
> Python code must have an ascii-compatible encoding.

I don't think it's that easy. Just searching for '{' is enough to
break in surprising ways unless the format string is encoded in an
ASCII superset. I can think of two easy examples to illustrate this
(they're similar to the example I posted here before about the
essential ASCII-ness of %c).

First, let's consider EBCDIC. The '{' character in ASCII is hex 7B
(decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC)
and that is the '#' character in EBCDIC. Surprised yet?

Next, let's consider UTF-16. This encoding uses two bytes per
character (except for surrogates), so any character whose top half or
bottom half happens to be 7B hex will cause an incorrect hit for your
regular expression. Ouch.

Of course, nobody in their right mind would use a format string
containing UTF-16 or EBCDIC. And that is precisely my point. When
you're using a format string, all of the format string (not just the
part between { and }) had better use ASCII or an ASCII superset. And
this (rightly) constrains the output to an ASCII superset as well.
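
Both examples are easy to check with the codecs that ship with Python. cp037
is used here as one representative EBCDIC code page, and U+017B is simply one
character whose UTF-16 code unit has 7B as its low byte; both choices are
made for this illustration only.

```python
# 0x7B is '{' in ASCII but '#' in EBCDIC (cp037 is one EBCDIC code page
# shipped with Python).
print(b"\x7b".decode("ascii"), b"\x7b".decode("cp037"))

# Conversely, an EBCDIC '{' is not the byte 0x7B at all.
print("{".encode("cp037"))

# UTF-16: U+017B has 0x7B as the low byte of its code unit, so a
# byte-level search for b"{" reports a hit that is not a brace.
encoded = "\u017b".encode("utf-16-le")
print(b"{" in encoded)
```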

> The advantage of
> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
> over directly writing
> b''.join([b'\x00', b'\x01', b'\x02', b'abc', b'def'])
> is that one does not have to manually split the presumably constant template
> into chunks and interleave them with the presumably variable chunks.

Yes. And that's a great feature when the output is a known encoding
that's an ASCII superset. But a terrible idea when the encoding is
unconstrained.

> Here is the example that I used for testing, including non-blank format
> specs.
>
> bformat = (b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; "
>            b"float: {:7.2f}; end")
> objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
> result = byteformat(bformat, objects)
>
> b'bytes: abc; bytearray: def; unicode: ghi; int:   123; float:   12.30; end'

No surprises here. And in fact I think this is the desired outcome.

> The additional advantage here is the automatic encoding of formatted strings
> to bytes. As posted, byteformat() uses the str.encode defaults
> (encoding='utf-8', errors='strict'). But as I said in the post, these could
> become parameters to the function that are passed on to str.encode.

As long as that encoding is an ASCII superset this might be useful.

> The design reuses re.split, bytes.join, format, and the format
> specification. By re-using the format-spec as is, the only new thing to
> learn is that blank specs correspond to bytes instead of strings. This is
> easier to design, implement, and learn than if the format-spec is limited to
> disallow some things (after much bike-shedding over what to eliminate ;-).
>
> I would appreciate your comment on this proposal.

It seems to be a bit weak on the bytes encoding -- I would like to see
an explicit format code for those (your code looks a little clever in
this area). Others will probably object that it makes it too easy to
encode text by default, although I'm not sure it matters, given that
the behavior is quite different from Python 2's broken treatment of
interpolating Unicode in an 8-bit f

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread MRAB

On 2014-01-14 02:25, Terry Reedy wrote:

On 1/13/2014 4:32 PM, Guido van Rossum wrote:

  > I will doggedly keep posting to this thread rather than creating more
threads.

Please permit me to doggedly keep pointing you toward the possible
solution I posted on the tracker last October.

But formatb() feels absurd to me. PEP 460 has neither a precise
specification nor any actual examples, so I can't tell whether the

Two days ago, I reposted byteformat() here on pydev with a precise text
specification added to the code, and with an expanded test example. I
have just added another example based on your question below.

intention is that the format string can *only* contain {...} sequences
or whether it can also contain "regular" characters. Translating to
formatb(), my question comes down to the legality of the following
example:

   b'Hello, {}'.formatb(name)  # Where name is some bytes object

If this is allowed, it reintroduces the ASCII bias (since the
substring 'Hello' is clearly ASCII).

Since byteformat() uses re to find {} replacement fields,
it only has such ascii bias as re has, which I believe is not much, if
any. As far as re and byteformat are concerned, everything outside of
the {...} fields is uninterpreted bytes. As far as bytes.join is
concerned, both joiner and joined are uninterpreted bytes.

  >>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
b'\x00\x01\x02abcdef'

[snip]
Couldn't that suffer from false positives, i.e. binary data that
happens to match? (Rare, yes, but possible.)
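
The false-positive concern can be made concrete with a small sketch. The pattern below is an assumed stand-in for whatever byteformat() uses to locate replacement fields; it is not the posted code.

```python
import re

# Hypothetical field pattern (an assumption, not the posted byteformat code).
FIELD = re.compile(rb'\{:?([^{}]*)\}')

# Four bytes of binary data; the middle two happen to be 0x7b and 0x7d.
header = bytes([0x01, 0x7b, 0x7d, 0x02])

# To the regular expression, 0x7b 0x7d is indistinguishable from b'{}',
# so the binary header is split as if it contained a replacement field.
pieces = FIELD.split(header)
print(header)   # b'\x01{}\x02'
print(pieces)   # [b'\x01', b'', b'\x02']
```

Any template built from raw binary data would therefore need such byte pairs escaped or kept out of the template.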

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 5:14 PM, Guido van Rossum wrote:

On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon  wrote:

I have been going on the assumption that bytes.format() would change what
'{}' meant for itself and would only interpolate bytes. That convenient
between Python 2 and 3 since it represents what we want it to (str and bytes
under the hood, respectively), so it just falls through. We could also add a
'b' conversion for bytes() explicitly so as to help people not accidentally
mix up things in bytes.format() and str.format(). But I was not suggesting
adding a specific format spec for bytes but instead making bytes.format()
just do the .encode('ascii') automatically to help with compatibility when a
format spec was present. If people want fancy formatting for bytes they can
always do it themselves before calling bytes.format().


This seems hastily written (e.g. verb missing :-), and I'm not clear
on what you are (or were) actually proposing. When exactly would
bytes.format() need .encode('ascii')?

I would be happy to wait a few hours or days for you to write it up
clearly, rather than responding in a hurry.


I already posted my version of this proposal, with spec and example, in 
the thread "byteformat() proposal: please critique", and I added more in 
response to your earlier post.


--
Terry Jan Reedy


Re: [Python-Dev] PEP 460 reboot

On 1/13/2014 4:32 PM, Guido van Rossum wrote:

> I will doggedly keep posting to this thread rather than creating more 
threads.

Please permit me to doggedly keep pointing you toward the possible 
solution I posted on the tracker last October.

But formatb() feels absurd to me. PEP 460 has neither a precise
specification nor any actual examples, so I can't tell whether the

Two days ago, I reposted byteformat() here on pydev with a precise text 
specification added to the code, and with an expanded test example. I 
have just added another example based on your question below.

intention is that the format string can *only* contain {...} sequences
or whether it can also contain "regular" characters. Translating to
formatb(), my question comes down to the legality of the following
example:

   b'Hello, {}'.formatb(name)  # Where name is some bytes object

If this is allowed, it reintroduces the ASCII bias (since the
substring 'Hello' is clearly ASCII).

Since byteformat() uses re to find {} replacement fields, 
it only has such ascii bias as re has, which I believe is not much, if 
any. As far as re and byteformat are concerned, everything outside of 
the {...} fields is uninterpreted bytes. As far as bytes.join is 
concerned, both joiner and joined are uninterpreted bytes.

>>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
b'\x00\x01\x02abcdef'

re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii 
bias is the one already present in the representation of bytes, and the 
fact that Python code must have an ascii-compatible encoding.

The advantage of
byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
over directly writing
b''.join([b'\x00', b'\x01', b'\x02', b'abc', b'def'])
is that one does not have to manually split the presumably constant 
template into chunks and interleave them with the presumably variable 
chunks.

Here is the example that I used for testing, including non-blank format 
specs.

bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: {:7.2f}; end"

objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
result = byteformat(bformat, objects)
>>>
b'bytes: abc; bytearray: def; unicode: ghi; int:   123; float:   12.30; end'

The additional advantage here is the automatic encoding of formatted 
strings to bytes. As posted, byteformat() uses the str.encode defaults 
(encoding='utf-8', errors='strict'). But as I said in the post, these 
could become parameters to the function that are passed on to str.encode.

The design reuses re.split, bytes.join, format, and the format 
specification. By re-using the format-spec as is, the only new thing to 
learn is that blank specs correspond to bytes instead of strings. This 
is easier to design, implement, and learn than if the format-spec is 
limited to disallow some things (after much bike-shedding over what to 
eliminate ;-).

I would appreciate your comment on this proposal.

--
Terry Jan Reedy


Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread MRAB


On 2014-01-13 21:51, Guido van Rossum wrote:

Terminology. Let's use the official terminology rather than making stuff up.

The docs at http://docs.python.org/3/library/string.html#formatspec
use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec
in that order, all optional.

Field name: either a decimal integer (referring to an argument by
position) or an identifier (by name), or omitted (uses the next
available position).

Conversion: !r, !s, !a; these apply repr(), str(), ascii() to the
value, and then the format spec applies to the resulting string.


If all you wanted to do was interpolate bytes then you could define a
new conversion !b. This would, however, mean that the format spec would
be applied to bytes.


Format spec: colon, bunch of stuff, type; the type is a letter such as
d (decimal) or s (string), and the stuff between the colon and the
type is used to specify field width, alignment, sign, padding and
such.
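
The terminology above can be inspected with the standard library's string.Formatter, whose parse() method splits a str format string into its literal text and the three parts of each replacement field. The format string used here is only an illustration.

```python
from string import Formatter

# parse() yields (literal_text, field_name, format_spec, conversion) tuples.
fields = list(Formatter().parse('x={0!r:>10} y={name:5d}'))
for literal, field_name, format_spec, conversion in fields:
    print(repr(literal), field_name, format_spec, conversion)
```

The first field has field name '0', conversion 'r' and format spec '>10'; the second has field name 'name', no conversion, and format spec '5d'.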


Also, {:b} means binary (i.e. numbers in base 2). I'm not sure what
this leaves for interpolating bytes if we don't want to use {:s}. The
docs at 
http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
don't show %b so it could still be used there, but it would be nicer
to be consistent.
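
A short check of the two facts in this paragraph, using str formatting only: the 'b' type in a format spec already means base 2, while printf-style formatting of str has no 'b' code.

```python
# In str.format, the 'b' presentation type renders an integer in base 2.
binary = '{:b}'.format(5)

# printf-style str formatting rejects 'b' as an unknown format character.
try:
    '%b' % 5
    percent_b_supported = True
except ValueError:
    percent_b_supported = False

print(binary, percent_b_supported)
```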




Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 3:13 PM, Guido van Rossum wrote:

On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon  wrote:

On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy  wrote:

I personally would not add 'bytes % whatever'.


Personally, neither would I; just focus on bytes.format() and let % operator
on strings slowly go away.


Well, % has some very strong arguments in its favor still -- for


If I shift from a 'personal' to a 'BDFL' viewpoint, I have to agree.


example, the sheer amount of code that currently uses it, the fact
that it's as close as we get to a cross-language standard, and the


This much I know.


fact that nobody wants to tackle its use in the logging module (since
logger objects are often shared between packages that don't know about
each other).


This I did not know.


Anyway, the % or .format() issue seems completely orthogonal to the
issues that get people riled up (which are mostly about whether using
either implies some kind of ASCII compatibility).


A possibly important difference between '%s' and '{:s}' is that the 's' 
is required in the former and optional in the latter. So in 
byteformat(), b'{:s}' continues to format a string (as encoded bytes) 
while b'{:}' 'formats' bytes without having to invent a new code that 
does not exist in 2.7. That particular solution to "does 's' mean bytes 
or string" does not work for % formatting. (And that lack, in turn, is 
part of what lay behind the inclination expressed above.)
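
The difference can be seen with ordinary str formatting: the type letter may be left out of a replacement field, but a % conversion without one is rejected.

```python
# In str.format the type is optional; all three spellings agree for a string.
same = '{}'.format('abc') == '{:}'.format('abc') == '{:s}'.format('abc')

# In % formatting a conversion with no type letter is an incomplete format.
try:
    '%5' % 'abc'
    type_optional = True
except ValueError:
    type_optional = False

print(same, type_optional)
```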


For % formatting, I would be inclined to start with 'what does Mercurial 
need?' or even 'does anything even really work for hg?'. Hg is part of 
our development ecosystem, and we have an hg rep who expressed a desire 
to experiment.


Terry


Re: [Python-Dev] PEP 460 reboot


Nick Coghlan wrote:
Arbitrary binary data and ASCII compatible binary data are *different 
things* and the only argument in favour of modelling them with a single 
type is because Python 2 did it that way.


I would say that ASCII compatible binary data is a
*subset* of arbitrary binary data. As such, a type
designed for arbitrary binary data is a perfectly good
way of representing ASCII compatible binary data.

What are you saying -- that there should be one type
for ASCII compatible binary data, and another type
for all binary data *except* when it's ASCII compatible?

That makes no sense to me.

The Python 3 text model was built on the notion of "no implicit encoding 
and decoding"


This is nonsense. There are plenty of implicit
encoding and decoding operations in Python 3.

When you open a text file, it gets an encoding. After
that, anything you write to it is implicitly encoded
using that encoding. There's even a default encoding
when you open the file, so you don't even have to be
explicit about that.

It's more correct to say that it was built on the
notion of using separate types for encoded and
decoded data, so that it's *possible* to keep track
of the difference. It doesn't mean that there can't
be conversions between the two types that are
implicit to one degree or another.
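
The text-file case can be shown without touching the filesystem. This sketch uses io.TextIOWrapper over an in-memory buffer; the encoding is given once, and every later write is encoded without further mention of it.

```python
import io

raw = io.BytesIO()                              # stands in for the file on disk
text = io.TextIOWrapper(raw, encoding='utf-8')  # the encoding is chosen here

text.write('caf\u00e9')   # a str goes in; no encode() call in sight
text.flush()

encoded = raw.getvalue()  # the bytes that actually reached the "file"
print(encoded)            # b'caf\xc3\xa9'
```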

--
Greg

Re: [Python-Dev] PEP 460 reboot


Nick Coghlan wrote:


so the latter would be less of 
an attractive nuisance when writing code that needs to handle arbitrary 
binary formats and can't assume ASCII compatibility.


Hang on a moment. What do you mean by code that
"handles arbitrary binary formats"?

As far as I can see, the proposed features are for
code that handles *particular* binary formats. Ones
with well-defined fields that are specified to contain
ASCII-encoded text. It's the programmer's responsibility
to make sure that the fields he's treating as ASCII
really do contain ASCII, just as it's his responsibility
to make sure he reads and writes a text file using
the correct encoding.

Now, it's possible that if you were working from an
incomplete spec and some examples, you might be
led to believe that a particular field was ASCII
when in fact it was some ASCII superset such as
latin1 or utf8. In that case, if you parsed it
assuming ASCII, you would get into trouble of
some sort with bytes greater than 127.

However, the proposed formatting operations are
concerned only with *generating* binary data, not
parsing it. Under Guido's proposed semantics, all
of the ASCII formatting operations are guaranteed
to produce valid ASCII, regardless of what types
or values are thrown at them. So as long as the
field's true encoding is something ASCII-compatible,
you will always generate valid data.

Because I *want to use* the PEP 460 binary interpolation API, but 
wouldn't be able to use Guido's more lenient proposal, as it is a bug 
magnet in the presence of arbitrary binary data.


Where exactly is this "arbitrary binary data" that you
keep talking about? The only place that arbitrary
bytes comes into the picture is through b"%s" % b"...",
and that's defined to just pass the bytes straight
through. I don't see how that could attract any
bugs that weren't already present in the data being
interpolated.

The LHS may or may not be tainted with assumptions about ASCII 
compatibility, which means it effectively *is* tainted with such 
assumptions, which means code that needs to handle arbitrary binary data 
can't use it and is left without a binary interpolation feature.


If I understand correctly, what concerns you here
is that you can't tell by looking at b"%s" % x
whether it encodes anything as ASCII without knowing
the type of x.

I'm not sure how serious a problem that would be.
Most of the time I think it will be fairly obvious
from the purpose of the code what the type of x
is *intended* to be. If it's not actually that type,
then clearly there's a bug somewhere.

Of all such possible bugs, the one most likely to
arise due to a confusion in the programmer's mind
between text and bytes would be for x to be a string
when it was meant to be bytes or vice versa.

Due to the still-very-strong separation between text
and bytes in Py3, this is unlikely to happen without
something else blowing up first.

Even if it does happen, it won't result in a data-
dependent failure. If b"%s" % 'hello' were defined to
interpolate 'hello'.encode('ascii'), then there *would*
be cause for concern. But this is not what Guido
proposes -- instead he proposes interpolating
ascii('hello') == "'hello'". This is almost certainly
*never* what the file spec calls for, so you'll find
out about it very soon one way or another.
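
The contrast between the two candidate behaviours can be checked on plain str values. The sketch does not call the proposed bytes formatting itself; it only compares ascii() with .encode('ascii').

```python
# ascii() always succeeds and always shows the quotes, for any input.
quoted = ascii('hello')
safe = ascii('h\xe9llo').encode('ascii')

# .encode('ascii') works for 'hello' but fails once non-ASCII data arrives.
plain = 'hello'.encode('ascii')
try:
    'h\xe9llo'.encode('ascii')
    data_dependent_failure = False
except UnicodeEncodeError:
    data_dependent_failure = True

print(quoted, safe, plain, data_dependent_failure)
```

The quoted result is wrong for every input, which is why the mistake shows up immediately, whereas the encode() variant is wrong only for some inputs.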

Effectively this means that b"%s" % x where x is a
string is useless, so I'd much prefer it to just
raise an exception in that case to make the failure
immediately obvious. But either way, you're not
going to end up with a latent failure waiting for
some non-ASCII data to come along before you notice
it.

To summarise, I think the idea of binary format strings
being too "tainted" for a program that does not want
to use ASCII formatting to rely on is mostly FUD.

--
Greg

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 1:59 PM, Guido van Rossum wrote:

On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman  wrote:

On 1/13/2014 12:09 PM, Guido van Rossum wrote:

Yeah, the %s behavior with a string argument was a messy attempt at
compromise. I was hoping to mimic a common use of %s in Python 2,
where it can be used with either an 8-bit string or a number as
argument, acting like %b in the former case and like %d in the latter
case. Not having %s at all in Python 3 means that porting requires
more thinking (== more opportunity for mistakes when you're converting
in bulk) and there's no easy way to write code that works in Python 2
and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding
%a for calling ascii() on the argument and then interpolating the
result after ASCII-encoding it.

If somehow (unlikely though it seems) we end up keeping %s (e.g.
strictly to ease porting), we could also keep %r as an alias for %a.


%s for strictly interpolating bytes eases porting. Sad name, but good for
compatibility. When the blowup happens, due to having a str type passed, the
porter adds the appropriate .encode(...) to the parameter, so it doesn't
blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

Lots of code uses %s with numbers too, and probably the occasional
None or list (relying on the Python 2 near-guarantee that most
objects' str() is their repr() and that repr() nearly guarantees to
return only ASCII).

E.g. I'm sure you can find live code doing something like

headers.append('Content-Length: %s\r\n' % len(body))


That's portably fixable by switching to %d... or by adding .encode('ascii')
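
A sketch of that fix, assuming body is a byte string and headers is a list of byte strings (both names come from the quoted example; their values here are made up). The formatting happens on text and the result is encoded explicitly.

```python
body = b'hello world'   # illustrative payload
headers = []

# Format as text with %d, then encode the finished header line as ASCII.
line = ('Content-Length: %d\r\n' % len(body)).encode('ascii')
headers.append(line)

print(headers)   # [b'Content-Length: 11\r\n']
```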

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Nick Coghlan

On 14 Jan 2014 04:58, "Guido van Rossum"  wrote:
>
> Let me try rebooting the reboot.
>
> My interpretation of Nick's argument is that he is asking for a bytes
> formatting language that doesn't have an implicit ASCII assumption.
>
> To me this feels absurd. The formatting codes (%s, %c) themselves are
> expressed as ASCII characters. If you include anything else in the
> format string besides formatting codes (e.g. b'<%s>'), you are giving
> it as ASCII characters. I don't know what characters the EBCDIC codes
> 37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
> it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.

Except we allow string escapes and programmatic creation of format strings,
so while ASCII snippets in formatting code are certainly easier to type,
they are by no means a mandatory feature of using interpolation operations.
I agree

Can you roll your own binary interpolation support with join() and simple
concatenation? Yes, but Antoine's proposal provides a clean and reliable
approach to flexible binary templating that isn't offered by the more
lenient version.

My problem is with telling Python users that if they're working with ASCII
compatible data, they get access to a clean interpolation mini-language for
templating purposes, but if they aren't, they don't.

That's the part I see as potentially breaking the text model: now you have
a convenient API on a core type encouraging you to treat your data as ASCII
compatible with implicit serialisation of semantic data as ASCII text, even
if that may not be appropriate.

If pure binary interpolation is added at the same time (regardless of the
exact spelling, so long as it's as easy to access as the ASCII templating),
that objection goes away.

That said, the fact that the interpolation mini-languages themselves assume
ASCII is the most compelling rationale I have heard so far for treating
interpolation as an operation that inherently assumes ASCII compatibility -
you can't use arbitrary bytes in your formatting strings without escaping
the formatting characters appropriately. While I don't see that as
substantially different to needing to escape them in order to retain them
in the output of text or ASCII formatting, it's at least a teachable
rationale for the absence of a pure binary equivalent.
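
The ASCII assumption built into the formatting codes can be illustrated with one of the EBCDIC code pages shipped with Python. The choice of cp037 is an assumption for the example; the thread does not name a code page.

```python
# The formatting codes are simply the bytes with ASCII values 37 and 115.
percent_s = bytes([37, 115])

# The same two characters encoded in an EBCDIC code page are different bytes,
# so a bytes % operation would not recognise them as a format code.
ebcdic = '%s'.encode('cp037')

print(percent_s, ebcdic, percent_s == ebcdic)
```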

> If I had some byte strings in an unknown encoding (but the same
> encoding for all) that I needed to concatenate I would never think of
> '%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)
>
> If I see some code using *any* formatting operation (regardless of
> whether it's %d, %r, %s or %c) I am going to assume that there is some
> ASCII-ness, and if there isn't, the code's author has obscured their
> goal to me.

Right, that's a rationale I can explain to people. It also occurred to me
that it's easier to build pure binary interpolation on top of ASCII
interpolation than I previously thought: I can just check all the input
values are compatible with memoryview. At that point, attempting to pass in
anything that would trigger implicit encoding at the formatting stage will
fail.

(Aside: bytes(memoryview(obj)) is also a potentially handy way to avoid the
bytes(int) trap)
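
A minimal sketch of the memoryview check mentioned here. The helper name pure_bytes is made up for the example.

```python
def pure_bytes(obj):
    """Return a bytes copy of obj, accepting only buffer exporters."""
    # memoryview() raises TypeError for str, int and other non-buffer objects.
    return bytes(memoryview(obj))

trap = bytes(3)                        # three NUL bytes, not b'3'
copied = pure_bytes(bytearray(b'ab'))  # buffer exporters pass through

for bad in (3, 'abc'):
    try:
        pure_bytes(bad)
    except TypeError as exc:
        print(type(bad).__name__, 'rejected:', exc)
```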

> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

Plus there genuinely are formats where different parts have different
encodings and you rely on metadata or format definitions to know what they
are.

I would actually suggest something like Brett's approach for %s , but with
memoryview in the mix: if the object exports a PEP 3118 buffer, interpolate
it directly, otherwise invoke normal string formatting and then do strict
ASCII encoding at the end.

That way people don't have to learn new formatting mini-languages and only
have two new behaviours to learn: buffer exporters are interpolated
directly, anything else is formatted normally and then implicitly encoded
as strict ASCII.
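
The two behaviours can be sketched for a single interpolated value. This illustrates the rule as described in this message, not the behaviour of any released Python; the function name is hypothetical.

```python
def interpolate_value(obj):
    """Sketch of the suggested %s rule for one argument."""
    try:
        # Buffer exporters (bytes, bytearray, memoryview, ...) go in as-is.
        return bytes(memoryview(obj))
    except TypeError:
        # Everything else: normal text formatting, then strict ASCII.
        return format(obj).encode('ascii', 'strict')

print(interpolate_value(b'\xff\x00'))   # b'\xff\x00'
print(interpolate_value(123))           # b'123'
```

Under this rule a str containing non-ASCII characters raises UnicodeEncodeError at formatting time instead of being written out in some implicit encoding.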

>
> In my head I make the following classification of situations where you
> work with bytes and/or text.
>
> (A) Pure binary formats (e.g. most IP-level packet formats, media
> files, .pyc files, tar/zip files, compressed data, etc.). These are
> handled using the struct module (e.g. tar/zip) and/or custom C
> extensions (e.g. gzip).
>
> (B) Encoded text. Here you should just decode everything into str
> objects and parse your text at that level. If you really want to
> manipulate the data as bytes (e.g. because you have a lot of data to
> process and very light processing) you may be a

Re: [Python-Dev] PEP 460 reboot


On Jan 13, 2014, at 5:31 PM, Donald Stufft  wrote:

> %s not accepting str is the major thing I’d personally be against.

To be more clear

b"%s" % "abc" == No
b"%s" % 123 == Fine

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA




Re: [Python-Dev] PEP 460 reboot


On Jan 13, 2014, at 5:25 PM, Eric V. Smith  wrote:

> On 1/13/2014 4:59 PM, Guido van Rossum wrote:
>> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman  
>> wrote:
>>> If somehow (unlikely though it seems) we end up keeping %s (e.g.
>>> strictly to ease porting), we could also keep %r as an alias for %a.
>>> 
>>> 
>>> %s for strictly interpolating bytes eases porting. Sad name, but good for
>>> compatibility. When the blowup happens, due to having a str type passed, the
>>> porter adds the appropriate .encode(...) to the parameter, so it doesn't
>>> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>> 
>> Lots of code uses %s with numbers too, and probably the occasional
>> None or list (relying on the Python 2 near-guarantee that most
>> objects' str() is their repr() and that repr() nearly guarantees to
>> return only ASCII).
>> 
>> E.g. I'm sure you can find live code doing something like
>> 
>> headers.append('Content-Length: %s\r\n' % len(body))
>> 
> 
> That's why I think we should support %s taking bytes, int, float. And
> make %b mean the same thing, if you want. But I think we need to keep %s
> (however limited) for compatibility with Python 2.
> 
> Personally, I'd be okay with %s not accepting str (by raising an exception).
> 
> I think that would give us a large "compatibility surface" in common
> with Python 2.

%s not accepting str is the major thing I’d personally be against. %s taking
numeric types and bytes would be fine. The main thing I’d be worried about is
where the RHS may possibly contain something non-ASCII that needs encoding
(such as the str case).


-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA




Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Eric V. Smith

On 1/13/2014 4:59 PM, Guido van Rossum wrote:
> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman  
> wrote:
>> If somehow (unlikely though it seems) we end up keeping %s (e.g.
>> strictly to ease porting), we could also keep %r as an alias for %a.
>>
>>
>> %s for strictly interpolating bytes eases porting. Sad name, but good for
>> compatibility. When the blowup happens, due to having a str type passed, the
>> porter adds the appropriate .encode(...) to the parameter, so it doesn't
>> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
> 
> Lots of code uses %s with numbers too, and probably the occasional
> None or list (relying on the Python 2 near-guarantee that most
> objects' str() is their repr() and that repr() nearly guarantees to
> return only ASCII).
> 
> E.g. I'm sure you can find live code doing something like
> 
> headers.append('Content-Length: %s\r\n' % len(body))
> 

That's why I think we should support %s taking bytes, int, float. And
make %b mean the same thing, if you want. But I think we need to keep %s
(however limited) for compatibility with Python 2.

Personally, I'd be okay with %s not accepting str (by raising an exception).

I think that would give us a large "compatibility surface" in common
with Python 2.
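
The rule suggested in this message can be written down as a sketch for one argument. It illustrates the proposal only and does not describe what any Python release does; the function name is hypothetical.

```python
def percent_s_rule(value):
    """Sketch: %s accepts bytes, int and float, and rejects str."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)                 # interpolated unchanged
    if isinstance(value, (int, float)):
        return str(value).encode('ascii')   # numbers render as ASCII digits
    raise TypeError('%s does not accept ' + type(value).__name__)

print(percent_s_rule(b'abc'), percent_s_rule(123), percent_s_rule(1.5))
```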

Eric.



Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 2:05 PM, Brett Cannon  wrote:
> I have been going on the assumption that bytes.format() would change what
> '{}' meant for itself and would only interpolate bytes. That convenient
> between Python 2 and 3 since it represents what we want it to (str and bytes
> under the hood, respectively), so it just falls through. We could also add a
> 'b' conversion for bytes() explicitly so as to help people not accidentally
> mix up things in bytes.format() and str.format(). But I was not suggesting
> adding a specific format spec for bytes but instead making bytes.format()
> just do the .encode('ascii') automatically to help with compatibility when a
> format spec was present. If people want fancy formatting for bytes they can
> always do it themselves before calling bytes.format().

This seems hastily written (e.g. verb missing :-), and I'm not clear
on what you are (or were) actually proposing. When exactly would
bytes.format() need .encode('ascii')?

I would be happy to wait a few hours or days for you to write it up
clearly, rather than responding in a hurry.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou

On Mon, 13 Jan 2014 13:56:44 -0800
Guido van Rossum  wrote:
> On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou  wrote:
> > On Mon, 13 Jan 2014 13:32:28 -0800
> > Guido van Rossum  wrote:
> >>
> >> But formatb() feels absurd to me. PEP 460 has neither a precise
> >> specification or any actual examples, so I can't tell whether the
> >> intention is that the format string can *only* contain {...} sequences
> >> or whether it can also contain "regular" characters. Translating to
> >> formatb(), my question comes down to the legality of the following
> >> example:
> >>
> >>   b'Hello, {}'.formatb(name)  # Where name is some bytes object
> >
> > Yes, it's allowed. But so is:
> >
> >   b'\xff\x00{}\x85{}'.formatb(payload, trailer)
> >
> > The ASCII bias is because of the bytes literal notation.
> 
> But it is nevertheless there. Including arbitrary hex bytes in the
> ASCII range should be a liability, unless you have memorized the hex
> codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

That's a good point. I hadn't really thought about that.

> The above example (is it from a real protocol?)

(no, it's cooked up)

> would be just as clear
> or clearer written as
> 
> b'\xff\x00' + payload + b'\x85' + trailer
> 
> or
> 
> b''.join([b'\xff\x00', payload, b'\x85', trailer])
> 
> and reasoning about those versions requires no understanding of ASCII.

Fair enough.
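
The two alternatives from the quoted text, run side by side. The payload and trailer values are made up for the example.

```python
payload = b'\x01\x02'   # illustrative values
trailer = b'\x03'

by_concat = b'\xff\x00' + payload + b'\x85' + trailer
by_join = b''.join([b'\xff\x00', payload, b'\x85', trailer])
print(by_concat == by_join)

# The hex escapes mentioned above are ordinary ASCII characters in disguise.
print(b'\x25' == b'%', b'\x7b' == b'{')
```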

Regards

Antoine.

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 4:36 PM, Ethan Furman  wrote:

> On 01/13/2014 01:20 PM, Mark Lawrence wrote:
>
>> On 13/01/2014 21:01, Paul Moore wrote:
>>
>>>
>>> I think this should be for 3.5, and should not involve an accelerated
>>> release of 3.5 - we should get it into the 3.5 code early and let
>>> people thrash out the details during the 3.5 release cycle.
>>>
>>
>> I disagree, it should be on pypi now so people can start trying it out,
>> or as others have suggested incorporate it into
>> the six module.  Surely that'd make the job of getting it into 3.5 far
>> easier?
>>
>
> It's a bit harder to put a core feature on PyPI.  I'm not even sure how it
> would be done.  Fortunately, once it is in 3.5 trunk the adventurous can
> build their own and try it out that way.
>

You make it a function that under Python 2 and < 3.5 does what needs to be
done and on 3.5 just directly calls the underlying method. People will
still have to change their code, but the idea is it becomes a refactoring
instead of a change in how the code is structured.

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 4:59 PM, Guido van Rossum  wrote:
> On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman  
> wrote:
>> On 1/13/2014 12:09 PM, Guido van Rossum wrote:
>>
>> Yeah, the %s behavior with a string argument was a messy attempt at
>> compromise. I was hoping to mimic a common use of %s in Python 2,
>> where it can be used with either an 8-bit string or a number as
>> argument, acting like %b in the former case and like %d in the latter
>> case. Not having %s at all in Python 3 means that porting requires
>> more thinking (== more opportunity for mistakes when you're converting
>> in bulk) and there's no easy way to write code that works in Python 2
>> and 3.
>>
>> If we have %b for strictly interpolating bytes, I'm fine with adding
>> %a for calling ascii() on the argument and then interpolating the
>> result after ASCII-encoding it.
>>
>> If somehow (unlikely though it seems) we end up keeping %s (e.g.
>> strictly to ease porting), we could also keep %r as an alias for %a.
>>
>>
>> %s for strictly interpolating bytes eases porting. Sad name, but good for
>> compatibility. When the blowup happens, due to having a str type passed, the
>> porter adds the appropriate .encode(...) to the parameter, so it doesn't
>> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?
>
> Lots of code uses %s with numbers too, and probably the occasional
> None or list (relying on the Python 2 near-guarantee that most
> objects' str() is their repr() and that repr() nearly guarantees to
> return only ASCII).
>
> E.g. I'm sure you can find live code doing something like
>
> headers.append('Content-Length: %s\r\n' % len(body))

But if the alternative is spurious quotes then the choice is clear...

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 4:51 PM, Guido van Rossum  wrote:

> Terminology. Let's use the official terminology rather than making stuff
> up.
>
> The docs at http://docs.python.org/3/library/string.html#formatspec
> use the following terminology:
>
> Replacement field: {...}; contains field name, conversion, format spec
> in that order, all optional.
>
> Field name: either a decimal integer (referring to an argument by
> position) or an identifier (by name), or omitted (uses the next
> available position).
>
> Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
> value, and then the format spec applies to the resulting string.
>
> Format spec: colon, bunch of stuff, type; the type is a letter such as
> d (decimal) or s (string), and the stuff between the colon and the
> type is used to specify field width, alignment, sign, padding and
> such.
>
>
> Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
> this leaves for interpolating bytes if we don't want to use {:s}. The
> docs at
> http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
> don't show %b so it could still be used there, but it would be nicer
> to be consistent.

I have been going on the assumption that bytes.format() would change what
'{}' meant for itself and would only interpolate bytes. That is convenient
between Python 2 and 3, since '{}' represents what we want in each (str and
bytes under the hood, respectively), so it just falls through. We could
also add a 'b' conversion for bytes() explicitly so as to help people not
accidentally mix up things in bytes.format() and str.format(). But I was
not suggesting adding a specific format spec for bytes but instead making
bytes.format() just do the .encode('ascii') automatically to help with
compatibility when a format spec was present. If people want fancy
formatting for bytes they can always do it themselves before calling
bytes.format().
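
A rough sketch of that rule for a single replacement field, under the assumption that a bare '{}' accepts only bytes-like values and any format spec goes through format() followed by a strict ASCII encode. The function name format_field is hypothetical; no such bytes.format() hook exists.

```python
def format_field(value, format_spec=""):
    """Render one replacement field to bytes (illustrative only)."""
    if format_spec:
        # A format spec was given: format as text, then ASCII-encode strictly.
        return format(value, format_spec).encode("ascii", "strict")
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Bare '{}' interpolates bytes-like values unchanged.
        return bytes(value)
    raise TypeError("bare field needs bytes, got %s" % type(value).__name__)


raw = format_field(b"image/jpeg")
number = format_field(42, "d")
text = format_field("image/jpeg", "s")
padded_hex = format_field(255, "04x")
```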

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 1:29 PM, Glenn Linderman  wrote:
> On 1/13/2014 12:09 PM, Guido van Rossum wrote:
>
> Yeah, the %s behavior with a string argument was a messy attempt at
> compromise. I was hoping to mimick a common use of %s in Python 2,
> where it can be used with either an 8-bit string or a number as
> argument, acting like %b in the former case and like %d in the latter
> case. Not having %s at all in Python 3 means that porting requires
> more thinking (== more opportunity for mistakes when you're converting
> in bulk) and there's no easy way to write code that works in Python 2
> and 3.
>
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
>
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.
>
>
> %s for strictly interpolating bytes eases porting. Sad name, but good for
> compatibility. When the blowup happens, due to having a str type passed, the
> porter adds the appropriate .encode(...) to the parameter, so it doesn't
> blow up on Py 3, and it'll be OK for Py 2 as well, will it not?

Lots of code uses %s with numbers too, and probably the occasional
None or list (relying on the Python 2 near-guarantee that most
objects' str() is their repr() and that repr() nearly guarantees to
return only ASCII).

E.g. I'm sure you can find live code doing something like

headers.append('Content-Length: %s\r\n' % len(body))
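
For comparison, a small sketch of the same header built as text and as bytes. The bytes line assumes an interpreter where % is defined on bytes (Python 3.5 or later, or Python 2); the body value is made up for the example.

```python
body = b"hello world"

# Text template: %s accepts a number and formats it via str().
text_header = "Content-Length: %s\r\n" % len(body)

# Bytes template: %d formats the number into ASCII digits.
headers = []
headers.append(b"Content-Length: %d\r\n" % len(body))
```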

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 1:40 PM, Antoine Pitrou  wrote:
> On Mon, 13 Jan 2014 13:32:28 -0800
> Guido van Rossum  wrote:
>>
>> But formatb() feels absurd to me. PEP 460 has neither a precise
>> specification nor any actual examples, so I can't tell whether the
>> intention is that the format string can *only* contain {...} sequences
>> or whether it can also contain "regular" characters. Translating to
>> formatb(), my question comes down to the legality of the following
>> example:
>>
>>   b'Hello, {}'.formatb(name)  # Where name is some bytes object
>
> Yes, it's allowed. But so is:
>
>   b'\xff\x00{}\x85{}'.formatb(payload, trailer)
>
> The ASCII bias is because of the bytes literal notation.

But it is nevertheless there. Including arbitrary hex bytes in the
ASCII range should be a liability, unless you have memorized the hex
codes for ASCII and know that e.g. '\x25' is '%' and '\x7b' is '{'.

The above example (is it from a real protocol?) would be just as clear
or clearer written as

b'\xff\x00' + payload + b'\x85' + trailer

or

b''.join([b'\xff\x00', payload, b'\x85', trailer])

and reasoning about those versions requires no understanding of ASCII.
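
A quick check of both points, with made-up payload and trailer values: the two spellings build the same bytes, and hex escapes in the ASCII range are the very characters a format string would treat as special.

```python
payload = b"\x01\x02"
trailer = b"\x03"

by_concat = b"\xff\x00" + payload + b"\x85" + trailer
by_join = b"".join([b"\xff\x00", payload, b"\x85", trailer])

# Hex escapes in the ASCII range are the same bytes as the printable
# characters, so a template containing them would be parsed as markup.
percent_is_x25 = b"\x25" == b"%"
brace_is_x7b = b"\x7b" == b"{"
```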

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot


On 01/13/2014 01:20 PM, Mark Lawrence wrote:
> On 13/01/2014 21:01, Paul Moore wrote:
>>
>> I think this should be for 3.5, and should not involve an accelerated
>> release of 3.5 - we should get it into the 3.5 code early and let
>> people thrash out the details during the 3.5 release cycle.
>
> I disagree, it should be on pypi now so people can start trying it out,
> or as others have suggested incorporate it into the six module.  Surely
> that'd make the job of getting it into 3.5 far easier?

It's a bit harder to put a core feature on PyPI.  I'm not even sure how it
would be done.  Fortunately, once it is in 3.5 trunk the adventurous can
build their own and try it out that way.


--
~Ethan~

Re: [Python-Dev] PEP 460 reboot


Nick Coghlan wrote:
> By allowing format characters that *do* assume ASCII, the entire
> construct is rendered unsafe - you have to look inside the format
> string to determine if it is assuming ASCII compatibility or not, thus
> the entire construct must be deemed as assuming ASCII compatibility at
> the level of static semantic analysis.


I don't see how any of the currently proposed formatting
operations make a data-dependent ASCII assumption.

When you write b"%d" % x, you're
not assuming that x is ASCII, you're assuming that it's
an *integer*. The %d conversion of an integer is defined
to produce only ASCII characters, and it works on any
integer, so there's no data-dependent assumption there.

Something that *would* involve such an assumption would
be if b"%s" % 'hello' were defined to encode 'hello' as
ASCII. But Guido has proposed not doing that, and instead
interpolating ascii('hello'). Since ascii() is defined to
return only ASCII characters, and works on any string,
there is again no data-dependent assumption.

My preference would be for b"%s" % 'hello' to raise an
exception, but that would still be data-independent.

As for having to look inside the format string to know
what types are expected, that's no different from any
other formatting operation. All it means is that static
type analysis in Python is hard, but we already knew
that.
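
The data-independence claim can be checked with plain str operations. This is only a sketch: the helper ascii_bytes is a hypothetical stand-in for the ascii()-then-encode step discussed in the thread, and the sample values are arbitrary.

```python
def ascii_bytes(value):
    """ascii() of any object, encoded to bytes; the encode cannot fail."""
    return ascii(value).encode("ascii")


samples = [0, -17, 2 ** 64, "hello", "h\u00e9llo", "\U0001F600", None, [1, "x"]]

# Every result is pure ASCII, whatever the input was.
all_ascii = all(max(ascii_bytes(v)) < 128 for v in samples)

# %d depends only on the argument being an integer, not on its value.
digits = [("%d" % n).encode("ascii") for n in (0, -17, 2 ** 64)]
```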


> Allowing these ASCII assuming format codes in the core bytes
> interpolation introduces *exactly* the same problem as is present in
> the Python 2 text model: code that *appears* to support arbitrary
> binary data, but is in fact assuming ASCII compatibility.


Can you provide an example of code using Guido's
currently approved formatting semantics that would
fail when given arbitrary binary data? I don't see
how it can happen.

--
Greg

Re: [Python-Dev] PEP 460 reboot

Terminology. Let's use the official terminology rather than making stuff up.

The docs at http://docs.python.org/3/library/string.html#formatspec
use the following terminology:

Replacement field: {...}; contains field name, conversion, format spec
in that order, all optional.

Field name: either a decimal integer (referring to an argument by
position) or an identifier (by name), or omitted (uses the next
available position).

Conversion: !r, !s, !a; these refer to repr(), str(), ascii() to the
value, and then the format spec applies to the resulting string.

Format spec: colon, bunch of stuff, type; the type is a letter such as
d (decimal) or s (string), and the stuff between the colon and the
type is used to specify field width, alignment, sign, padding and
such.


Also. {:b} means binary (i.e. numbers in base 2). I'm not sure what
this leaves for interpolating bytes if we don't want to use {:s}. The
docs at 
http://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
don't show %b so it could still be used there, but it would be nicer
to be consistent.
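
The terms above, illustrated with str.format() as it exists today (the sample value is arbitrary):

```python
value = "caf\u00e9"

# Field name: by position, by name, or omitted.
by_position = "{0}".format(value)
by_name = "{name}".format(name=value)
omitted = "{}".format(value)

# Conversion: applied first, producing a string.
conv_repr = "{!r}".format(value)    # repr()
conv_ascii = "{!a}".format(value)   # ascii()

# Format spec: width, alignment and so on, ending in a type letter.
spec = "{:>8d}".format(42)

# Conversion and format spec together: repr() first, then pad to width 10.
conv_and_spec = "{0!r:>10}".format("x")

# The 'b' type is already taken: it means base 2.
binary = "{:b}".format(5)
```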

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot


On 01/13/2014 01:08 PM, Glenn Linderman wrote:


> +1 - what Ethan said. A real death, instead of death by inappropriately
> transformed data, is fine by me, if b"%s" % str(...) doesn't have the
> appropriate .encode(...) call. But I could live with either.


You mean instead of death by a thousand quotes?  *ducks and runs*

--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Antoine Pitrou

On Mon, 13 Jan 2014 13:32:28 -0800
Guido van Rossum  wrote:
> 
> But formatb() feels absurd to me. PEP 460 has neither a precise
> specification nor any actual examples, so I can't tell whether the
> intention is that the format string can *only* contain {...} sequences
> or whether it can also contain "regular" characters. Translating to
> formatb(), my question comes down to the legality of the following
> example:
> 
>   b'Hello, {}'.formatb(name)  # Where name is some bytes object

Yes, it's allowed. But so is:

  b'\xff\x00{}\x85{}'.formatb(payload, trailer)

The ASCII bias is because of the bytes literal notation.

Regards

Antoine.



Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 12:09 PM, Guido van Rossum wrote:
> Yeah, the %s behavior with a string argument was a messy attempt at
> compromise. I was hoping to mimick a common use of %s in Python 2,
> where it can be used with either an 8-bit string or a number as
> argument, acting like %b in the former case and like %d in the latter
> case. Not having %s at all in Python 3 means that porting requires
> more thinking (== more opportunity for mistakes when you're converting
> in bulk) and there's no easy way to write code that works in Python 2
> and 3.
>
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
>
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.


%s for strictly interpolating bytes eases porting. Sad name, but good 
for compatibility. When the blowup happens, due to having a str type 
passed, the porter adds the appropriate .encode(...) to the parameter, 
so it doesn't blow up on Py 3, and it'll be OK for Py 2 as well, will it 
not?

Re: [Python-Dev] PEP 460 reboot

I will doggedly keep posting to this thread rather than creating more threads.

In another thread, Nick has said he's okay with my proposal (not sure
if that includes %s or not, but it now seems of lesser importance) as
long as we simultaneously introduce formatb() and formatb_map() (the
latter is just a minor variation of the former, so I won't mention it
further).

But formatb() feels absurd to me. PEP 460 has neither a precise
specification nor any actual examples, so I can't tell whether the
intention is that the format string can *only* contain {...} sequences
or whether it can also contain "regular" characters. Translating to
formatb(), my question comes down to the legality of the following
example:

  b'Hello, {}'.formatb(name)  # Where name is some bytes object

If this is allowed, it reintroduces the ASCII bias (since the
substring 'Hello' is clearly ASCII).

If this isn't allowed, it feels like a perversion of the notion of a
"formatting language", and I really don't see the attraction over
using a combination of concatenation and the struct module, perhaps
augmented with some use of bytes([i]) as an alternative to %c or {!c}
(if that is what is meant by PEP 460 with 'c modifier' -- I can't find
the word 'modifier' in the docs for format()).

Note that I honestly don't understand which of these PEP 460 means.
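
For reference, a sketch of the concatenation-plus-struct alternative mentioned above. The frame layout (a big-endian 16-bit length, the payload, then one marker byte) is invented for the example and does not come from any real protocol.

```python
import struct

payload = b"\x01\x02\x03"
marker = 0x85

# struct.pack renders the length as two big-endian bytes; bytes([marker])
# builds a single byte from an integer, the job %c would do in a template.
frame = struct.pack(">H", len(payload)) + payload + bytes([marker])

# Note the list: bytes(3) would instead create three zero bytes.
three_zero_bytes = bytes(3)
```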

Either way, PEP 460's motivation seems kind of subjective and esthetic:

"""
While there are reasonably efficient ways to accumulate binary data
(such as using a bytearray object, the bytes.join method or even
io.BytesIO), none of them leads to the kind of readable and intuitive
code that is produced by a %-formatted or {}-formatted template and a
formatting operation.
"""

I would buy this if a binary format string could contain embedded text
(like 'Hello' in my example above), but then the argument about
avoiding ASCII bias seems to fall apart so I am at a loss about what
Nick actually wants, and even about what PEP 460 actually specifies.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot


Glenn Linderman wrote:

> Quotes in the stream are a great debug hint, without blowing up.


But do you really want those quotes turning up in
a *binary* stream, where they're somewhere between
awkward and near-impossible to spot by eyeballing,
and may only be discovered when something else --
likely a different program, possibly being run
by a different person -- tries to read the data
back, and blows up because the binary format is
corrupted?

I'd much rather it blew up at the writing stage,
myself. Corrupted binary data is *much* harder to
debug than corrupted text, because binary formats
typically have little to no margin for error
before they become complete garbage.

--
Greg

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Mark Lawrence


On 13/01/2014 21:01, Paul Moore wrote:
>
> I think this should be for 3.5, and should not involve an accelerated
> release of 3.5 - we should get it into the 3.5 code early and let
> people thrash out the details during the 3.5 release cycle.

I disagree, it should be on pypi now so people can start trying it out,
or as others have suggested incorporate it into the six module.  Surely
that'd make the job of getting it into 3.5 far easier?

> Paul.
>
> PS For all the heated arguments and occasional frayed tempers, this
> has been an impressively civil debate. I think that's one of the best
> things about python-dev, that discussions like these never degenerate
> into flamewars. Kudos to all concerned!

+1

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence


Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 9:38 AM, Ethan Furman wrote:

On 01/13/2014 09:31 AM, Antoine Pitrou wrote:

On Mon, 13 Jan 2014 08:36:05 -0800
Ethan Furman wrote:


You mean crash all the time?  I'd be fine with that for both the str case
and the bytes case.  But it's probably too late to change the str case,
and the bytes case should mirror what str does.


Let me add something else: str and bytes don't have to be symmetrical.
In Python 2, str and unicode were symmetrical, they allowed exactly the
same operations and were composable.
In Python 3, str and bytes are different beasts; they have different
operations *and* different semantics (for example, bytes interoperates
with bytearray and memoryview, while str doesn't).
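
A short illustration of that asymmetry on Python 3, with arbitrary values: bytes composes with other buffer types, while mixing it with str is an error.

```python
head = b"ab"

# bytes.join() accepts any mix of bytes-like objects.
joined = b"".join([head, bytearray(b"cd"), memoryview(b"ef")])

# Concatenation with a bytearray also works; the left operand's type wins.
concat = head + bytearray(b"gh")

# Mixing text and bytes is refused outright.
try:
    "ab" + b"cd"
except TypeError:
    mixing_text_raises = True
else:
    mixing_text_raises = False
```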


This makes sense to me.

So I guess I'm fine with either the quoted ascii repr or the always
blowing up method, leaning towards the blowing up method.


+1 - what Ethan said. A real death, instead of death by inappropriately
transformed data, is fine by me, if b"%s" % str(...) doesn't have the
appropriate .encode(...) call. But I could live with either.

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 10:40 AM, Brett Cannon wrote:
> This even gives people in-place ASCII encoding for strings by always
> using '{:s}' with text which they can do when they port their code to
> run under both Python 2 and 3. So you should be able to do
> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
> If you want more explicit encoding to latin-1 then you need to do it
> explicitly and not rely on the mini-language to do tricks for you.

> My preference is not to have any, but if Guido is going to say PBP here
> then I want absolute consistency across the board in how bytes.format()
> tweaks things.

> As for %s for the % operator calling ascii(), I think that will be a
> porting nightmare of finding out why your bytes suddenly stopped being
> formatted properly and then having to crawl through all of your code
> for that one use of %s which is getting bytes in. By raising a
> TypeError you will very easily detect where your screw-up occurred
> thanks to the traceback; doing otherwise feels too much like implicit
> type conversion, and ask any JavaScript developer how that can be a bad
> thing.




So quote 3 is necessarily a violation of quote 1.  But if quote 2 can 
allow for one exception to its absolute consistency... that is probably 
the best solution overall...

Re: [Python-Dev] PEP 460 reboot and a bitter fight


On 1/13/2014 5:06 AM, Nick Coghlan wrote:

I figured out tonight that it's only positioning ASCII interpolation
as an *alternative* to adding binary interpolation that I have a
problem with. It isn't, because you lose the structural assurance that
you haven't inadvertently introduced an assumption of ASCII
compatibility when you didn't need to. However, interpolation support
is a convenient enough interface that I can see a version that *only*
supports ASCII compatible interpolation being an attractive nuisance
that becomes a source of hard to detect and fix data corruption bugs
(just like the str type in Python 2).

If we add both, my objections go away: people like me can use the
Python 3 only formatb and formatb_map methods and be confident we
haven't inadvertently introduced any assumptions regarding ASCII
compatibility, while folks that know they're dealing with an ASCII
compatible format can use the ASCII assuming versions that are
designed to be source compatible with Python 2.

If someone incorrectly uses format() or format_map() when they should
be using the pure binary versions, that's a trivial bug fix (adding
the necessary "b", and perhaps some explicit encoding calls) rather
than a major restructuring of the code.

If they use mod-formatting, that's a slightly bigger fix, but still
just switching to a different spelling of the formatting operation.

Both use cases (binary only and ASCII compatible) get covered cleanly,
and nobody has to lose out.

Cheers,
Nick.


As part of that, what about an alternate spelling of % to allow
binary-only interpolation operations using the handy syntax of %?
Doesn't seem like / is defined for bytes or str on the LHS.

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Paul Moore

On 13 January 2014 18:58, Guido van Rossum  wrote:
> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

For the record, Guido's reboot posting and rationale has convinced me,
and I am essentially in favour of his proposal.

Nick's remaining objection seems to me to have some validity if the
format string is a user-supplied variable, but this type of usage is
vanishingly small in my experience, and shouldn't dictate the whole
design.

I don't like b'%s' % 'x' behaviour, and would prefer one of the
alternatives. I'm not entirely clear about the details of the
alternative proposals, so I won't try to pick one.

I think this should be for 3.5, and should not involve an accelerated
release of 3.5 - we should get it into the 3.5 code early and let
people thrash out the details during the 3.5 release cycle.

Paul.

PS For all the heated arguments and occasional frayed tempers, this
has been an impressively civil debate. I think that's one of the best
things about python-dev, that discussions like these never degenerate
into flamewars. Kudos to all concerned!

Re: [Python-Dev] PEP 460 reboot


Guido van Rossum wrote:
> On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote:
>> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>>> enclosed in single quotes)
>>
>> I'm not sure about the quotes.  Would anyone ever actually want those in the
>> byte stream?
>
> Perhaps not, but it's a hint that you should probably think about an
> encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
> it as payback time. :-)


If it's never useful, wouldn't it be better to raise an
exception in this case?

That way, someone porting code from py2 that does this
without appropriate modification will find out about
the problem immediately, rather than have spurious
quotes inserted into their binary data, which -- being
binary data -- will likely go unnoticed until something
else tries to read the data.

I don't think the rule against operations that work on
all-but-one-type really applies here, because the mistake
it's intended to catch is not an obscure corner case.
If your program's logic includes interpolating strings
into bytes objects, then you're going to be testing
that.
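
The two behaviours being weighed can be sketched side by side. Both function names are hypothetical stand-ins written for this illustration: one models the quoted ascii() result, the other the raise-on-text alternative.

```python
def interpolate_quoted(value):
    """Model of the ascii() option: text arrives wrapped in quotes."""
    return ascii(value).encode("ascii")


def interpolate_strict(value):
    """Model of the exception option: anything but bytes is refused."""
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError("expected bytes, got %s" % type(value).__name__)
    return bytes(value)


# The quoted variant succeeds silently and grows the field from 1 to 3 bytes.
record_quoted = b"NAME=" + interpolate_quoted("x")

# The strict variant fails at the point where the record is written.
try:
    interpolate_strict("x")
except TypeError:
    strict_raised = True
else:
    strict_raised = False
```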

--
Greg

Re: [Python-Dev] PEP 460 reboot


On 1/13/2014 1:49 AM, Mark Shannon wrote:

>>> So why not replace '%s' with '%a' for the ascii case and
>>> with '%b' for directly inserting bytes.
>>
>> Because %a and %b don't exist in Python 2.7?
>
> I thought this was about 3.5, not 2.7 ;)
> '%s' can't work in 3.5, as we must differentiate between
> strings which need to be encoded and bytes which don't.


It's about migrating code to reach a point where it can work on both 2.7 
and 3.5.

Re: [Python-Dev] PEP 460 reboot


On 01/13/2014 12:02 PM, Brett Cannon wrote:


> Personally, neither would I; just focus on bytes.format() and let % operator
> on strings slowly go away.


Hey, now, some of us like %!  ;)

--
~Ethan~

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Eric V. Smith

On 01/13/2014 03:09 PM, Guido van Rossum wrote:
> If we have %b for strictly interpolating bytes, I'm fine with adding
> %a for calling ascii() on the argument and then interpolating the
> result after ASCII-encoding it.
> 
> If somehow (unlikely though it seems) we end up keeping %s (e.g.
> strictly to ease porting), we could also keep %r as an alias for %a.

Wouldn't %s as an alias for %b simplify porting from Python 2?



Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 3:11 PM, Yury Selivanov  wrote:
> On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:
>>
>> I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"
>>
>> Instead of "%b" could "%j" mean "I should have used + or join() here
>> but was too lazy" and work on str too?
>
> Isn’t this just error prone? Since it’s a new format character, many,
> probably, would write %s by mistake. And, besides, there was no %j
> in python2.

Merely a flesh wound.

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 12:02 PM, Brett Cannon  wrote:
> On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy  wrote:
>> I personally would not add 'bytes % whatever'.
>
> Personally, neither would I; just focus on bytes.format() and let % operator
> on strings slowly go away.

Well, % has some very strong arguments in its favor still -- for
example, the sheer amount of code that currently uses it, the fact
that it's as close as we get to a cross-language standard, and the
fact that nobody wants to tackle its use in the logging module (since
logger objects are often shared between packages that don't know about
each other).
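
The logging point can be seen in how a log record is built: the template and its arguments are stored separately and merged with % only when the message is rendered. The record below is constructed by hand with made-up values purely to show that step.

```python
import logging

# A record keeps msg and args apart; getMessage() applies msg % args lazily.
record = logging.LogRecord(
    name="demo",
    level=logging.INFO,
    pathname="demo.py",
    lineno=1,
    msg="Content-Length: %s",
    args=(11,),
    exc_info=None,
)
message = record.getMessage()
```

Any library that calls logger.info("...%s...", value) relies on this, which is why switching styles is not a change one package can make alone.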

Anyway, the % or .format() issue seems completely orthogonal to the
issues that get people riled up (which are mostly about whether using
either implies some kind of ASCII compatibility).

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Yury Selivanov

On January 13, 2014 at 3:08:43 PM, Daniel Holth (dho...@gmail.com) wrote:
>
> I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"
>
> Instead of "%b" could "%j" mean "I should have used + or join() here
> but was too lazy" and work on str too?

Isn’t this just error prone? Since it’s a new format character, many,
probably, would write %s by mistake. And, besides, there was no %j
in python2.

-
Yury

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 11:57 AM, Barry Warsaw  wrote:

> On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:

>>On Jan 13, 2014, at 1:58 PM, Guido van Rossum  wrote:

>>> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
>>> clear, and if the noise about that sub-issue is preventing folks from
>>> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
>>> use %b which would require its argument to be bytes. Those bytes
>>> should still probably be ASCII-ish, but there's no way to test that.
>>> That's fine with me and should be fine to Nick as well -- PEP 460
>>> doesn't check that your encodings match (how could it? :-), nor does
>>> plain string concatenation using +.

>>I think disallowing %s is the right thing to do, but I definitely think
>>numbers and %b should be allowed.

> I guess I agree.  The behavior of b'%s' % 'x' returning b"'x'" is almost
> always useless at best.  (I would have thought maybe %a for ascii() but don't
> care that strongly.)

Yeah, the %s behavior with a string argument was a messy attempt at
compromise. I was hoping to mimick a common use of %s in Python 2,
where it can be used with either an 8-bit string or a number as
argument, acting like %b in the former case and like %d in the latter
case. Not having %s at all in Python 3 means that porting requires
more thinking (== more opportunity for mistakes when you're converting
in bulk) and there's no easy way to write code that works in Python 2
and 3.

If we have %b for strictly interpolating bytes, I'm fine with adding
%a for calling ascii() on the argument and then interpolating the
result after ASCII-encoding it.

If somehow (unlikely though it seems) we end up keeping %s (e.g.
strictly to ease porting), we could also keep %r as an alias for %a.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot

I see it now. b"foo%sbar" % b'baz' should also expand to b"foob'baz'bar"

Instead of "%b" could "%j" mean "I should have used + or join() here
but was too lazy" and work on str too?

On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy  wrote:
> On 1/13/2014 1:40 PM, Brett Cannon wrote:
>
>> > So bytes formatting really needn't (and shouldn't, IMO) mirror str
>> > formatting.
>
>
> This was my presumption in writing byteformat().
>
>
>> I think one of the things about Guido's proposal that bugs me is that it
>> breaks the mental model of the .format() method from str in terms of how
>> the mini-language works. For str.format() you have the conversion and
>> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the
>> conversion by calling the appropriate built-in, e.g. 'r' calls repr().
>> The format spec semantically gets passed with the object to format()
>> which calls the object's __format__() method: ``format(number, 'd')``.
>>
>> Now Guido's suggestion has two parts that affect the mini-language for
>> .format(). One is that for bytes.format() the default conversion is
>> bytes() instead of str(), which is fine (probably want to add 'b' as a
>> conversion value as well to be consistent). But the other bit is that
>> the format spec goes from semantically meaning ``format(thing,
>> format_spec)`` to ``format(thing, format_spec).encode('ascii',
>> 'strict')`` for at least numbers. That implicitness bugs me as I have
>> always thought of format specs just leading to a call to format(). I
>> think I can live with it, though, as long as it is **consistently**
>> applied across the board for bytes.format(); every use of a format spec
>> leads to calling ``format(thing, format_spec).encode('ascii',
>> 'strict')`` no matter what type 'thing' would be and it is clearly
>> documented that this is done to ease porting and handle the common case
>> then I can live with it.
>
>
> This is how my byteformat function works, except that when no format_spec is
> given, bytes and bytearray objects are left unchanged rather than being
> decoded and encoded again.
>
>
>> This even gives people in-place ASCII encoding for strings by always
>> using '{:s}' with text which they can do when they port their code to
>> run under both Python 2 and 3. So you should be able to do
>> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
>> If you want more explicit encoding to latin-1 then you need to do it
>> explicitly and not rely on the mini-language to do tricks for you.
>>
>> IOW I want to treat the format mini-language as a language and thus not
>> have any special-casing or massive shifts in meaning between
>> str.format() and bytes.format() so my mental model doesn't have to
>> contort based on whether it's str or bytes. My preference is to not have
>> any, but if Guido is going to say PBP here then I want absolute consistency
>> across the board in how bytes.format() tweaks things.
>>
>> As for %s for the % operator calling ascii(), I think that will be a
>> porting nightmare of finding out why your bytes suddenly stopped being
>> formatted properly and then having to crawl through all of your code for
>> that one use of %s which is getting bytes in. By raising a TypeError you
>> will very easily detect where your screw-up occurred thanks to the
>> traceback; doing otherwise feels too much like implicit type conversion
>> and ask any JavaScript developer how that can be a bad thing.
>
>
> I personally would not add 'bytes % whatever'.
>
> --
> Terry Jan Reedy
>
>

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 2:51 PM, Terry Reedy  wrote:

> On 1/13/2014 1:40 PM, Brett Cannon wrote:
>
>> > So bytes formatting really needn't (and shouldn't, IMO) mirror str
>> > formatting.
>
>
> This was my presumption in writing byteformat().
>
>
>> I think one of the things about Guido's proposal that bugs me is that it
>> breaks the mental model of the .format() method from str in terms of how
>> the mini-language works. For str.format() you have the conversion and
>> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the
>> conversion by calling the appropriate built-in, e.g. 'r' calls repr().
>> The format spec semantically gets passed with the object to format()
>> which calls the object's __format__() method: ``format(number, 'd')``.
>>
>> Now Guido's suggestion has two parts that affect the mini-language for
>> .format(). One is that for bytes.format() the default conversion is
>> bytes() instead of str(), which is fine (probably want to add 'b' as a
>> conversion value as well to be consistent). But the other bit is that
>> the format spec goes from semantically meaning ``format(thing,
>> format_spec)`` to ``format(thing, format_spec).encode('ascii',
>> 'strict')`` for at least numbers. That implicitness bugs me as I have
>> always thought of format specs just leading to a call to format(). I
>> think I can live with it, though, as long as it is **consistently**
>> applied across the board for bytes.format(); every use of a format spec
>> leads to calling ``format(thing, format_spec).encode('ascii',
>> 'strict')`` no matter what type 'thing' would be and it is clearly
>> documented that this is done to ease porting and handle the common case
>> then I can live with it.
>>
>
> This is how my byteformat function works, except that when no format_spec
> is given, bytes and bytearray objects are left unchanged rather than being
> decoded and encoded again.


Right, which is what the default conversion covers. And as your code shows
this can be made available today without having to wait for Python 3.5 and
so can go up on PyPI and be used **today**.


>
>
>> This even gives people in-place ASCII encoding for strings by always
>> using '{:s}' with text which they can do when they port their code to
>> run under both Python 2 and 3. So you should be able to do
>> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
>> If you want more explicit encoding to latin-1 then you need to do it
>> explicitly and not rely on the mini-language to do tricks for you.
>>
>> IOW I want to treat the format mini-language as a language and thus not
>> have any special-casing or massive shifts in meaning between
>> str.format() and bytes.format() so my mental model doesn't have to
>> contort based on whether it's str or bytes. My preference is to not have
>> any, but if Guido is going to say PBP here then I want absolute consistency
>> across the board in how bytes.format() tweaks things.
>>
>> As for %s for the % operator calling ascii(), I think that will be a
>> porting nightmare of finding out why your bytes suddenly stopped being
>> formatted properly and then having to crawl through all of your code for
>> that one use of %s which is getting bytes in. By raising a TypeError you
>> will very easily detect where your screw-up occurred thanks to the
>> traceback; doing otherwise feels too much like implicit type conversion
>> and ask any JavaScript developer how that can be a bad thing.
>>
>
> I personally would not add 'bytes % whatever'.


Personally, neither would I; just focus on bytes.format() and let %
operator on strings slowly go away.

-Brett


>
>
> --
> Terry Jan Reedy
>
>

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Barry Warsaw

On Jan 13, 2014, at 02:13 PM, Donald Stufft wrote:

>
>On Jan 13, 2014, at 1:58 PM, Guido van Rossum  wrote:
>
>> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
>> clear, and if the noise about that sub-issue is preventing folks from
>> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
>> use %b which would require its argument to be bytes. Those bytes
>> should still probably be ASCII-ish, but there's no way to test that.
>> That's fine with me and should be fine to Nick as well -- PEP 460
>> doesn't check that your encodings match (how could it? :-), nor does
>> plain string concatenation using +.
>
>I think disallowing %s is the right thing to do, but I definitely think
>numbers and %b should be allowed.

I guess I agree.  The behavior of b'%s' % 'x' returning b"'x'" is almost
always useless at best.  (I would have thought maybe %a for ascii() but don't
care that strongly.)

-Barry



Re: [Python-Dev] PEP 460 reboot

On 1/13/2014 1:40 PM, Brett Cannon wrote:

> > So bytes formatting really needn't (and shouldn't, IMO) mirror str
> > formatting.

This was my presumption in writing byteformat().

> I think one of the things about Guido's proposal that bugs me is that it
> breaks the mental model of the .format() method from str in terms of how
> the mini-language works. For str.format() you have the conversion and
> the format spec (e.g. "{!r}" and "{:d}", respectively). You apply the
> conversion by calling the appropriate built-in, e.g. 'r' calls repr().
> The format spec semantically gets passed with the object to format()
> which calls the object's __format__() method: ``format(number, 'd')``.
>
> Now Guido's suggestion has two parts that affect the mini-language for
> .format(). One is that for bytes.format() the default conversion is
> bytes() instead of str(), which is fine (probably want to add 'b' as a
> conversion value as well to be consistent). But the other bit is that
> the format spec goes from semantically meaning ``format(thing,
> format_spec)`` to ``format(thing, format_spec).encode('ascii',
> 'strict')`` for at least numbers. That implicitness bugs me as I have
> always thought of format specs just leading to a call to format(). I
> think I can live with it, though, as long as it is **consistently**
> applied across the board for bytes.format(); every use of a format spec
> leads to calling ``format(thing, format_spec).encode('ascii',
> 'strict')`` no matter what type 'thing' would be and it is clearly
> documented that this is done to ease porting and handle the common case
> then I can live with it.

This is how my byteformat function works, except that when no
format_spec is given, bytes and bytearray objects are left unchanged
rather than being decoded and encoded again.

> This even gives people in-place ASCII encoding for strings by always
> using '{:s}' with text which they can do when they port their code to
> run under both Python 2 and 3. So you should be able to do
> ``b'Content-Type: {:s}'.format('image/jpeg')`` and have it give ASCII.
> If you want more explicit encoding to latin-1 then you need to do it
> explicitly and not rely on the mini-language to do tricks for you.
>
> IOW I want to treat the format mini-language as a language and thus not
> have any special-casing or massive shifts in meaning between
> str.format() and bytes.format() so my mental model doesn't have to
> contort based on whether it's str or bytes. My preference is to not have
> any, but if Guido is going to say PBP here then I want absolute consistency
> across the board in how bytes.format() tweaks things.
>
> As for %s for the % operator calling ascii(), I think that will be a
> porting nightmare of finding out why your bytes suddenly stopped being
> formatted properly and then having to crawl through all of your code for
> that one use of %s which is getting bytes in. By raising a TypeError you
> will very easily detect where your screw-up occurred thanks to the
> traceback; doing otherwise feels too much like implicit type conversion
> and ask any JavaScript developer how that can be a bad thing.

I personally would not add 'bytes % whatever'.

--
Terry Jan Reedy


Re: [Python-Dev] PEP 460 reboot


On Jan 13, 2014, at 1:58 PM, Guido van Rossum  wrote:

> I hear the objections against b'%s' % 'x' returning b"'x'" loud and
> clear, and if the noise about that sub-issue is preventing folks from
> seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
> use %b which would require its argument to be bytes. Those bytes
> should still probably be ASCII-ish, but there's no way to test that.
> That's fine with me and should be fine to Nick as well -- PEP 460
> doesn't check that your encodings match (how could it? :-), nor does
> plain string concatenation using +.

I think disallowing %s is the right thing to do, but I definitely think numbers
and %b should be allowed.

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA




Re: [Python-Dev] PEP 460 reboot

Let me try rebooting the reboot.

My interpretation of Nick's argument is that he is asking for a bytes
formatting language that doesn't have an implicit ASCII assumption.

To me this feels absurd. The formatting codes (%s, %c) themselves are
expressed as ASCII characters. If you include anything else in the
format string besides formatting codes (e.g. b'<%s>'), you are giving
it as ASCII characters. I don't know what characters the EBCDIC codes
37, 99 or 115 encode (these are the ASCII codes for '%', 'c', 's') but
it certainly wouldn't be safe to use % when the LHS is EBCDIC-encoded.
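
To make the EBCDIC point concrete, here is a small sketch. It assumes
Python 3 and uses code page 037 (the "cp037" codec) as a representative
EBCDIC variant; that choice and the variable names are illustrative
assumptions, not something named in this thread.

```python
# The format codes mentioned above, as byte values.
percent, c_code, s_code = b"%cs"   # unpacking bytes yields ints

# The same three byte values read as EBCDIC (code page 037, assumed
# here as a representative variant) decode to different characters.
as_ebcdic = bytes([37, 99, 115]).decode("cp037")

# Conversely, an EBCDIC-encoded '%' is not byte 37, so the % operator
# would not find its format codes in an EBCDIC-encoded template.
ebcdic_percent = "%".encode("cp037")
```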

If I had some byte strings in an unknown encoding (but the same
encoding for all) that I needed to concatenate I would never think of
'%s%s' % (x, y) -- I would write x+y. (Even in Python 2.)

If I see some code using *any* formatting operation (regardless of
whether it's %d, %r, %s or %c) I am going to assume that there is some
ASCII-ness, and if there isn't, the code's author has obscured their
goal to me.

I hear the objections against b'%s' % 'x' returning b"'x'" loud and
clear, and if the noise about that sub-issue is preventing folks from
seeing the absurdity in PEP 460, we can talk about a compromise, e.g.
use %b which would require its argument to be bytes. Those bytes
should still probably be ASCII-ish, but there's no way to test that.
That's fine with me and should be fine to Nick as well -- PEP 460
doesn't check that your encodings match (how could it? :-), nor does
plain string concatenation using +.

In my head I make the following classification of situations where you
work with bytes and/or text.

(A) Pure binary formats (e.g. most IP-level packet formats, media
files, .pyc files, tar/zip files, compressed data, etc.). These are
handled using the struct module (e.g. tar/zip) and/or custom C
extensions (e.g. gzip).

(B) Encoded text. Here you should just decode everything into str
objects and parse your text at that level. If you really want to
manipulate the data as bytes (e.g. because you have a lot of data to
process and very light processing) you may be able to do it, but
unless it's a verbatim copy, you are probably going to make
assumptions about the encoding. You are also probably going to mess up
for some encodings (e.g. leave BOM turds in the middle of a file).

(C) Loosely text-based protocols and formats that have an ASCII
assumption in the spec. Most classic Internet protocols (FTP, SMTP,
HTTP, IRC, etc.) fall in this category; I expect there are also plenty
of file formats using similar conventions (e.g. mailbox files). These
protocols and formats often require text-ish manipulations, e.g. for
case-insensitive headers or commands, or to split things at
whitespace. This is where I find uses for the current ASCII-assuming
bytes operations (e.g. b.lower(), b.split(), but also int(b)) and
where the lack of number formatting (especially %d and %x) is most
painful. I see no benefit in forcing the programmer writing such
protocol-handling code to use more cumbersome ways of converting
between numbers and bytes, nor in forcing them to insert an
encoding/decoding layer -- these protocols often switch between text
and binary data at line boundaries, so the most basic part of parsing
(splitting the input into lines) must still happen in the realm of
bytes.
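
A rough sketch of the kind of category (C) code described above, assuming
Python 3.5 or later for b'%d' (which PEP 461 added after this thread). The
message contents and variable names are invented for the example.

```python
# Assumes Python 3.5+ for b'%d'. The message content is made up.
raw = b"Content-Length: 5\r\nCONTENT-TYPE: text/plain\r\n\r\n\x00\x01\x02\x03\x04"

# Split the ASCII-ish header part from the binary body while still in bytes.
head, _, body = raw.partition(b"\r\n\r\n")

headers = {}
for line in head.split(b"\r\n"):
    name, _, value = line.partition(b":")
    headers[name.strip().lower()] = value.strip()   # case-insensitive names

# int() accepts ASCII digits held in a bytes object.
length = int(headers[b"content-length"])

# Number formatting straight into bytes, followed by the raw payload.
reply = b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % length + body[:length]
```

Nothing here decodes to str: the header handling relies on the
ASCII-assuming bytes methods, and the body is copied through untouched.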

IMO PEP 460 and the mindset that goes with it don't apply to any of
these three cases.

Also, IMO requiring a new type to handle (C) seems to add too
much complexity, and adds to porting efforts. I may have felt
differently in the past, but ATM I feel that if newer versions of
Python 3 make porting of Python 2 code easier, through minor
compromises, that's a *good* thing. (Example: adding u"..." literals
to 3.3.)

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] PEP 460 reboot


On 01/13/2014 09:12 AM, Nick Coghlan wrote:
> On 14 January 2014 01:54, Ethan Furman wrote:
>>
>> Forgive me for being dense, but I don't understand your objection.  With
>> Guido's proposal, '%s' % bytes_data, bytes_data is passed through unchanged.
>> Did you mean something else by "binary data"?
>
> I mean it will work, but it will mean you've introduced an implicit
> assumption of ASCII compatibility into the structure of your program

Okay, I'm still trying to understand.  Apparently we both mean the same
thing by binary data / bytes, so the difference must be the %s, yes?  And
the concern is that because you have used %s as the format code, if
somebody accidentally put, say, "stupid bug" on the RHS you would end up
with b"'stupid bug'" instead of an exception, which you would get if you
had used %b instead.  Am I following?


--
~Ethan~

Re: [Python-Dev] PEP 460 reboot


On Jan 13, 2014, at 1:45 PM, Daniel Holth  wrote:

> On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray wrote:
>> On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou wrote:
>>> On Sun, 12 Jan 2014 18:11:47 -0800
>>> Guido van Rossum wrote:
>>>> On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman wrote:
>>>>> On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>>>>>> %s seems the trickiest: I think with a bytes argument it should just
>>>>>> insert those bytes (and the padding modifiers should work too), and
>>>>>> for other types it should probably work like %a, so that it works as
>>>>>> expected for numeric values, and with a string argument it will return
>>>>>> the ascii()-variant of its repr(). Examples:
>>>>>>
>>>>>> b'%s' % 42 == b'42'
>>>>>> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>>>>>> enclosed in single quotes)
>>>>>
>>>>> I'm not sure about the quotes.  Would anyone ever actually want those in
>>>>> the byte stream?
>>>>
>>>> Perhaps not, but it's a hint that you should probably think about an
>>>> encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
>>>> it as payback time. :-)
>>> 
>>> What is the use case for embedding a quoted ASCII-encoded representation
>>> in a byte stream?
>> 
>> There is no use case in the sense you are asking, just like there is no
>> real use case for '%s' % b'x' producing "b'x'".  But the real use case
>> is exactly the same: to let you know your code is screwed up without
>> actually blowing up with an encoding Exception.
>> 
>> For the record, I like Guido's logic and proposal.  I don't understand
>> Nick's objection, since I don't see the difference between the situation
>> here where a string gets interpolated into bytes as 'xxx' and the
>> corresponding situation where bytes gets interpolated into a string
>> as b'xxx'.  Why struggle to keep bytes interpolation "pure" if string
>> interpolation isn't?
>> 
>> Guido's proposal makes the language more symmetric, and thus more
>> consistent and less surprising.  Exactly the hallmarks of Python's design
>> sense, IMO.  (Big surprise, right? :)
>> 
>> Of course, this point of view *is* based on the idea that when you are
>> doing interpolation using %/.format, you are in fact primarily concerned
>> with ASCII compatible byte streams.  This is a Practicality sort of
>> argument.  It is, after all, by far the most common use case when
>> doing interpolation[*].
>> 
>> If you wanted to do a purist version of this symmetry, you'd have bytes(x)
>> calling __bytes__ if it was defined and falling back to calling a
>> __brepr__ otherwise.
>> 
>> But what would __brepr__ implement?  The variety of format codes in
>> the struct module argues that there is no "one obvious" binary
>> repr for most types.  (Those that have one would implement __bytes__).
>> And what would be the __brepr__ of an arbitrary 'object'?
>> 
>> Faced with the impracticality of defining __brepr__ usefully in any "pure
>> bytes" form, it seems sensible to admit that the most useful __brepr__
>> is the ascii() encoding of the __repr__.  Which naturally produces 'xxx'
>> as the __brepr__ of a string.
>> 
>> This does cause things to get a little un-pretty when you are operating
>> at the python prompt:
>> 
>> >>> b'%s' % object
>> b'""'
>> 
>> But then again that is most likely really not what you mean to do, so
>> it becomes a big red flag...just like b'xxx' is a small red flag when
>> you accidentally interpolate unencoded bytes into a string.
>> 
>> --David
>> 
>> PS: When I first read Guido's remark that the result of interpolating a
>> string should be 'xxx', I went Wah?  I had to reason my way through to
>> it as above, but to him it was just the natural answer.  Guido isn't
>> always right, but this kind of automatic language design consistency
>> is one reason he's the BDFL.
>> 
>> [*] I still think that you mostly want to design your library so that
>> you are handling the text parts as text and the bytes parts as bytes,
>> and encoding/gluing them as appropriate at the IO boundary.  But if Guido
>> says his real code would benefit by being able to interpolate ASCII into
>> bytes at certain points, I'll believe him.
> 
> 
> 
> If you think corrupted data is easier or more pleasant to track down
> than encoding exceptions then I think you are strange. It makes
> porting really difficult while you are still trying to figure out
> where the bytes/str boundaries are. I am now deeply suspicious of all
> % formatting.

For the record, I think %d and %f and such, where the RHS is guaranteed to
produce a certain set of “characters” that are guaranteed to be ASCII
compatible, are fine, and it’s perfectly acceptable to have an implicit
ASCII encode for them. The %s code

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 12:42 PM, R. David Murray  wrote:
> On Mon, 13 Jan 2014 12:41:18 +0100, Antoine Pitrou wrote:
>> On Sun, 12 Jan 2014 18:11:47 -0800
>> Guido van Rossum  wrote:
>> > On Sun, Jan 12, 2014 at 5:27 PM, Ethan Furman  wrote:
>> > > On 01/12/2014 04:47 PM, Guido van Rossum wrote:
>> > >> %s seems the trickiest: I think with a bytes argument it should just
>> > >> insert those bytes (and the padding modifiers should work too), and
>> > >> for other types it should probably work like %a, so that it works as
>> > >> expected for numeric values, and with a string argument it will return
>> > >> the ascii()-variant of its repr(). Examples:
>> > >>
>> > >> b'%s' % 42 == b'42'
>> > >> b'%s' % 'x' == b"'x'" (i.e. the three-byte string containing an 'x'
>> > >> enclosed in single quotes)
>> > >
>> > > I'm not sure about the quotes.  Would anyone ever actually want those in
>> > > the byte stream?
>> >
>> > Perhaps not, but it's a hint that you should probably think about an
>> > encoding. It's symmetric with how '%s' % b'x' returns "b'x'". Think of
>> > it as payback time. :-)
>>
>> What is the use case for embedding a quoted ASCII-encoded representation
>> in a byte stream?
>
> There is no use case in the sense you are asking, just like there is no
> real use case for '%s' % b'x' producing "b'x'".  But the real use case
> is exactly the same: to let you know your code is screwed up without
> actually blowing up with an encoding Exception.
>
> For the record, I like Guido's logic and proposal.  I don't understand
> Nick's objection, since I don't see the difference between the situation
> here where a string gets interpolated into bytes as 'xxx' and the
> corresponding situation where bytes gets interpolated into a string
> as b'xxx'.  Why struggle to keep bytes interpolation "pure" if string
> interpolation isn't?
>
> Guido's proposal makes the language more symmetric, and thus more
> consistent and less surprising.  Exactly the hallmarks of Python's design
> sense, IMO.  (Big surprise, right? :)
>
> Of course, this point of view *is* based on the idea that when you are
> doing interpolation using %/.format, you are in fact primarily concerned
> with ASCII compatible byte streams.  This is a Practicality sort of
> argument.  It is, after all, by far the most common use case when
> doing interpolation[*].
>
> If you wanted to do a purist version of this symmetry, you'd have bytes(x)
> calling __bytes__ if it was defined and falling back to calling a
> __brepr__ otherwise.
>
> But what would __brepr__ implement?  The variety of format codes in
> the struct module argues that there is no "one obvious" binary
> repr for most types.  (Those that have one would implement __bytes__).
> And what would be the __brepr__ of an arbitrary 'object'?
>
> Faced with the impracticality of defining __brepr__ usefully in any "pure
> bytes" form, it seems sensible to admit that the most useful __brepr__
> is the ascii() encoding of the __repr__.  Which naturally produces 'xxx'
> as the __brepr__ of a string.
>
> This does cause things to get a little un-pretty when you are operating
> at the python prompt:
>
> >>> b'%s' % object
> b'""'
>
> But then again that is most likely really not what you mean to do, so
> it becomes a big red flag...just like b'xxx' is a small red flag when
> you accidentally interpolate unencoded bytes into a string.
>
> --David
>
> PS: When I first read Guido's remark that the result of interpolating a
> string should be 'xxx', I went Wah?  I had to reason my way through to
> it as above, but to him it was just the natural answer.  Guido isn't
> always right, but this kind of automatic language design consistency
> is one reason he's the BDFL.
>
> [*] I still think that you mostly want to design your library so that
> you are handling the text parts as text and the bytes parts as bytes,
> and encoding/gluing them as appropriate at the IO boundary.  But if Guido
> says his real code would benefit by being able to interpolate ASCII into
> bytes at certain points, I'll believe him.



If you think corrupted data is easier or more pleasant to track down
than encoding exceptions then I think you are strange. It makes
porting really difficult while you are still trying to figure out
where the bytes/str boundaries are. I am now deeply suspicious of all
% formatting.

Re: [Python-Dev] PEP 460 reboot

On Mon, Jan 13, 2014 at 12:31 PM, Antoine Pitrou wrote:

> On Mon, 13 Jan 2014 08:36:05 -0800
> Ethan Furman  wrote:
>
> > On 01/13/2014 08:09 AM, Antoine Pitrou wrote:
> > > On Mon, 13 Jan 2014 07:59:10 -0800
> > > Guido van Rossum  wrote:
> > >> On Mon, Jan 13, 2014 at 3:41 AM, Antoine Pitrou wrote:
> > >>> What is the use case for embedding a quoted ASCII-encoded
> > >>> representation in a byte stream?
> > >>
> > >> It doesn't crash but produces undesired output (always, not only when
> > >> the data is non-ASCII) that gives the developer a hint to think about
> > >> encoding to bytes.
> > >
> > > But why is it better to give a hint by producing undesired output
> > > (which may actually go unnoticed for some time and produce issues down
> > > the road), rather than simply by raising TypeError?
> >
> > You mean crash all the time?  I'd be fine with that for both the str case
> > and the bytes case.  But it's probably too late
> > to change the str case, and the bytes case should mirror what str does.
>
> Let me add something else: str and bytes don't have to be symmetrical.
> In Python 2, str and unicode were symmetrical, they allowed exactly the
> same operations and were composable.
> In Python 3, str and bytes are different beasts; they have different
> operations *and* different semantics (for example, bytes interoperates
> with bytearray and memoryview, while str doesn't).
>

This is also why the int type doesn't have a __bytes__ method (ignoring the
use of an integer as the argument to bytes()): it's universally defined what
str(10) should return, but who knows what you want when you ask for the bytes
of 10 (e.g. base-2, ASCII, UTF-16, etc.).
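
A rough illustration of that ambiguity, assuming Python 3; the dictionary
and its labels are made up for the example.

```python
n = 10

# str() has one obvious answer.
text = str(n)

# bytes(n) does not encode the number; it builds n zero bytes.
zeros = bytes(n)

# Several defensible "bytes of 10", none of them the one obvious choice.
candidates = {
    "ascii decimal": str(n).encode("ascii"),
    "base-2 digits": format(n, "b").encode("ascii"),
    "utf-16-le": str(n).encode("utf-16-le"),
    "one raw byte": n.to_bytes(1, "big"),
}
```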

>
> So bytes formatting really needn't (and shouldn't, IMO) mirror str
> formatting.
>

I think one of the things about Guido's proposal that bugs me is that it
breaks the mental model of the .format() method from str in terms of how
the mini-language works. For str.format() you have the conversion and the
format spec (e.g. "{!r}" and "{:d}", respectively). You apply the
conversion by calling the appropriate built-in, e.g. 'r' calls repr(). The
format spec semantically gets passed with the object to format() which
calls the object's __format__() method: ``format(number, 'd')``.
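
The two halves of that mental model can be checked directly on str. This
is a small sketch assuming Python 3; the variable names are illustrative.

```python
number = 42
text = "x"

# Conversion: "!r" calls repr() on the argument before formatting.
with_conversion = "{!r}".format(text)

# Format spec: ":d" is handed to format(), which calls __format__().
with_spec = "{:d}".format(number)
padded = "{:>5d}".format(number)
```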

Now Guido's suggestion has two parts that affect the mini-language for
.format(). One is that for bytes.format() the default conversion is bytes()
instead of str(), which is fine (probably want to add 'b' as a conversion
value as well to be consistent). But the other bit is that the format spec
goes from semantically meaning ``format(thing, format_spec)`` to
``format(thing, format_spec).encode('ascii', 'strict')`` for at least
numbers. That implicitness bugs me as I have always thought of format specs
as just leading to a call to format(). I think I can live with it, though, as
long as it is **consistently** applied across the board for bytes.format():
every use of a format spec leads to calling ``format(thing,
format_spec).encode('ascii', 'strict')`` no matter what type 'thing' is,
and it is clearly documented that this is done to ease porting and
handle the common case.

This even gives people in-place ASCII encoding for strings by always using
'{:s}' with text which they can do when they port their code to run under
both Python 2 and 3. So you should be able to do ``b'Content-Type:
{:s}'.format('image/jpeg')`` and have it give ASCII. If you want more
explicit encoding to latin-1 then you need to do it explicitly and not rely
on the mini-language to do tricks for you.
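
Since bytes.format() is only a proposal in this thread, the consistent rule can only be sketched with a stand-in. The helper name ``format_field_as_bytes`` below is hypothetical; it applies ``format(thing, format_spec).encode('ascii', 'strict')`` to every field regardless of type:

```python
def format_field_as_bytes(thing, format_spec):
    """Hypothetical helper: format any field, then strictly ASCII-encode it."""
    return format(thing, format_spec).encode('ascii', 'strict')

# The same rule applies to text and to numbers.
header = b'Content-Type: ' + format_field_as_bytes('image/jpeg', 's')
number = format_field_as_bytes(42, 'd')

# Non-ASCII text fails loudly instead of guessing an encoding.
try:
    format_field_as_bytes('caf\u00e9', 's')
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# Anything other than ASCII has to be encoded explicitly by the caller.
explicit = 'caf\u00e9'.encode('latin-1')
```

Under this reading the mini-language never picks an encoding other than strict ASCII; latin-1 stays the caller's decision.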

IOW I want to treat the format mini-language as a language and thus not
have any special-casing or massive shifts in meaning between str.format()
and bytes.format() so my mental model doesn't have to contort based on
whether it's str or bytes. My preference is not to have any, but if Guido is
going to say PBP here then I want absolute consistency across the board in how
bytes.format() tweaks things.

As for %s for the % operator calling ascii(), I think that will be a
porting nightmare of finding out why your bytes suddenly stopped being
formatted properly and then having to crawl through all of your code for
that one use of %s which is getting bytes in. By raising a TypeError you
will very easily detect where your screw-up occurred thanks to the
traceback; doing otherwise feels too much like implicit type conversion, and
any JavaScript developer can tell you how bad that can be.
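
A rough sketch of the two behaviours being compared, under the assumption that "%s calling ascii()" means the value's ascii() repr is encoded into the output. Both helper names are hypothetical and neither is an existing bytes API:

```python
def lenient_field(value):
    """Hypothetical: always succeed by falling back to the ascii() repr."""
    return ascii(value).encode('ascii')

def strict_field(value):
    """Hypothetical: accept only bytes-like values, otherwise raise."""
    if not isinstance(value, (bytes, bytearray, memoryview)):
        raise TypeError('expected bytes, got ' + type(value).__name__)
    return bytes(value)

# The lenient path silently embeds quotes in the wire data.
silently_wrong = b'name=' + lenient_field('guido')

# The strict path fails at the offending call, so the traceback points at it.
try:
    strict_field('guido')
except TypeError:
    caught_at_source = True
else:
    caught_at_source = False
```

The lenient result is well-formed bytes with the wrong content, which is the kind of bug that only surfaces far from where it was introduced.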

-Brett

>
> (the only reason I used "%s" in PEP 460 is to allow a migration path
> from 2.x bytes-formatting to 3.x bytes-formatting; in a really "pure"
> proposal it would have been called something else)
>
> Regards
>
> Antoine.
>
>

Re: [Python-Dev] PEP 460 reboot

2014-01-13 Thread Georg Brandl

Am 13.01.2014 18:38, schrieb Ethan Furman:
> On 01/13/2014 09:31 AM, Antoine Pitrou wrote:
>> On Mon, 13 Jan 2014 08:36:05 -0800 Ethan Furman wrote:
>>> 
>>> You mean crash all the time?  I'd be fine with that for both the str
>>> case and the bytes case.  But it's probably too late to change the str case,
>>> and the bytes case should mirror what str does.
>> 
>> Let me add something else: str and bytes don't have to be symmetrical. In
>> Python 2, str and unicode were symmetrical, they allowed exactly the same
>> operations and were composable. In Python 3, str and bytes are different
>> beasts; they have different operations *and* different semantics (for
>> example, bytes interoperates with bytearray and memoryview, while str
>> doesn't).
> 
> This makes sense to me.
> 
> So I guess I'm fine with either the quoted ascii repr or the always blowing
> up method, leaning towards the blowing up method.

+1.

Georg


Re: [Python-Dev] PEP 460 reboot