Re: [Python-Dev] PEP 461 Final?

2014-01-20 Thread Ethan Furman

On 01/19/2014 11:10 PM, Stephen J. Turnbull wrote:

Ethan Furman writes:

This argument is specious.
  
   I don't think so.  I think it's a good argument for the future of
   Python code.

I agree that restricting bytes '%'-formatting to ASCII is a good idea,
but you should base your arguments on a correct description of what's
going on.  It's not an issue of representability.  It's an issue of
we should support this for ASCII because it's a useful, nearly
universal convention, and we should not support ASCII supersets
because that leads to mojibake.

   Then you could have your text /and/ your numbers be in your own
   language.

My language uses numerals other than those in the ASCII repertoire in
a rather stylized way.  I can't use __format__ for that, because it
depends on context, anyway.  Most of the time the digits in the ASCII
set are used (especially in tables and the like).  I believe that's
true for all languages nowadays.

   Lots of features can be abused.  That doesn't mean we shouldn't
   talk about the intended use cases and encourage those.

I only objected to claims that issues of representability and what
I can do with __format__ support the preferred use cases, not to
descriptions of the preferred use cases.


Thank you.  I appreciate your time.

--
~Ethan~
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Steven D'Aprano
On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:
 On 01/17/2014 05:27 PM, Steven D'Aprano wrote:

 Numeric Format Codes
 
 
 To properly handle int and float subclasses, int(), index(), and float()
 will be called on the objects intended for (d, i, u), (b, o, x, X), and
 (e, E, f, F, g, G).
 
 
 -1 on this idea.
 
 This is a rather large violation of the principle of least surprise, and
 radically different from the behaviour of Python 3 str. In Python 3,
 '%d' interpolation calls the __str__ method, so if you subclass, you can
 get the behaviour you want:
 
 Did you read the bug reports I linked to?  This behavior (which is a bug) 
 has already been fixed for Python3.4.

No I didn't. This thread is huge, and it's only one of a number of huge 
threads about the same bytes/unicode Python 2/3 stuff. I'm probably 
not the only person who missed the bug reports you linked to.

If these bug reports are relevant to the PEP, you ought to list them in 
the PEP, and if they aren't relevant, I shan't be reading them *wink*

In any case, whether I have succeeded in making the case against this 
aspect of the PEP or not, I think you should:

- explain what you mean by "properly handle" (give an example?);

- justify why b'%d' % obj should ignore any relevant overloaded 
  methods in obj;

- if there are similar, existing, examples of this (to me) 
  surprising behaviour, you should briefly mention them;

- note that there was some opposition to the suggestion;

- and explain why the contrary behaviour (i.e. allowing obj to
  overload b'%d') is not desirable.



 As a quick thought experiment, why does '%d' % True return '1'?

I don't know. Perhaps it is a bug?


-- 
Steven


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Ethan Furman

On 01/19/2014 03:37 AM, Steven D'Aprano wrote:

On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:

On 01/17/2014 05:27 PM, Steven D'Aprano wrote:



Numeric Format Codes


To properly handle int and float subclasses, int(), index(), and float()
will be called on the objects intended for (d, i, u), (b, o, x, X), and
(e, E, f, F, g, G).



-1 on this idea.

This is a rather large violation of the principle of least surprise, and
radically different from the behaviour of Python 3 str. In Python 3,
'%d' interpolation calls the __str__ method, so if you subclass, you can
get the behaviour you want:


Did you read the bug reports I linked to?  This behavior (which is a bug)
has already been fixed for Python3.4.


No I didn't. This thread is huge, and it's only one of a number of huge
threads about the same bytes/unicode Python 2/3 stuff. I'm probably
not the only person who missed the bug reports you linked to.


Fair point.



If these bug reports are relevant to the PEP, you ought to list them in
the PEP, and if they aren't relevant, I shan't be reading them *wink*


<mischievous grin>
Well, it seems to me they are more relevant to your misunderstanding of how %d and friends should work rather than to
the PEP itself.  However, I suppose it is possible you're not the only one so affected, so I'll link them in.

</mischievous grin>



In any case, whether I have succeeded in making the case against this
aspect of the PEP or not


Not.  This was a bug that was fixed long before the PEP came into existence.



As a quick thought experiment, why does '%d' % True return '1'?


I don't know. Perhaps it is a bug?


To summarize a rather long issue, %d and friends are /numeric/ codes; returning non-numeric text is inappropriate.  Yes,
I realize there are other unicode values that also mean numeric digits, but they do not mean (so far as I know) Decimal
digits, or Hexadecimal digits, or Octal digits.  (Obviously an ASCII slant going on there.)


Now that I've written that down, I think there are, in fact, other scripts that represent a base-10 number system with
obviously different glyphs for the numbers.  Well, that means that this PEP just further strengthens the notion that
format is for text (as then a custom numeric type could easily override the display even for :d, :h, etc.) and % is for
bytes (where such glyphs are not natively representable anyway).


--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Ethan Furman

On 01/19/2014 03:37 AM, Steven D'Aprano wrote:

On Fri, Jan 17, 2014 at 05:51:05PM -0800, Ethan Furman wrote:

On 01/17/2014 05:27 PM, Steven D'Aprano wrote:



Numeric Format Codes


To properly handle int and float subclasses, int(), index(), and float()
will be called on the objects intended for (d, i, u), (b, o, x, X), and
(e, E, f, F, g, G).



-1 on this idea.


I went to add examples to this section of the PEP, and realized I was just describing what Python does anyway.  So it 
doesn't need to be in the PEP.


--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Stephen J. Turnbull
Ethan Furman writes:

  Well, that means that this PEP just further strengthens the notion
  that format is for text (as then a custom numeric type could easily
  override the display even for :d, :h, etc.) and % is for bytes
  (where such glyphs are not natively representable anyway).

This argument is specious.

Alternative numeric characters are just as representable as the ASCII
digits are, and in the same way (by defining a bytes <-> str mapping,
aka codec).  The problem is not that they're non-representable, it's
that they're non-ASCII, and the numeric format codes implicitly
specify the ASCII numerals when in text as well as when in bytes.

There's no technical reason why these features couldn't use EBCDIC or
even UTF-16 nowadays.  It's purely a convention.  But it's a very
useful convention, so it's helpful if Python conforms to it.  (Note
that '{:d}'.format(True) -> '1' works because True *is* an int and so
can be d-formatted in principle.  It's not an exceptional case.  It's
a different issue from what you're talking about here.)
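
A small sketch of the representability point, for Python 3.  The codec
choices (cp500 as an EBCDIC code page, UTF-16-LE, UTF-8) are picked only
for illustration and are not taken from the PEP.  It shows that the same
digit characters map to different bytes under different codecs, that
non-ASCII digits round-trip once a codec is chosen, and that True formats
as 1 simply because bool is a subclass of int:

```python
# ASCII digits are a convention, not a necessity: other codecs assign
# different bytes to the same characters.
ascii_bytes = '123'.encode('ascii')
ebcdic_bytes = '123'.encode('cp500')       # an EBCDIC code page
utf16_bytes = '123'.encode('utf-16-le')    # two bytes per digit

# Non-ASCII digits round-trip through a codec like any other text.
arabic_indic_three = '\u0663'              # ARABIC-INDIC DIGIT THREE
encoded = arabic_indic_three.encode('utf-8')
round_tripped = encoded.decode('utf-8')

# bool is an int subclass, so numeric formatting applies to it directly.
true_as_digit = '{:d}'.format(True)
```

What makes ASCII special here is agreement between programs, not any
technical limit of bytes.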

The problem that EIBTI worries about is that in many places there is a
local convention to use not pure ASCII, but a specific ASCII superset.
This allows them to take advantage of the common convention of using
ASCII for protocol keywords, and at the same time using legacy
facilities for internal processing of text.  This becomes a disadvantage
if and when such programs need to communicate with internationalized
applications.

These PEPs provide a crutch for such crippled software, allowing them
to hobble into the House of Python 3.  That's obvious, so please don't
try to obfuscate it; just declare consenting adults and move on.


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Ethan Furman

On 01/19/2014 06:56 PM, Stephen J. Turnbull wrote:

Ethan Furman writes:


Well, that means that this PEP just further strengthens the notion
that format is for text (as then a custom numeric type could easily
override the display even for :d, :h, etc.) and % is for bytes
(where such glyphs are not natively representable anyway).


This argument is specious.


I don't think so.  I think it's a good argument for the future of Python code.  Mind you, I should probably have said
"% is primarily for bytes", or even "more useful for bytes than for text".  The idea being that true text fun stuff
requires format, while bytes can only use % for easy formatting.




Alternative numeric characters [are] just as representable as the ASCII
digits are, and in the same way (by defining a bytes <-> str mapping,
aka codec).  The problem is not that they're non-representable, it's
that they're non-ASCII, and the numeric format codes implicitly
specify the ASCII numerals when in text as well as when in bytes.


Certainly.  And you can't change that either.  Oh, wait, you can!  Define your 
own!

class LocalNum(int):
    "displays d, i, and u codes in local language"

    def __format__(self, fmt):
        # do the fancy stuff so the characters are not ASCII, but whatever
        # is local here
        ...

Then you could have your text /and/ your numbers be in your own language.  But you can't get that using % unless you 
always call a custom function and use %s.
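
One way the LocalNum idea could be filled in, as a minimal sketch.  The
choice of Arabic-Indic digits and the _LOCAL_DIGITS table are assumptions
made for the example; any digit set would do.  It lets int do the real
formatting and then swaps the glyphs, and it shows that % leaves the
digits in ASCII:

```python
# Map ASCII digits 0-9 to ARABIC-INDIC digits U+0660..U+0669.
# The choice of script is only an example.
_LOCAL_DIGITS = str.maketrans(
    '0123456789',
    ''.join(chr(0x0660 + i) for i in range(10)),
)


class LocalNum(int):
    """int whose format() output uses local digits."""

    def __format__(self, fmt):
        # Let int do the real formatting, then swap the digit glyphs.
        ascii_text = int.__format__(int(self), fmt)
        return ascii_text.translate(_LOCAL_DIGITS)


n = LocalNum(42)
via_format = '{:d}'.format(n)   # goes through __format__
via_percent = '%d' % n          # ignores __format__, stays ASCII
```

Width and padding codes such as '03d' keep working because the
translation happens after int has produced its output.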




(Note
that '{:d}'.format(True) -> '1' works because True *is* an int and so
can be d-formatted in principle.  It's not an exceptional case.  It's
a different issue from what you're talking about here.)


'{:d}'.format(True) is not exceptional, you're right.  But '%d' % True is, and was singled-out in the unicode 
display code to print as '1' and not as 'True'.  (Now all int subclasses behave this way (in 3.4 anyways).)


And I think it's the same issue, or at least closely related.  If you create a custom number type with the intention of 
displaying them in the local lingo, you have to use __format__ because % is hard coded to yield digits that map to ASCII.




These PEPs provide a crutch for such crippled software, allowing them
to hobble into the House of Python 3.


Very picturesque.


That's obvious, so please don't try to obfuscate it; just declare
 consenting adults and move on.


Lots of features can be abused.  That doesn't mean we shouldn't talk about the 
intended use cases and encourage those.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-19 Thread Stephen J. Turnbull
Ethan Furman writes:

   This argument is specious.
  
  I don't think so.  I think it's a good argument for the future of
  Python code.

I agree that restricting bytes '%'-formatting to ASCII is a good idea,
but you should base your arguments on a correct description of what's
going on.  It's not an issue of representability.  It's an issue of
we should support this for ASCII because it's a useful, nearly
universal convention, and we should not support ASCII supersets
because that leads to mojibake.
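
A minimal illustration of the mojibake risk, assuming three common ASCII
supersets (latin-1, cp1251, cp437) chosen only as examples.  The ASCII
part of the data reads the same everywhere, while the single high byte
turns into a different character under each codec:

```python
# One byte string, three different "ASCII supersets", three different texts.
raw = b'caf\xe9'

as_latin1 = raw.decode('latin-1')   # Western European reading
as_cp1251 = raw.decode('cp1251')    # Cyrillic reading
as_cp437 = raw.decode('cp437')      # original IBM PC reading

# The ASCII part survives in every case; only the high byte is garbled.
ascii_prefixes = {text[:3] for text in (as_latin1, as_cp1251, as_cp437)}
```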

  Then you could have your text /and/ your numbers be in your own
  language.

My language uses numerals other than those in the ASCII repertoire in
a rather stylized way.  I can't use __format__ for that, because it
depends on context, anyway.  Most of the time the digits in the ASCII
set are used (especially in tables and the like).  I believe that's
true for all languages nowadays.

  Lots of features can be abused.  That doesn't mean we shouldn't
  talk about the intended use cases and encourage those.

I only objected to claims that issues of representability and what
I can do with __format__ support the preferred use cases, not to
descriptions of the preferred use cases.





Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Antoine Pitrou
On Fri, 17 Jan 2014 08:49:21 -0800
Ethan Furman et...@stoneleaf.us wrote:
 
 PEP: 461

There are formatting issues in the HTML rendering, I think the ReST
code needs a bit of massaging:
http://www.python.org/dev/peps/pep-0461/

 .. note::
 
 Because the str type does not have a __bytes__ method, attempts to
 directly use 'a string' as a bytes interpolation value will raise an
 exception.  To use 'string' values, they must be encoded or otherwise
 transformed into a bytes sequence::

s/'string' values/unicode strings/

Regards

Antoine.




Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Nick Coghlan
On 18 Jan 2014 11:52, Ethan Furman et...@stoneleaf.us wrote:

 On 01/17/2014 05:27 PM, Steven D'Aprano wrote:

 On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:


 Overriding Principles
 =

 In order to avoid the problems of auto-conversion and Unicode
 exceptions that could plague Py2 code, all object checking will
 be done by duck-typing, not by values contained in a Unicode
  representation [3]_.


 I don't understand this paragraph. What does values contained in a
 Unicode representation mean?


 Yeah, that is clunky.  I'm trying to convey the idea that we don't want
errors based on content, i.e. which characters happen to be in a str.



 [...]

 %s is restricted in what it will accept::

- input type supports Py_buffer?
  use it to collect the necessary bytes


 Can you give some examples of what types support Py_buffer? Presumably
 bytes. Anything else?


 Anybody?  Otherwise I'll go spelunking in the code.

bytes, bytearray, memoryview, ctypes arrays, array.array, numpy.ndarray

It may actually be clearer to express this in terms of memoryview for the
benefit of those that aren't familiar with the C API, as that is the
closest equivalent Python level API.  (While there is an open issue regarding
the C only nature of the buffer export API, nobody has volunteered to put
together a PEP and implementation for a Python level follow up to the C
level PEP 3118. The problem is that the original use cases involve C
extensions anyway, so the relevant experts don't have any personal need for
a Python level buffer exporter interface. Instead, it's in the "should be
done for completeness, and would make some of our testing easier, but
doesn't have anyone clamouring for it" bucket.)




- input type is something else?
  use its __bytes__ method; if there isn't one, raise a TypeError


 I think you should explicitly state that this is a new special method,
 and state which built-in types will grow a __bytes__ method (if any).


 It's not new.  I know bytes, str, and numbers /do not/ have __bytes__.

Right, it is already used by bytes to convert arbitrary objects to a binary
representation. The difference with Py_buffer/memoryview is that they
provide access to the raw data without necessarily copying anything.
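
A rough Python-level way to see which objects export a buffer, using
memoryview as the stand-in for Py_buffer.  The helper name
supports_buffer is hypothetical, not an existing API.  The second half
shows that a memoryview shares memory with its exporter rather than
copying it:

```python
import array


def supports_buffer(obj):
    """Return True if obj can export its bytes via the buffer protocol."""
    try:
        memoryview(obj).release()
    except TypeError:
        return False
    return True


candidates = [b'abc', bytearray(b'abc'), memoryview(b'abc'),
              array.array('B', [97, 98, 99]), 'abc', 3.14]
results = [supports_buffer(c) for c in candidates]

# A memoryview shares memory with its exporter instead of copying it.
data = bytearray(b'abc')
view = memoryview(data)
data[0] = ord('x')
first_byte_seen_by_view = view[0]
```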

str and numbers don't implement it as there's no obvious default
interpretation (the b'\x00' * n interpretation of integers is part of the
bytes constructor and now a decision we mostly regret - it should have been
a keyword argument or a separate class method)
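
For readers who have not met __bytes__ before, a short sketch.  The
Packet class is a made-up example type.  It shows the hook being used by
bytes(), the zero-filling integer behaviour mentioned above, and that
str and int define no such hook:

```python
class Packet:
    """Hypothetical type with an obvious binary form."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        # bytes(obj) calls this hook.
        return b'PKT:' + self.payload


from_hook = bytes(Packet(b'hello'))

# The integer case: bytes(n) is n zero bytes, not the digits of n.
zero_filled = bytes(3)

# str and int themselves define no __bytes__ hook.
str_has_hook = hasattr(str, '__bytes__')
int_has_hook = hasattr(int, '__bytes__')
```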




 Unsupported codes
 -

 %r (which calls __repr__), and %a (which calls ascii() on __repr__) are
 not supported.


 +1 on not supporting b'%r' (i.e. I agree with the PEP).

 Why not support b'%a'? That seems to be a strange thing to prohibit.


 I'll admit to being somewhat on the fence about %a.

 It seems there are two possibilities with %a:

   1) have it be ascii(repr(obj))

   2) have it be str(obj).encode('ascii', 'strict')

This gets very close to crossing the line into implicit encoding of text
again. Binary interpolation is being added back for the specific use case
of working with ASCII compatible segments in binary formats, and it's at
best arguable that supporting %a will help with that use case.

However, without it, there may be a greater temptation to inappropriately
define __bytes__ just to support binary interpolation, rather than because
a type truly has an appropriate translation directly to bytes.

By allowing %a, we avoid that temptation. This is also potentially useful
specifically in the case of binary logging formats and as a quick way to
request backslash escaping of non-ASCII characters in text.

Call it +0.5 for allowing %a. I don't expect it to be used heavily, but I
think it will head off a fair bit of potential misuse of __bytes__.

Cheers,
Nick.


Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Ethan Furman

On 01/18/2014 03:40 AM, Antoine Pitrou wrote:

On Fri, 17 Jan 2014 08:49:21 -0800
Ethan Furman et...@stoneleaf.us wrote:


PEP: 461


There are formatting issues in the HTML rendering, I think the ReST
code needs a bit of massaging:
http://www.python.org/dev/peps/pep-0461/


I'm not seeing the problems (could be I don't have enough experience to spot 
them).



.. note::

 Because the str type does not have a __bytes__ method, attempts to
 directly use 'a string' as a bytes interpolation value will raise an
 exception.  To use 'string' values, they must be encoded or otherwise
 transformed into a bytes sequence::


s/'string' values/unicode strings/


Fixed, thanks.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Ethan Furman

On 01/18/2014 05:48 AM, Nick Coghlan wrote:

On 18 Jan 2014 11:52, Ethan Furman wrote:


I'll admit to being somewhat on the fence about %a.

It seems there are two possibilities with %a:

  1) have it be ascii(repr(obj))

  2) have it be str(obj).encode('ascii', 'strict')


This gets very close to crossing the line into implicit encoding of text again. 
Binary interpolation is being added back
for the specific use case of working with ASCII compatible segments in binary 
formats, and it's at best arguable that
supporting %a will help with that use case.


Agreed.



However, without it, there may be a greater temptation to inappropriately 
define __bytes__ just to support binary
interpolation, rather than because a type truly has an appropriate translation 
directly to bytes.


True.



By allowing %a, we avoid that temptation. This is also potentially useful 
specifically in the case of binary logging
formats and as a quick way to request backslash escaping of non-ASCII 
characters in text.

Call it +0.5 for allowing %a. I don't expect it to be used heavily, but I think 
it will head off a fair bit of potential
misuse of __bytes__.


So, if %a is added it would act like:

-
  b'%a' % some_obj
-
  tmp = str(some_obj)
  res = b''
  for ch in tmp:
      if ord(ch) < 256:
          res += bytes([ord(ch)])
      else:
          res += unicode_escape(ch)
-

where 'unicode_escape' would yield something like "\u0440" ?

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Neil Schemenauer
Ethan Furman et...@stoneleaf.us wrote:
 So, if %a is added it would act like:

 -
   b'%a' % some_obj
 -
   tmp = str(some_obj)
   res = b''
   for ch in tmp:
       if ord(ch) < 256:
           res += bytes([ord(ch)])
       else:
           res += unicode_escape(ch)
 -

 where 'unicode_escape' would yield something like "\u0440" ?

My patch on the tracker already implements %a; it's simple.  Just
call PyObject_ASCII() (same as ascii()), then call
PyUnicode_AsLatin1String(s) to convert it to bytes and stick it in.
PyObject_ASCII does not return non-ASCII characters, so no encode error
is possible.  We could call _PyUnicode_AsASCIIString(s, "strict")
instead if we are afraid of non-ASCII bytes coming out of
PyObject_ASCII.

  Neil



Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Neil Schemenauer
Steven D'Aprano st...@pearwood.info wrote:
 To properly handle int and float subclasses, int(), index(), and float()
 will be called on the objects intended for (d, i, u), (b, o, x, X), and
 (e, E, f, F, g, G).


 -1 on this idea.

 This is a rather large violation of the principle of least surprise, and 
 radically different from the behaviour of Python 3 str. In Python 3, 
 '%d' interpolation calls the __str__ method, so if you subclass, you can 
 get the behaviour you want:

 py> class HexInt(int):
 ...     def __str__(self):
 ...         return hex(self)
 ...
 py> "%d" % HexInt(23)
 '0x17'


 which is exactly what we should expect from a subclass.

 You're suggesting that bytes should ignore any custom display 
 implemented by subclasses, and implicitly coerce them to the superclass 
 int. What is the justification for this? You don't define or even 
 describe what you consider "properly handle".

The proposed behavior (at least as I understand it and as I've
implemented in my proposed patch) matches Python 2 str/unicode and
Python 3 str behavior for these codes.  If you want to allow
subclasses to have control or to use duck-typing, you have to use
str and __format__.  I'm okay with the limitation, bytes formatting
can be simple, limited and fast.
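
A quick check of the behaviour described here, assuming Python 3.4 or
later (where the int-subclass fix mentioned elsewhere in this thread has
landed).  HexInt is the class from the quoted example.  %s still honours
the __str__ override, while the numeric code %d formats the underlying
integer value:

```python
class HexInt(int):
    def __str__(self):
        return hex(self)


value = HexInt(23)

via_s = '%s' % value   # %s asks for str(), so the override is used
via_d = '%d' % value   # %d is a numeric code, so the override is ignored
```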

  Neil



Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Ethan Furman

On 01/18/2014 05:21 PM, Neil Schemenauer wrote:

Ethan Furman et...@stoneleaf.us wrote:

So, if %a is added it would act like:

-
  b'%a' % some_obj
-
  tmp = str(some_obj)
  res = b''
  for ch in tmp:
      if ord(ch) < 256:
          res += bytes([ord(ch)])
      else:
          res += unicode_escape(ch)
-

where 'unicode_escape' would yield something like "\u0440" ?


My patch on the tracker already implements %a, it's simple.


Before one implements a patch it is good to know the specifications.


Just call PyObject_ASCII() (same as ascii()) then call
PyUnicode_AsLatin1String(s) to convert it to bytes and stick it in.
PyObject_ASCII does not return non-ASCII characters, so no encode error
is possible.  We could call _PyUnicode_AsASCIIString(s, "strict")
instead if we are afraid of non-ASCII bytes coming out of
PyObject_ASCII.


I appreciate that this is the behavior you want, but I'm not sure it's the 
behavior Nick was describing.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Ethan Furman

On 01/18/2014 02:01 PM, Ethan Furman wrote:


where 'unicode_escape' would yield something like \u0440 ?


Just to be clear, \u0440 is the six bytes b'\\', b'u', b'0', b'4', b'4', b'0'.
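
The same six bytes can be produced with the standard backslashreplace
error handler.  Using it here is an assumption about how the hypothetical
unicode_escape helper might behave, not something the PEP specifies:

```python
# Encode one non-ASCII character, escaping what ASCII cannot represent.
escaped = '\u0440'.encode('ascii', errors='backslashreplace')

# Split the result into single-byte bytes objects for inspection.
as_single_bytes = [bytes([b]) for b in escaped]
```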

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-18 Thread Nick Coghlan
On 19 January 2014 12:34, Ethan Furman et...@stoneleaf.us wrote:
 On 01/18/2014 05:21 PM, Neil Schemenauer wrote:

 Ethan Furman et...@stoneleaf.us wrote:

 So, if %a is added it would act like:

 -
   b'%a' % some_obj
 -
   tmp = str(some_obj)
   res = b''
   for ch in tmp:
       if ord(ch) < 256:
           res += bytes([ord(ch)])
       else:
           res += unicode_escape(ch)
 -

 where 'unicode_escape' would yield something like "\u0440" ?


 My patch on the tracker already implements %a, it's simple.


 Before one implements a patch it is good to know the specifications.

A very sound engineering principle :)

Neil has the resulting semantics right for what I had in mind, but the
faster path to bytes (rather than going through the ASCII builtin) is
to do the C level equivalent of:

    repr(obj).encode("ascii", errors="backslashreplace")

That's essentially what the ascii() builtin does, but that operates
entirely in the text domain, so (as Neil found) you still need a
separate encode step at the end.

    >>> ascii("è").encode("ascii")
    b"'\\xe8'"
    >>> repr("è").encode("ascii", errors="backslashreplace")
    b"'\\xe8'"

b"%a" % "è" should produce the same result as the two examples above.
(Code points higher up in the Unicode code space would produce \u and
\U escapes as needed, which should already be handled properly by the
backslashreplace error handler)
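
To check the claim about higher code points, here is a pure-Python model
of the proposed behaviour.  The function name percent_a is hypothetical;
it only mirrors the repr-plus-backslashreplace definition given above and
compares it with going through the ascii() builtin:

```python
def percent_a(obj):
    """Hypothetical pure-Python model of the proposed b'%a' % obj."""
    return repr(obj).encode('ascii', errors='backslashreplace')


# One sample each for \x, \u and \U escapes, plus plain ASCII text.
samples = ['\xe8', '\u0440', '\U0001f600', 'plain']
modelled = [percent_a(s) for s in samples]
via_ascii_builtin = [ascii(s).encode('ascii') for s in samples]
```

Both routes give the same bytes for these samples, which is what the two
definitions lead one to expect.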

One nice thing about this definition is that in the specific case of
text input, the transformation can always be reversed by decoding as
ASCII and then applying ast.literal_eval():

    >>> import ast
    >>> ast.literal_eval(repr("è").encode("ascii",
    ... "backslashreplace").decode("ascii"))
    'è'

(Please don't use eval() to reverse a transformation like this, as
doing so not only makes security engineers cry, it's also likely to
make your code vulnerable to all kinds of interesting attacks)

As noted earlier in the thread, one key purpose of including this
feature is to reduce the likelihood of people inappropriately adding
__bytes__ implementations for %s compatibility that look like:

    def __bytes__(self):
        # This is unlikely to be a good idea!
        return repr(self).encode("ascii", errors="backslashreplace")

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Brett Cannon
On Fri, Jan 17, 2014 at 11:49 AM, Ethan Furman et...@stoneleaf.us wrote:

 Here's the text for your reading pleasure.  I'll commit the PEP after I
 add some markup.

 Major change:

   - dropped `format` support, just using %-interpolation

 Coming soon:

   - Rationale section  ;)

 
 
 PEP: 461
 Title: Adding % formatting to bytes
 Version: $Revision$
 Last-Modified: $Date$
 Author: Ethan Furman et...@stoneleaf.us
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 2014-01-13
 Python-Version: 3.5
 Post-History: 2014-01-14, 2014-01-15, 2014-01-17
 Resolution:


 Abstract
 

 This PEP proposes adding % formatting operations similar to Python 2's str
 type
 to bytes [1]_ [2]_.


 Overriding Principles
 =

 In order to avoid the problems of auto-conversion and Unicode exceptions
 that
 could plague Py2 code, all object checking will be done by duck-typing,
 not by


Don't abbreviate; spell out Python 2.


 values contained in a Unicode representation [3]_.


 Proposed semantics for bytes formatting
 ===

 %-interpolation
 ---

 All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
 will be supported, and will work as they do for str, including the
 padding, justification and other related modifiers.

 Example::

 >>> b'%4x' % 10
 b'   a'

 >>> '%#4x' % 10
 ' 0xa'

 >>> '%04X' % 10
 '000A'

 %c will insert a single byte, either from an int in range(256), or from
 a bytes argument of length 1, not from a str.

 Example:

 >>> b'%c' % 48
 b'0'

 >>> b'%c' % b'a'
 b'a'

 %s is restricted in what it will accept::

   - input type supports Py_buffer?
 use it to collect the necessary bytes

   - input type is something else?
 use its __bytes__ method; if there isn't one, raise a TypeError

 Examples:

 >>> b'%s' % b'abc'
 b'abc'

 >>> b'%s' % 3.14
 Traceback (most recent call last):
 ...
 TypeError: 3.14 has no __bytes__ method

 >>> b'%s' % 'hello world!'
 Traceback (most recent call last):
 ...
 TypeError: 'hello world!' has no __bytes__ method, perhaps you need to
 encode it?

 .. note::

Because the str type does not have a __bytes__ method, attempts to
directly use 'a string' as a bytes interpolation value will raise an
exception.  To use 'string' values, they must be encoded or otherwise
transformed into a bytes sequence::

   'a string'.encode('latin-1')


 Numeric Format Codes
 

 To properly handle int and float subclasses, int(), index(), and float()
 will be called on the objects intended for (d, i, u), (b, o, x, X), and
 (e, E, f, F, g, G).


 Unsupported codes
 -

 %r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
 supported.


 Proposed variations
 ===

 It was suggested to let %s accept numbers, but since numbers have their own
 format codes this idea was discarded.

 It has been suggested to use %b for bytes instead of %s.

   - Rejected as %b does not exist in Python 2.x %-interpolation, which is
 why we are using %s.

 It has been proposed to automatically use .encode('ascii','strict') for str
 arguments to %s.

   - Rejected as this would lead to intermittent failures.  Better to have
 the
 operation always fail so the trouble-spot can be correctly fixed.

 It has been proposed to have %s return the ascii-encoded repr when the
 value
 is a str  (b'%s' % 'abc'  -- b'abc').

   - Rejected as this would lead to hard to debug failures far from the
 problem
 site.  Better to have the operation always fail so the trouble-spot
 can be
 easily fixed.

 Originally this PEP also proposed adding format style formatting, but it
 was


format-style


 decided that format and its related machinery were all strictly text (aka
 str)
 based, and it was dropped.


that the method and



 Various new special methods were proposed, such as __ascii__,
 __format_bytes__,
 etc.; such methods are not needed at this time, but can be visited again
 later
 if real-world use shows deficiencies with this solution.


 Footnotes
 =

 .. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
 .. [2] neither string.Template, format, nor str.format are under
 consideration.
 .. [3] %c is not an exception as neither of its possible arguments are
 unicode.


+1 from me


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Neil Schemenauer
Ethan Furman et...@stoneleaf.us wrote:
 Overriding Principles
=

 In order to avoid the problems of auto-conversion and Unicode exceptions that
 could plague Py2 code, all object checking will be done by duck-typing, not by
 values contained in a Unicode representation [3]_.

I think a longer Rationale section is justified given the amount of
discussion this feature generated.  Here is a revised version of
what I already suggested:

Rationale


A disruptive but useful change introduced in Python 3.0 was the
clean separation of byte strings (i.e. the bytes object) from
character strings (i.e. the str object).  The benefit is that
character encodings must be explicitly specified and the risk of
corrupting character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose.  For example, programs
that deal with network protocols often manipulate ASCII encoded
strings or assemble byte strings from fragments.  Since the
bytes type does not support string formatting, extra encoding
to and decoding from the str type is often required.
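
To make the verbosity concrete, here is a minimal sketch of assembling an
ASCII protocol header on Python 3 without bytes formatting.  The function
name and header layout are invented for illustration; only the
decode/format/encode round trip is the point.

```python
def build_header(content_length, content_type):
    # content_type is assumed to arrive as bytes (e.g. read off the wire),
    # so it must be decoded before str formatting can use it ...
    text = 'Content-Length: %d\r\nContent-Type: %s\r\n' % (
        content_length, content_type.decode('ascii'))
    # ... and the finished text must be encoded again before sending.
    return text.encode('ascii')


header = build_header(42, b'text/plain')
```

With bytes %-interpolation the decode and encode steps would disappear and
the template itself would be a bytes literal.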

For simplicity and convenience it is desirable to introduce
formatting methods to bytes that allow formatting of
ASCII-encoded character data.  This change would blur the clean
separation of byte strings and character strings.  However, it
is felt that the practical benefits outweigh the purity costs.
The implicit assumption of ASCII-encoding would be limited to
formatting methods.

One source of many problems with the Python 2 Unicode
implementation is the implicit coercion of Unicode character
strings into byte strings using the ascii codec.  If the
character string contains only ASCII characters, all is well.
However, if the string contains a non-ASCII character then
coercion causes an exception.

The combination of implicit coercion and value dependent
failures has proven to be a recipe for hard to debug errors.  A
program may seem to work correctly when tested (e.g. string
input that happened to be ASCII only) but later would fail,
often with a traceback far from the source of the real error.
The formatting methods for bytes() should avoid this problem by
not implicitly encoding data that might fail based on the
content of the data.
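
The value-dependent failure mode can be sketched in Python 3 by making the
coercion explicit.  The helper below is hypothetical and only stands in for
what Python 2 did implicitly; the same call succeeds or fails depending
purely on which characters the input contains.

```python
def to_wire(text):
    # Explicit stand-in for the implicit ASCII coercion of Python 2.
    return b'NAME ' + text.encode('ascii')


ok = to_wire('Guido')            # ASCII-only input: works in testing

try:
    to_wire('Gu\u00efdo')        # one non-ASCII character: fails later
    failed = False
except UnicodeEncodeError:
    failed = True
```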

I think we can back off on the duck-typing idea.  It's a good Python
principle but I now realize existing %-interpolation doesn't do it.
The numeric format codes coerce to long or float.  


 Unsupported codes
 -

 %r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
 supported.

I think %a should be supported.  I imagine it would be quite useful
when dumping debugging output to a bytes stream.  It's easy to
implement and I think the danger for abuse or surprises is small.
It would also help when translating Python 2 code: change %r to %a.

 Proposed variations
===

 It was suggested to let %s accept numbers, but since numbers have their own
 format codes this idea was discarded.

 It has been suggested to use %b for bytes instead of %s.

- Rejected as %b does not exist in Python 2.x %-interpolation, which is
  why we are using %s.

I think we should use %b instead of %s.  In that case, I'm fine with
%b not accepting numbers.  Using %b clearly indicates we are
inserting arbitrary bytes.  It also proves a useful code review step
when translating from Python 2.x.

To ease porting from Python 2.x code, I propose adding a
command-line option that enables %s and %r format codes for bytes
%-interpolation.  I'm going to write a draft PEP (it would depend on
PEP 461 being implemented).

 Originally this PEP also proposed adding format style formatting, but it was
 decided that format and its related machinery were all strictly text (aka str)
 based, and it was dropped.

I would also argue that we should limit the scope of this PEP.  It
has already generated a massive amount of discussion.  Nothing
precludes us from adding support for format() to bytes in the
future, if we decide we want it and how it should work.

 Various new special methods were proposed, such as __ascii__,
 __format_bytes___, etc.; such methods are not needed at this time,
 but can be visited again later if real-world use shows
 deficiencies with this solution.

I agree, new special methods are not needed at this time since
numeric codes do use duck-typing and __bytes__ already exists.

  Neil



Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Mark Lawrence

On 17/01/2014 17:46, Neil Schemenauer wrote:


I think we should use %b instead of %s.  In that case, I'm fine with
%b not accepting numbers.  Using %b clearly indicates we are
inserting arbitrary bytes.  It also proves a useful code review step
when translating from Python 2.x.



Using %b could cause problems in the future as b is used in new style 
formatting to mean output numbers in binary, so %B seems to me the 
obvious choice as it's also unused.
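
For reference, a short sketch of the existing meaning of b in new-style
formatting, and of the fact that str %-interpolation has no %b code at all
(the variable names are only for illustration):

```python
binary_text = format(5, 'b')          # new-style: b means binary digits
same_via_method = '{:b}'.format(5)

try:
    '%b' % 5                          # old-style str formatting has no %b
    percent_b_supported = True
except ValueError:
    percent_b_supported = False
```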


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Larry Hastings


On 01/17/2014 09:46 AM, Neil Schemenauer wrote:

 Rational
 


Rationale.  Rational is an adjective, Rationale is a noun.

Pedantically yours,


//arry/


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Glenn Linderman

On 1/17/2014 8:49 AM, Ethan Furman wrote:

%s is restricted in what it will accept::

  - input type supports Py_buffer?
    use it to collect the necessary bytes

  - input type is something else?
    use its __bytes__ method; if there isn't one, raise a TypeError

Examples:

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: 3.14 has no __bytes__ method

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it?


If you produce a helpful error message for str (re: encoding), might it 
not be appropriate to produce a helpful error message for builtin number 
types (", perhaps you need a numeric format code?")?
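
The acceptance rules for %s can be exercised with a small sketch, assuming
Python 3.5 or later (where bytes %-interpolation is available).  The Packet
class and the try_format helper are hypothetical, and the exact wording of
the error messages is deliberately not checked, since it differs from the
draft text quoted above.

```python
class Packet:
    """Hypothetical type that knows its own wire form."""

    def __init__(self, payload):
        self.payload = payload

    def __bytes__(self):
        return b'<' + self.payload + b'>'


def try_format(value):
    """Return the formatted bytes, or the TypeError class on rejection."""
    try:
        return b'%s' % value
    except TypeError:
        return TypeError


from_bytes = try_format(b'abc')            # supports the buffer protocol
from_packet = try_format(Packet(b'abc'))   # falls back to __bytes__
from_float = try_format(3.14)              # neither: rejected
from_str = try_format('hello world!')      # str is rejected, never encoded
```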


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Ethan Furman

On 01/17/2014 11:40 AM, Glenn Linderman wrote:

On 1/17/2014 8:49 AM, Ethan Furman wrote:


    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: 3.14 has no __bytes__ method


If you produce a helpful error message for str (re: encoding), might it not be
appropriate to produce a helpful error message for builtin number types
(", perhaps you need a numeric format code?")?


Good point!  Done.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Neil Schemenauer
Mark Lawrence breamore...@yahoo.co.uk wrote:
 Using %b could cause problems in the future as b is used in new style 
 formatting to mean output numbers in binary, so %B seems to me the 
 obvious choice as it's also unused.

After updating my patch, I've decided that %s works better.  My
patch implements PEP 461 as proposed with the following additional
features:

- add %a format code, calls PyObject_ASCII on the argument.  I
  see no reason not to add it as a useful debugging feature.

- add -2 command-line option.  When enabled: %s will fall back
  to calling PyObject_Str() after first trying the buffer API
  and __bytes__.  The value will be encoded using strict ASCII
  encoding.  Also, %r is enabled as an alias for %a.

The patch is v4, bugs.python.org/issue20284, still needs more review
and testing.

  Neil



Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Ethan Furman

On 01/17/2014 08:53 AM, Brett Cannon wrote:


Don't abbreviate; spell out Python 2.


Fixed.



Originally this PEP also proposed adding format style formatting, but it was


format-style


Fixed.



decided that format and its related machinery were all strictly text (aka str)
based, and it was dropped.

that the method and


Fixed.

Thanks.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Steven D'Aprano
On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:

 Overriding Principles
 =
 
 In order to avoid the problems of auto-conversion and Unicode exceptions that
 could plague Py2 code, all object checking will be done by duck-typing, not by
 values contained in a Unicode representation [3]_.

I don't understand this paragraph. What does "values contained in a 
Unicode representation" mean?


[...]
 %s is restricted in what it will accept::
 
   - input type supports Py_buffer?
 use it to collect the necessary bytes

Can you give some examples of what types support Py_buffer? Presumably 
bytes. Anything else?



   - input type is something else?
 use its __bytes__ method; if there isn't one, raise a TypeError

I think you should explicitly state that this is a new special method, 
and state which built-in types will grow a __bytes__ method (if any).


 Numeric Format Codes
 
 
 To properly handle int and float subclasses, int(), index(), and float()
 will be called on the objects intended for (d, i, u), (b, o, x, X), and
 (e, E, f, F, g, G).


-1 on this idea.

This is a rather large violation of the principle of least surprise, and 
radically different from the behaviour of Python 3 str. In Python 3, 
'%d' interpolation calls the __str__ method, so if you subclass, you can 
get the behaviour you want:

py> class HexInt(int):
...     def __str__(self):
...         return hex(self)
...
py> "%d" % HexInt(23)
'0x17'


which is exactly what we should expect from a subclass.

You're suggesting that bytes should ignore any custom display 
implemented by subclasses, and implicitly coerce them to the superclass 
int. What is the justification for this? You don't define or even 
describe what you consider "properly handle".



 Unsupported codes
 -
 
 %r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
 supported.

+1 on not supporting b'%r' (i.e. I agree with the PEP).

Why not support b'%a'? That seems to be a strange thing to prohibit.



Everything else, well done and thank you.


-- 
Steven


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Ethan Furman

On 01/17/2014 05:27 PM, Steven D'Aprano wrote:

On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:


Overriding Principles
=

In order to avoid the problems of auto-conversion and Unicode
exceptions that could plague Py2 code, all object checking will
be done by duck-typing, not by values contained in a Unicode
 representation [3]_.


I don't understand this paragraph. What does "values contained in a
Unicode representation" mean?


Yeah, that is clunky.  I'm trying to convey the idea that we don't want errors
based on content, i.e. which characters happen to be in a str.




[...]

%s is restricted in what it will accept::

   - input type supports Py_buffer?
 use it to collect the necessary bytes


Can you give some examples of what types support Py_buffer? Presumably
bytes. Anything else?


Anybody?  Otherwise I'll go spelunking in the code.



   - input type is something else?
 use its __bytes__ method; if there isn't one, raise a TypeError


I think you should explicitly state that this is a new special method,
and state which built-in types will grow a __bytes__ method (if any).


It's not new.  I know bytes, str, and numbers /do not/ have __bytes__.



Numeric Format Codes


To properly handle int and float subclasses, int(), index(), and float()
will be called on the objects intended for (d, i, u), (b, o, x, X), and
(e, E, f, F, g, G).



-1 on this idea.

This is a rather large violation of the principle of least surprise, and
radically different from the behaviour of Python 3 str. In Python 3,
'%d' interpolation calls the __str__ method, so if you subclass, you can
get the behaviour you want:


Did you read the bug reports I linked to?  This behavior (which is a bug) has 
already been fixed for Python 3.4.

As a quick thought experiment, why does "%d" % True return '1'?
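
A small sketch of the behaviour being described, assuming Python 3.4 or
later: numeric codes use the numeric value and ignore a subclass's __str__,
while %s still honours it.  HexInt is the class from the quoted example; the
variable names are only for illustration.

```python
class HexInt(int):
    def __str__(self):
        return hex(self)


via_d = '%d' % HexInt(23)   # numeric code: formats the integer value
via_s = '%s' % HexInt(23)   # string code: calls __str__

bool_d = '%d' % True        # bool is an int subclass, so this is numeric
bool_s = '%s' % True
```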



Unsupported codes
-

%r (which calls __repr__), and %a (which calls ascii() on __repr__) are not
supported.


+1 on not supporting b'%r' (i.e. I agree with the PEP).

Why not support b'%a'? That seems to be a strange thing to prohibit.


I'll admit to being somewhat on the fence about %a.

It seems there are two possibilities with %a:

  1) have it be ascii(repr(obj))

  2) have it be str(obj).encode('ascii', 'strict')

(1) seems only useful for debugging, but even then not very -- if you switch
from %s to %a you'll no longer see the bytes output (although you would get
the name of the object, which could be handy);


(2) is (slightly) blurring the lines between text and encoded-ascii;  I would
rather see %s % text.encode('ascii', 'strict')


So we have two possibilities, and both can be useful; I don't know which is
more useful or even more logical.

So I guess I'm still open to arguments.  :)
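
The difference between the two options can be sketched in plain Python 3,
taking option (1) to mean ascii(obj) encoded to bytes.  The sample text is
arbitrary and chosen only because it contains one non-ASCII character.

```python
text = 'caf\u00e9'

# Option (1): escape anything non-ASCII, so the result never fails.
option_1 = ascii(text).encode('ascii')

# Option (2): strict encoding, which fails depending on the content.
try:
    option_2 = str(text).encode('ascii', 'strict')
except UnicodeEncodeError:
    option_2 = None
```

Option (1) always produces output but includes the quotes and escapes of a
repr; option (2) produces the raw text but reintroduces content-dependent
errors.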



Everything else, well done and thank you.


You're welcome!  Thank you to everyone who participated.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Chris Angelico
On Sat, Jan 18, 2014 at 12:51 PM, Ethan Furman et...@stoneleaf.us wrote:
 It seems there are two possibilities with %a:

   1) have it be ascii(repr(obj))

Wouldn't that be redundant? ascii() is already repr()-like.
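
A quick sketch of the redundancy: ascii() already returns a repr-style
string, so wrapping repr() in it quotes the value twice.

```python
plain = ascii('abc')          # already repr-like
doubled = ascii(repr('abc'))  # the repr of a repr: quoted twice
```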

ChrisA


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Ethan Furman

On 01/17/2014 06:03 PM, Chris Angelico wrote:

On Sat, Jan 18, 2014 at 12:51 PM, Ethan Furman et...@stoneleaf.us wrote:

It seems there are two possibilities with %a:

   1) have it be ascii(repr(obj))


Wouldn't that be redundant? ascii() is already repr()-like.


Good point.

--
~Ethan~


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Nick Coghlan
+1 on the technical spec from me. The rationale needs work, but you already
know that :)

For API consistency, I suggest explicitly noting that bytearray will also
support the operation, generating a bytearray result.
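
A minimal sketch of that consistency, assuming Python 3.5 or later: the
type of the template decides the type of the result.

```python
from_bytes = b'%d items' % 3
from_bytearray = bytearray(b'%d items') % 3
```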

I also suggest introducing the phrase "ASCII compatible segments in binary
formats" somewhere, as the intended use case for *all* the ASCII-assuming
methods on the bytes and bytearray types, including this new one.

Cheers,
Nick.


Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Stephen J. Turnbull
Nick Coghlan writes:

  I also suggest introducing the phrase "ASCII compatible
  segments in binary formats" somewhere,

What is the use case for "ASCII *compatible* segments"?  Can't you
just say "ASCII segments"?

I'm not sure exactly what PEP 461 says at this point, but most of the
discussion prescribes .encode('ascii', errors='strict') for implicit
interpolation of str.  "ASCII compatible" is a term that people
consistently interpret to include the bytes representation of their
data.  Although the actual rule isn't terribly complex (bytes 0-127
must always have ASCII coded character semantics[1]), AFAIK there are
no use cases for that other than encoded text, i.e., interpolating str,
and nobody wants that done leniently in Python 3.

Footnotes: 
[1]  Otherwise you need to analyze the content of data to determine
whether ASCII-compatible operations are safe to perform.  Of course
that's possible but it was repeatedly rejected in favor of duck-typing.



Re: [Python-Dev] PEP 461 Final?

2014-01-17 Thread Stefan Behnel
Steven D'Aprano, 18.01.2014 02:27:
 On Fri, Jan 17, 2014 at 08:49:21AM -0800, Ethan Furman wrote:
 %s is restricted in what it will accept::

   - input type supports Py_buffer?
 use it to collect the necessary bytes
 
 Can you give some examples of what types support Py_buffer? Presumably 
 bytes. Anything else?

Lots of things: bytes, bytearray, memoryview, array.array, NumPy arrays,
just to name a few.

Basically anything that wants itself to be representable as a chunk of
memory with metadata. It's a very common thing in the Big Data department
(although many people wouldn't know that they're actually heavy users of
this protocol because they just use NumPy and/or Cython and don't look
under the hood).
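
One way to check this from pure Python is to ask memoryview() for a view:
it succeeds only for objects that export a buffer.  The helper name is
hypothetical, and NumPy is left out so the sketch needs only the standard
library.

```python
import array


def supports_buffer(obj):
    """Return True if obj exports the buffer protocol."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


candidates = [b'abc', bytearray(b'abc'), memoryview(b'abc'),
              array.array('B', [97, 98, 99])]
results = [supports_buffer(c) for c in candidates]

str_result = supports_buffer('abc')   # str does not export a buffer
int_result = supports_buffer(42)      # neither do numbers

# The exported memory can be collected as bytes, as %s is meant to do.
collected = bytes(array.array('B', [97, 98, 99]))
```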

Stefan

