[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-11 Thread Bengt Richter
On Fri, 10 Feb 2006 21:35:26 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote:

>> On Sat, 11 Feb 2006 05:08:09 + (UTC), Neil Schemenauer <[EMAIL 
>> PROTECTED]> > >The backwards compatibility problems *seem* to be relatively 
>> minor.
>> >I only found one instance of breakage in the standard library.  Note
>> >that my patch does not change PyObject_Str(); that would break
>> >massive amounts of code.  Instead, I introduce a new function:
>> >PyString_New().  I'm not crazy about the name but I couldn't think
>> >of anything better.
>
>On 2/10/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
>> Should this not be coordinated with PEP 332?
>
>Probably.. But that PEP is rather incomplete. Wanna work on fixing that?
>
I'd be glad to add my thoughts, but first of course it's Skip's PEP,
and Martin casts a long shadow when it comes to character coding issues
that I suspect will have to be considered.

(E.g., if there is a b'...' literal for bytes, the actual characters of
the source code itself that the literal is being expressed in could be ascii
or latin-1 or utf-8 or utf16le a la Microsoft, etc. UIAM, I read that the source
is at least temporarily normalized to Unicode, and then re-encoded (except now
for string literals?) per coding cookie or other encoding inference. (I may be
out of date, gotta catch up).

If one way or the other a string literal is in Unicode, then presumably so is
a byte string b'...' literal -- i.e. internally u"b'...'" just before
being turned into bytes.

Should that then be an internal straight u"b'...'".encode('byte') with default 
ascii + escapes
for non-ascii and non-printables, to define the full 8 bits without encoding 
error?
Should unicode be encodable into byte via a specific encoding? E.g., 
u'abc'.encode('byte','latin1'),
to distinguish producing a mutable byte string vs an immutable str type as with 
u'abc'.encode('latin1').
(but how does this play with str being able to produce unicode? And when do 
these changes happen?)
I guess I'm getting ahead of myself ;-)

So I would first ask Skip what he'd like to do, and Martin for some hints on 
reading, to avoid
going down paths he already knows lead to brick walls ;-) And I need to think 
more about PEP 349.

I would propose to do the reading they suggest, and edit up a new version of 
pep-0332.txt
that anyone could then improve further. I don't know about an early deadline. I 
don't want
to over-commit, as time and energies vary. OTOH, as you've noticed, I could be 
spending my
time more effectively ;-)

I changed the thread title, and will wait for some signs from you, Skip, 
Martin, Neil, and I don't
know who else might be interested...

Regards,
Bengt Richter

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Jim Jewett
On 2/14/06, Neil Schemenauer  wrote:
> People could spell it bytes(s.encode('latin-1')) in order to make it
> work in 2.X.

Guido wrote:
> At the cost of an extra copying step.

That sounds like an implementation issue.  If it is important
enough to matter, then why not just add some smarts to the
bytes constructor?

If the argument is a str, and the constructor owns the only
reference, then go ahead and use the argument's own
underlying array; the string itself will be deallocated when
(or before) the constructor returns, so no one else can use
it expecting an immutable.

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
One recommendation: for starters, I'd much rather see the bytes type
standardized without a literal notation. There should be are lots of
ways to create bytes objects from string objects, with specific
explicit encodings, and those should suffice, at least initially.

I also wonder if having a b"..." literal would just add more confusion
-- bytes are not characters, but b"..." makes it appear as if they
are.

--Guido

On 2/11/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
> On Fri, 10 Feb 2006 21:35:26 -0800, Guido van Rossum <[EMAIL PROTECTED]> 
> wrote:
>
> >> On Sat, 11 Feb 2006 05:08:09 + (UTC), Neil Schemenauer <[EMAIL 
> >> PROTECTED]> > >The backwards compatibility problems *seem* to be 
> >> relatively minor.
> >> >I only found one instance of breakage in the standard library.  Note
> >> >that my patch does not change PyObject_Str(); that would break
> >> >massive amounts of code.  Instead, I introduce a new function:
> >> >PyString_New().  I'm not crazy about the name but I couldn't think
> >> >of anything better.
> >
> >On 2/10/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
> >> Should this not be coordinated with PEP 332?
> >
> >Probably.. But that PEP is rather incomplete. Wanna work on fixing that?
> >
> I'd be glad to add my thoughts, but first of course it's Skip's PEP,
> and Martin casts a long shadow when it comes to character coding issues
> that I suspect will have to be considered.
>
> (E.g., if there is a b'...' literal for bytes, the actual characters of
> the source code itself that the literal is being expressed in could be ascii
> or latin-1 or utf-8 or utf16le a la Microsoft, etc. UIAM, I read that the 
> source
> is at least temporarily normalized to Unicode, and then re-encoded (except now
> for string literals?) per coding cookie or other encoding inference. (I may be
> out of date, gotta catch up).
>
> If one way or the other a string literal is in Unicode, then presumably so is
> a byte string b'...' literal -- i.e. internally u"b'...'" just before
> being turned into bytes.
>
> Should that then be an internal straight u"b'...'".encode('byte') with 
> default ascii + escapes
> for non-ascii and non-printables, to define the full 8 bits without encoding 
> error?
> Should unicode be encodable into byte via a specific encoding? E.g., 
> u'abc'.encode('byte','latin1'),
> to distinguish producing a mutable byte string vs an immutable str type as 
> with u'abc'.encode('latin1').
> (but how does this play with str being able to produce unicode? And when do 
> these changes happen?)
> I guess I'm getting ahead of myself ;-)
>
> So I would first ask Skip what he'd like to do, and Martin for some hints on 
> reading, to avoid
> going down paths he already knows lead to brick walls ;-) And I need to think 
> more about PEP 349.
>
> I would propose to do the reading they suggest, and edit up a new version of 
> pep-0332.txt
> that anyone could then improve further. I don't know about an early deadline. 
> I don't want
> to over-commit, as time and energies vary. OTOH, as you've noticed, I could 
> be spending my
> time more effectively ;-)
>
> I changed the thread title, and will wait for some signs from you, Skip, 
> Martin, Neil, and I don't
> know who else might be interested...
>
> Regards,
> Bengt Richter
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> http://mail.python.org/mailman/options/python-dev/guido%40python.org
>


--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread M.-A. Lemburg
Guido van Rossum wrote:
> One recommendation: for starters, I'd much rather see the bytes type
> standardized without a literal notation. There should be are lots of
> ways to create bytes objects from string objects, with specific
> explicit encodings, and those should suffice, at least initially.
> 
> I also wonder if having a b"..." literal would just add more confusion
> -- bytes are not characters, but b"..." makes it appear as if they
> are.

Agreed.

Given that we have a source code encoding which would need
to be honored, b"..." doesn't really make all that much sense
(unless you always use hex escapes).

Note that if we drop the string type, all codecs which currently
return strings will have to return bytes. This gives you a pretty
exhaustive way of defining your binary literals in Python :-)

Here's one:

data = "abc".encode("latin-1")

To simplify things we might want to have

bytes("abc")

do the above encoding per default.

> --Guido
> 
> On 2/11/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
>> On Fri, 10 Feb 2006 21:35:26 -0800, Guido van Rossum <[EMAIL PROTECTED]> 
>> wrote:
>>
 On Sat, 11 Feb 2006 05:08:09 + (UTC), Neil Schemenauer <[EMAIL 
 PROTECTED]> > >The backwards compatibility problems *seem* to be 
 relatively minor.
> I only found one instance of breakage in the standard library.  Note
> that my patch does not change PyObject_Str(); that would break
> massive amounts of code.  Instead, I introduce a new function:
> PyString_New().  I'm not crazy about the name but I couldn't think
> of anything better.
>>> On 2/10/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
 Should this not be coordinated with PEP 332?
>>> Probably.. But that PEP is rather incomplete. Wanna work on fixing that?
>>>
>> I'd be glad to add my thoughts, but first of course it's Skip's PEP,
>> and Martin casts a long shadow when it comes to character coding issues
>> that I suspect will have to be considered.
>>
>> (E.g., if there is a b'...' literal for bytes, the actual characters of
>> the source code itself that the literal is being expressed in could be ascii
>> or latin-1 or utf-8 or utf16le a la Microsoft, etc. UIAM, I read that the 
>> source
>> is at least temporarily normalized to Unicode, and then re-encoded (except 
>> now
>> for string literals?) per coding cookie or other encoding inference. (I may 
>> be
>> out of date, gotta catch up).
>>
>> If one way or the other a string literal is in Unicode, then presumably so is
>> a byte string b'...' literal -- i.e. internally u"b'...'" just before
>> being turned into bytes.
>>
>> Should that then be an internal straight u"b'...'".encode('byte') with 
>> default ascii + escapes
>> for non-ascii and non-printables, to define the full 8 bits without encoding 
>> error?
>> Should unicode be encodable into byte via a specific encoding? E.g., 
>> u'abc'.encode('byte','latin1'),
>> to distinguish producing a mutable byte string vs an immutable str type as 
>> with u'abc'.encode('latin1').
>> (but how does this play with str being able to produce unicode? And when do 
>> these changes happen?)
>> I guess I'm getting ahead of myself ;-)
>>
>> So I would first ask Skip what he'd like to do, and Martin for some hints on 
>> reading, to avoid
>> going down paths he already knows lead to brick walls ;-) And I need to 
>> think more about PEP 349.
>>
>> I would propose to do the reading they suggest, and edit up a new version of 
>> pep-0332.txt
>> that anyone could then improve further. I don't know about an early 
>> deadline. I don't want
>> to over-commit, as time and energies vary. OTOH, as you've noticed, I could 
>> be spending my
>> time more effectively ;-)
>>
>> I changed the thread title, and will wait for some signs from you, Skip, 
>> Martin, Neil, and I don't
>> know who else might be interested...
>>
>> Regards,
>> Bengt Richter
>>
>> ___
>> Python-Dev mailing list
>> Python-Dev@python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe: 
>> http://mail.python.org/mailman/options/python-dev/guido%40python.org
>>
> 
> 
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> http://mail.python.org/mailman/options/python-dev/mal%40egenix.com

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 13 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-D

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Phillip J. Eby
At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
>One recommendation: for starters, I'd much rather see the bytes type
>standardized without a literal notation. There should be are lots of
>ways to create bytes objects from string objects, with specific
>explicit encodings, and those should suffice, at least initially.
>
>I also wonder if having a b"..." literal would just add more confusion
>-- bytes are not characters, but b"..." makes it appear as if they
>are.

Why not just have the constructor be:

 bytes(initializer [,encoding])

Where initializer must be either an iterable of suitable integers, or a 
unicode/string object.  If the latter (i.e., it's a basestring), the 
encoding argument would then be required.  Then, there's no need for 
special codec support for the bytes type, since you call bytes on the thing 
to be encoded.  And of course, no need for a 'b' literal.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
> >One recommendation: for starters, I'd much rather see the bytes type
> >standardized without a literal notation. There should be are lots of
> >ways to create bytes objects from string objects, with specific
> >explicit encodings, and those should suffice, at least initially.
> >
> >I also wonder if having a b"..." literal would just add more confusion
> >-- bytes are not characters, but b"..." makes it appear as if they
> >are.
>
> Why not just have the constructor be:
>
>  bytes(initializer [,encoding])
>
> Where initializer must be either an iterable of suitable integers, or a
> unicode/string object.  If the latter (i.e., it's a basestring), the
> encoding argument would then be required.  Then, there's no need for
> special codec support for the bytes type, since you call bytes on the thing
> to be encoded.  And of course, no need for a 'b' literal.

It'd be cruel and unusual punishment though to have to write

  bytes("abc", "Latin-1")

I propose that the default encoding (for basestring instances) ought
to be "ascii" just like everywhere else. (Meaning, it should really be
the system default encoding, which defaults to "ascii" and is
intentionally hard to change.)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread M.-A. Lemburg
Guido van Rossum wrote:
> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
>>> One recommendation: for starters, I'd much rather see the bytes type
>>> standardized without a literal notation. There should be are lots of
>>> ways to create bytes objects from string objects, with specific
>>> explicit encodings, and those should suffice, at least initially.
>>>
>>> I also wonder if having a b"..." literal would just add more confusion
>>> -- bytes are not characters, but b"..." makes it appear as if they
>>> are.
>> Why not just have the constructor be:
>>
>>  bytes(initializer [,encoding])
>>
>> Where initializer must be either an iterable of suitable integers, or a
>> unicode/string object.  If the latter (i.e., it's a basestring), the
>> encoding argument would then be required.  Then, there's no need for
>> special codec support for the bytes type, since you call bytes on the thing
>> to be encoded.  And of course, no need for a 'b' literal.
> 
> It'd be cruel and unusual punishment though to have to write
> 
>   bytes("abc", "Latin-1")
> 
> I propose that the default encoding (for basestring instances) ought
> to be "ascii" just like everywhere else. (Meaning, it should really be
> the system default encoding, which defaults to "ascii" and is
> intentionally hard to change.)

We're talking about Py3k here: "abc" will be a Unicode string,
so why restrict the conversion to 7 bits when you can have 8 bits
without any conversion problems ?

While we're at it: I'd suggest that we remove the auto-conversion
from bytes to Unicode in Py3k and the default encoding along with
it. In Py3k the standard lib will have to be Unicode compatible
anyway and string parser markers like "s#" will have to go away
as well, so there's not much need for this anymore.

(Maybe a bit radical, but I guess that's what Py3k is meant for.)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 13 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Phillip J. Eby
At 10:55 PM 2/13/2006 +0100, M.-A. Lemburg wrote:
>Guido van Rossum wrote:
> > On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> >> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
> >>> One recommendation: for starters, I'd much rather see the bytes type
> >>> standardized without a literal notation. There should be are lots of
> >>> ways to create bytes objects from string objects, with specific
> >>> explicit encodings, and those should suffice, at least initially.
> >>>
> >>> I also wonder if having a b"..." literal would just add more confusion
> >>> -- bytes are not characters, but b"..." makes it appear as if they
> >>> are.
> >> Why not just have the constructor be:
> >>
> >>  bytes(initializer [,encoding])
> >>
> >> Where initializer must be either an iterable of suitable integers, or a
> >> unicode/string object.  If the latter (i.e., it's a basestring), the
> >> encoding argument would then be required.  Then, there's no need for
> >> special codec support for the bytes type, since you call bytes on the 
> thing
> >> to be encoded.  And of course, no need for a 'b' literal.
> >
> > It'd be cruel and unusual punishment though to have to write
> >
> >   bytes("abc", "Latin-1")
> >
> > I propose that the default encoding (for basestring instances) ought
> > to be "ascii" just like everywhere else. (Meaning, it should really be
> > the system default encoding, which defaults to "ascii" and is
> > intentionally hard to change.)
>
>We're talking about Py3k here: "abc" will be a Unicode string,
>so why restrict the conversion to 7 bits when you can have 8 bits
>without any conversion problems ?

Actually, I thought we were talking about adding bytes() in 2.5.

However, now that you've brought this up, it actually makes perfect sense 
to just use latin-1 as the effective encoding for both strings and 
unicode.  In Python 2.x, strings are byte strings by definition, so it's 
only in 3.0 that an encoding would be required.  And again, latin1 is a 
reasonable, roundtrippable default encoding.

So, it sounds like making the encoding default to latin-1 would be a 
reasonably safe approach in both 2.x and 3.x.


>While we're at it: I'd suggest that we remove the auto-conversion
>from bytes to Unicode in Py3k and the default encoding along with
>it. In Py3k the standard lib will have to be Unicode compatible
>anyway and string parser markers like "s#" will have to go away
>as well, so there's not much need for this anymore.

I thought all this was already in the plan for 3.0, but maybe I assume too 
much.  :)

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread M.-A. Lemburg
Phillip J. Eby wrote:
 Why not just have the constructor be:

  bytes(initializer [,encoding])

 Where initializer must be either an iterable of suitable integers, or a
 unicode/string object.  If the latter (i.e., it's a basestring), the
 encoding argument would then be required.  Then, there's no need for
 special codec support for the bytes type, since you call bytes on the 
>> thing
 to be encoded.  And of course, no need for a 'b' literal.
>>> It'd be cruel and unusual punishment though to have to write
>>>
>>>   bytes("abc", "Latin-1")
>>>
>>> I propose that the default encoding (for basestring instances) ought
>>> to be "ascii" just like everywhere else. (Meaning, it should really be
>>> the system default encoding, which defaults to "ascii" and is
>>> intentionally hard to change.)
>> We're talking about Py3k here: "abc" will be a Unicode string,
>> so why restrict the conversion to 7 bits when you can have 8 bits
>> without any conversion problems ?
> 
> Actually, I thought we were talking about adding bytes() in 2.5.

Then we'd need to make the "ascii" encoding assumption
again, just like Guido proposed.

> However, now that you've brought this up, it actually makes perfect sense 
> to just use latin-1 as the effective encoding for both strings and 
> unicode.  In Python 2.x, strings are byte strings by definition, so it's 
> only in 3.0 that an encoding would be required.  And again, latin1 is a 
> reasonable, roundtrippable default encoding.

It is. However, it's not a reasonable assumption of the
default encoding since there are many encodings out there
that special case the characters 0x80-0xFF, hence the choice
of using ASCII as default encoding in Python.

The conversion from Unicode to bytes is different in this
respect, since you are converting from a "bigger" type to
a "smaller" one. Choosing latin-1 as default for this
conversion would give you all 8 bits, instead of just 7
bits that ASCII provides.

> So, it sounds like making the encoding default to latin-1 would be a 
> reasonably safe approach in both 2.x and 3.x.

Reasonable for bytes(): yes. In general: no.

>> While we're at it: I'd suggest that we remove the auto-conversion
>>from bytes to Unicode in Py3k and the default encoding along with
>> it. In Py3k the standard lib will have to be Unicode compatible
>> anyway and string parser markers like "s#" will have to go away
>> as well, so there's not much need for this anymore.
> 
> I thought all this was already in the plan for 3.0, but maybe I assume too 
> much.  :)

Wouldn't want to wait for Py4D :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 13 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> > It'd be cruel and unusual punishment though to have to write
> >
> >   bytes("abc", "Latin-1")
> >
> > I propose that the default encoding (for basestring instances) ought
> > to be "ascii" just like everywhere else. (Meaning, it should really be
> > the system default encoding, which defaults to "ascii" and is
> > intentionally hard to change.)
>
> We're talking about Py3k here: "abc" will be a Unicode string,
> so why restrict the conversion to 7 bits when you can have 8 bits
> without any conversion problems ?

As Phillip guessed, I was indeed thinking about introducing bytes()
sooner than that, perhaps even in 2.5 (though I don't want anything
rushed).

Even in Py3k though, the encoding issue stands -- what if the file
encoding is Unicode? Then using Latin-1 to encode bytes by default
might not by what the user expected. Or what if the file encoding is
something totally different? (Cyrillic, Greek, Japanese, Klingon.)
Anything default but ASCII isn't going to work as expected. ASCII
isn't going to work as expected either, but it will complain loudly
(by throwing a UnicodeError) whenever you try it, rather than causing
subtle bugs later.

> While we're at it: I'd suggest that we remove the auto-conversion
> from bytes to Unicode in Py3k and the default encoding along with
> it.

I'm not sure which auto-conversion you're talking about, since there
is no bytes type yet. If you're talking about the auto-conversion from
str to unicode: the bytes type should not be assumed to have *any*
properties that the current str type has, and that includes
auto-conversion.

> In Py3k the standard lib will have to be Unicode compatible
> anyway and string parser markers like "s#" will have to go away
> as well, so there's not much need for this anymore.
>
> (Maybe a bit radical, but I guess that's what Py3k is meant for.)

Right.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> Actually, I thought we were talking about adding bytes() in 2.5.

I was.

> However, now that you've brought this up, it actually makes perfect sense
> to just use latin-1 as the effective encoding for both strings and
> unicode.  In Python 2.x, strings are byte strings by definition, so it's
> only in 3.0 that an encoding would be required.  And again, latin1 is a
> reasonable, roundtrippable default encoding.
>
> So, it sounds like making the encoding default to latin-1 would be a
> reasonably safe approach in both 2.x and 3.x.

I disagree. IMO the same reasons why we don't do this now for the
conversion between str and unicode stands for bytes.

> >While we're at it: I'd suggest that we remove the auto-conversion
> >from bytes to Unicode in Py3k and the default encoding along with
> >it. In Py3k the standard lib will have to be Unicode compatible
> >anyway and string parser markers like "s#" will have to go away
> >as well, so there's not much need for this anymore.

I don't know yet what the C API will look like in 3.0. But it may well
have to support auto-conversion from Unicode to char* using some
system default encoding (e.g. the Windows default code page?) in order
to be able to conveniently wrap OS APIs that use char* instead of some
sort of Unicode (and each OS has its own way of interpreting char* as
Unicode -- I believe Apple uses UTF-8?).

> I thought all this was already in the plan for 3.0, but maybe I assume too
> much.  :)

In Py3k, I can see two reasonable approaches to conversion between
strings (Unicode) and bytes: always require an explicit encoding, or
assume ASCII. Anything else is asking for trouble IMO.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Phillip J. Eby
At 12:03 AM 2/14/2006 +0100, M.-A. Lemburg wrote:
>The conversion from Unicode to bytes is different in this
>respect, since you are converting from a "bigger" type to
>a "smaller" one. Choosing latin-1 as default for this
>conversion would give you all 8 bits, instead of just 7
>bits that ASCII provides.

I was just pointing out that since byte strings are bytes by definition, 
then simply putting those bytes in a bytes() object doesn't alter the 
existing encoding.  So, using latin-1 when converting a string to bytes 
actually seems like the the One Obvious Way to do it.

I'm so accustomed to being wary of encoding issues that the idea doesn't 
*feel* right at first - I keep going, "but you can't know what encoding 
those bytes are".  Then I go, Duh, that's the point.  If you convert 
str->bytes, there's no conversion and no interpretation - neither the str 
nor the bytes object knows its encoding, and that's okay.  So 
str(bytes_object) (in 2.x) should also just turn it back to a normal 
bytestring.

In fact, the 'encoding' argument seems useless in the case of str objects, 
and it seems it should default to latin-1 for unicode objects.  The only 
use I see for having an encoding for a 'str' would be to allow confirming 
that the input string in fact is valid for that encoding.  So, 
"bytes(some_str,'ascii')" would be an assertion that some_str must be valid 
ASCII.


> > So, it sounds like making the encoding default to latin-1 would be a
> > reasonably safe approach in both 2.x and 3.x.
>
>Reasonable for bytes(): yes. In general: no.

Right, I was only talking about bytes().

For 3.0, the type formerly known as "str" won't exist, so only the Unicode 
part will be relevant then.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 12:03 AM 2/14/2006 +0100, M.-A. Lemburg wrote:
> >The conversion from Unicode to bytes is different in this
> >respect, since you are converting from a "bigger" type to
> >a "smaller" one. Choosing latin-1 as default for this
> >conversion would give you all 8 bits, instead of just 7
> >bits that ASCII provides.
>
> I was just pointing out that since byte strings are bytes by definition,
> then simply putting those bytes in a bytes() object doesn't alter the
> existing encoding.  So, using latin-1 when converting a string to bytes
> actually seems like the the One Obvious Way to do it.

This actually makes some sense -- bytes(s) where isinstance(s, str)
should just copy the data, since we can't know what encoding the user
believes it is in anyway. (With the exception of string literals,
where it makes sense to assume that the user believes it is in the
same encoding as the source code -- but I believe non-ASCII characters
in string literals are disallowed anyway, or at least known to cause
undefined results in rats.)

> I'm so accustomed to being wary of encoding issues that the idea doesn't
> *feel* right at first - I keep going, "but you can't know what encoding
> those bytes are".  Then I go, Duh, that's the point.  If you convert
> str->bytes, there's no conversion and no interpretation - neither the str
> nor the bytes object knows its encoding, and that's okay.  So
> str(bytes_object) (in 2.x) should also just turn it back to a normal
> bytestring.

You've got me convinced. Scrap my previous responses in this thread.

> In fact, the 'encoding' argument seems useless in the case of str objects,

Right.

> and it seems it should default to latin-1 for unicode objects.

But here I disagree.

> The only
> use I see for having an encoding for a 'str' would be to allow confirming
> that the input string in fact is valid for that encoding.  So,
> "bytes(some_str,'ascii')" would be an assertion that some_str must be valid
> ASCII.

We already have ways to assert that a string is ASCII.

> For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> part will be relevant then.

And I think then the encoding should be required or default to ASCII.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Michael Foord
Phillip J. Eby wrote:
[snip..]
>
> In fact, the 'encoding' argument seems useless in the case of str objects, 
> and it seems it should default to latin-1 for unicode objects.  The only 
>   
-1 for having an implicit encode that behaves differently to other 
implicit encodes/decodes that happen in Python. Life is confusing enough 
already.

Michael Foord

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
> Phillip J. Eby wrote:
> [snip..]
> >
> > In fact, the 'encoding' argument seems useless in the case of str objects,
> > and it seems it should default to latin-1 for unicode objects.  The only
> >
> -1 for having an implicit encode that behaves differently to other
> implicit encodes/decodes that happen in Python. Life is confusing enough
> already.

But adding an encoding doesn't help. The str.encode() method always
assumes that the string itself is ASCII-encoded, and that's not good
enough:

>>> "abc".encode("latin-1")
'abc'
>>> "abc".decode("latin-1")
u'abc'
>>> "abc\xf0".decode("latin-1")
u'abc\xf0'
>>> "abc\xf0".encode("latin-1")
Traceback (most recent call last):
  File "", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
3: ordinal not in range(128)
>>>

The right way to look at this is, as Phillip says, to consider
conversion between str and bytes as not an encoding but a data type
change *only*.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Barry Warsaw
On Mon, 2006-02-13 at 15:44 -0800, Guido van Rossum wrote:

> The right way to look at this is, as Phillip says, to consider
> conversion between str and bytes as not an encoding but a data type
> change *only*.

That sounds right to me too.
-Barry



signature.asc
Description: This is a digitally signed message part
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Michael Foord
Guido van Rossum wrote:
> On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
>   
>> Phillip J. Eby wrote:
>> [snip..]
>> 
>>> In fact, the 'encoding' argument seems useless in the case of str objects,
>>> and it seems it should default to latin-1 for unicode objects.  The only
>>>
>>>   
>> -1 for having an implicit encode that behaves differently to other
>> implicit encodes/decodes that happen in Python. Life is confusing enough
>> already.
>> 
>
> But adding an encoding doesn't help. The str.encode() method always
> assumes that the string itself is ASCII-encoded, and that's not good
> enough:
>
>   
Sorry - I meant for the unicode to bytes case. A default encoding that 
behaves differently to the current to implicit encodes/decodes would be 
confusing IMHO.

I agree that string to bytes shouldn't change the value of the bytes. 
The least confusing description of a non-unicode string is 'byte-string'.

Michael Foord
 "abc".encode("latin-1")
 
> 'abc'
>   
 "abc".decode("latin-1")
 
> u'abc'
>   
 "abc\xf0".decode("latin-1")
 
> u'abc\xf0'
>   
 "abc\xf0".encode("latin-1")
 
> Traceback (most recent call last):
>   File "", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> 3: ordinal not in range(128)
>   
>
> The right way to look at this is, as Phillip says, to consider
> conversion between str and bytes as not an encoding but a data type
> change *only*.
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
>   

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Phillip J. Eby
At 03:23 PM 2/13/2006 -0800, Guido van Rossum wrote:
>On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> > The only
> > use I see for having an encoding for a 'str' would be to allow confirming
> > that the input string in fact is valid for that encoding.  So,
> > "bytes(some_str,'ascii')" would be an assertion that some_str must be valid
> > ASCII.
>
>We already have ways to assert that a string is ASCII.

I didn't mean that it was the only purpose.  In Python 2.x, practical code 
has to sometimes deal with "string-like" objects.  That is, code that takes 
either strings or unicode.  If such code calls bytes(), it's going to want 
to include an encoding so that unicode conversions won't fail.  But 
silently ignoring the encoding argument in that case isn't a good idea.

Ergo, I propose to permit the encoding to be specified when passing in a 
(2.x) str object, to allow code that handles both str and unicode to be 
"str-stable" in 2.x.

I'm fine with rejecting an encoding argument if the initializer is not a 
str or unicode; I just don't want the call signature to vary based on a 
runtime distinction between str and unicode.  And, I don't want the 
encoding argument to be silently ignored when you pass in a string.  If I 
assert that I'm encoding ASCII (or utf-8 or whatever), then the string 
should be required to be valid.  If I don't pass in an encoding, then I'm 
good to go.

(This is orthogonal to the issue of what encoding is used as a default for 
conversions from the unicode type, btw.)


> > For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> > part will be relevant then.
>
>And I think then the encoding should be required or default to ASCII.

The reason I'm arguing for latin-1 is symmetry in 2.x versions only.  (In 
3.x, there's no str vs. unicode, and thus nothing to be symmetrical.)  So, 
if you invoke bytes() without an encoding on a 2.x basestring, you should 
get the same result.  Latin-1 produces "the same result" when viewed in 
terms of the resulting byte string.

If we don't go with latin-1, I'd argue for requiring an encoding for 
unicode objects in 2.x, because that seems like the only reasonable way to 
break the symmetry between str and unicode, even though it forces 
"str-stable" code to specify an encoding.  The key is that at least *one* 
of the signatures needs to be stable in meaning across both str and unicode 
in 2.x in order to allow unicode-safe, str-stable code to be written.

(Again, for 3.x, this issue doesn't come into play because there's only one 
string type to worry about; what the default is or whether there's a 
default is therefore entirely up to you.)

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
> Sorry - I meant for the unicode to bytes case. A default encoding that
> behaves differently to the current to implicit encodes/decodes would be
> confusing IMHO.

And I am in agreement with you there (I think only Phillip argued otherwise).

> I agree that string to bytes shouldn't change the value of the bytes.

It's a deal then.

Can the owner of PEP 332 update the PEP to record these decisions?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> I didn't mean that it was the only purpose.  In Python 2.x, practical code
> has to sometimes deal with "string-like" objects.  That is, code that takes
> either strings or unicode.  If such code calls bytes(), it's going to want
> to include an encoding so that unicode conversions won't fail.

That sounds like a rather hypothetical example. Have you thought it
through? Presumably code that accepts both str and unicode either
doesn't care about encodings, but simply returns objects of the same
type as the arguments -- and then it's unlikely to want to convert the
arguments to bytes; or it *does* care about encodings, and then it
probably already has to special-case str vs. unicode because it has to
control how str objects are interpreted.

> But
> silently ignoring the encoding argument in that case isn't a good idea.
>
> Ergo, I propose to permit the encoding to be specified when passing in a
> (2.x) str object, to allow code that handles both str and unicode to be
> "str-stable" in 2.x.

Again, have you thought this through?

What would bytes("abc\xf0", "latin-1") *mean*? Take the string
"abc\xf0", interpret it as being encoded in XXX, and then encode from
XXX to Latin-1. But what's XXX? As I showed in a previous post,
"abc\xf0".encode("latin-1") *fails* because the source for the
encoding is assumed to be ASCII.

I think we can make this work only when the string in fact only
contains ASCII and the encoding maps ASCII to itself (which most
encodings do -- but e.g. EBCDIC does not). But I'm not sure how useful
that is.

> I'm fine with rejecting an encoding argument if the initializer is not a
> str or unicode; I just don't want the call signature to vary based on a
> runtime distinction between str and unicode.

I'm still not sure that this will actually help anyone.

> And, I don't want the
> encoding argument to be silently ignored when you pass in a string.

Agreed.

> If I
> assert that I'm encoding ASCII (or utf-8 or whatever), then the string
> should be required to be valid.

Defined how? That the string is already in that encoding?

> If I don't pass in an encoding, then I'm
> good to go.
>
> (This is orthogonal to the issue of what encoding is used as a default for
> conversions from the unicode type, btw.)

Right. The issues are completely different!

> > > For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> > > part will be relevant then.
> >
> >And I think then the encoding should be required or default to ASCII.
>
> The reason I'm arguing for latin-1 is symmetry in 2.x versions only.  (In
> 3.x, there's no str vs. unicode, and thus nothing to be symmetrical.)  So,
> if you invoke bytes() without an encoding on a 2.x basestring, you should
> get the same result.  Latin-1 produces "the same result" when viewed in
> terms of the resulting byte string.

Only if you assume the str object is encoded in Latin-1.

Your argument for symmetry would be a lot stronger if we used Latin-1
for the conversion between str and Unicode. But we don't. I like the
other interpretation (which I thought was yours too?) much better: str
<--> bytes conversions don't use encodings by simply change the type
without changing the bytes; conversion between either and unicode
works exactly the same, and requires an encoding unless all the
characters involved are pure ASCII.

> If we don't go with latin-1, I'd argue for requiring an encoding for
> unicode objects in 2.x, because that seems like the only reasonable way to
> break the symmetry between str and unicode, even though it forces
> "str-stable" code to specify an encoding.  The key is that at least *one*
> of the signatures needs to be stable in meaning across both str and unicode
> in 2.x in order to allow unicode-safe, str-stable code to be written.

Using ASCII as the default encoding has the same property -- it can
remain stable across the 2.x / 3.0 boundary.

> (Again, for 3.x, this issue doesn't come into play because there's only one
> string type to worry about; what the default is or whether there's a
> default is therefore entirely up to you.)

A nice-to-have property would be that it might be possible to write
code that today deals with Unicode and str, but in 3.0 will deal with
Unicode and bytes instead. But I'm not sure how likely that is since
bytes objects won't have most methods that str and Unicode objects
have (like lower(), find(), etc.).

There's one property that bytes, str and unicode all share: type(x[0])
== type(x), at least as long as len(x) >= 1. This is perhaps the
ultimate test for string-ness.

Or should b[0] be an int, if b is a bytes object? That would change
things dramatically.

There's also the consideration for APIs that, informally, accept
either a string or a sequence of objects. Many of these exist, and
they are probably all being converted to support unicode as well as
str (if it makes sense at all). Should a bytes object be considered as
a sequen

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread James Y Knight
On Feb 13, 2006, at 7:09 PM, Guido van Rossum wrote:

> On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
>> Sorry - I meant for the unicode to bytes case. A default encoding  
>> that
>> behaves differently to the current to implicit encodes/decodes  
>> would be
>> confusing IMHO.
>
> And I am in agreement with you there (I think only Phillip argued  
> otherwise).
>
>> I agree that string to bytes shouldn't change the value of the bytes.
>
> It's a deal then.
>
> Can the owner of PEP 332 update the PEP to record these decisions?

So, in python2.X, you have:
- bytes("\x80"), you get a bytestring with a single byte of value  
0x80 (when no encoding is specified, and the object is a str, it  
doesn't try to encode it at all).
- bytes("\x80", encoding="latin-1"), you get an error, because  
encoding "\x80" into latin-1 implicitly decodes it into a unicode  
object first, via the system-wide default: ascii.
- bytes(u"\x80"), you get an error, because the default encoding for  
a unicode string is ascii.
- bytes(u"\x80", encoding="latin-1"), you get a bytestring with a  
single byte of value 0x80.

In py3k, when the str object is eliminated, then what do you have?  
Perhaps
- bytes("\x80"), you get an error, encoding is required. There is no  
such thing as "default encoding" anymore, as there's no str object.
- bytes("\x80", encoding="latin-1"), you get a bytestring with a  
single byte of value 0x80.


James
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, James Y Knight <[EMAIL PROTECTED]> wrote:
> So, in python2.X, you have:
> - bytes("\x80"), you get a bytestring with a single byte of value
> 0x80 (when no encoding is specified, and the object is a str, it
> doesn't try to encode it at all).
> - bytes("\x80", encoding="latin-1"), you get an error, because
> encoding "\x80" into latin-1 implicitly decodes it into a unicode
> object first, via the system-wide default: ascii.
> - bytes(u"\x80"), you get an error, because the default encoding for
> a unicode string is ascii.
> - bytes(u"\x80", encoding="latin-1"), you get a bytestring with a
> single byte of value 0x80.

Yes to all.

> In py3k, when the str object is eliminated, then what do you have?
> Perhaps
> - bytes("\x80"), you get an error, encoding is required. There is no
> such thing as "default encoding" anymore, as there's no str object.
> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
> single byte of value 0x80.

Yes to both again.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Neil Schemenauer
Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> In py3k, when the str object is eliminated, then what do you have?
>> Perhaps
>> - bytes("\x80"), you get an error, encoding is required. There is no
>> such thing as "default encoding" anymore, as there's no str object.
>> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
>> single byte of value 0x80.
>
> Yes to both again.

I haven't been following this dicussion about bytes() real closely
but I don't think that bytes() should do the encoding.  We already
have a way to spell that:

"\x80".encode('latin-1')

Also, I think it would useful to introduce byte array literals at
the same time as the bytes object.  That would allow people to use
byte arrays without having to get involved with all the silly string
encoding confusion.

  Neil

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Fred L. Drake, Jr.
On Monday 13 February 2006 21:52, Neil Schemenauer wrote:
 > Also, I think it would useful to introduce byte array literals at
 > the same time as the bytes object.  That would allow people to use
 > byte arrays without having to get involved with all the silly string
 > encoding confusion.

bytes([0, 1, 2, 3])


  -Fred

-- 
Fred L. Drake, Jr.   
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Guido van Rossum
On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
> Guido van Rossum <[EMAIL PROTECTED]> wrote:
> >> In py3k, when the str object is eliminated, then what do you have?
> >> Perhaps
> >> - bytes("\x80"), you get an error, encoding is required. There is no
> >> such thing as "default encoding" anymore, as there's no str object.
> >> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
> >> single byte of value 0x80.
> >
> > Yes to both again.
>
> I haven't been following this dicussion about bytes() real closely
> but I don't think that bytes() should do the encoding.  We already
> have a way to spell that:
>
> "\x80".encode('latin-1')

But in 2.5 we can't change that to return a bytes object without
creating HUGE incompatibilities.

In general I've come to appreciate that there are two ways of
converting an object of type A to an object of type B: ask an A
instance to convert itself to a B, or ask the type B to create a new
instance from an A. Depending on what A and B are, both APIs make
sense; sometimes reasons of decoupling require that A can't know about
B, in which case you have to use the latter approach; sometimes B
can't know about A, in which case you have to use the former. Even
when A == B we sometimes support both APIs: to create a new list from
a list a, you can write a[:] or list(a); to create a new dict from a
dict d, you can write d.copy() or dict(d).

An advantage of the latter API is that there's no confusion about the
resulting type -- dict(d) is definitely a dict, and list(a) is
definitely a list. Not so for d.copy() or a[:] -- if the input type is
another mapping or sequence, it'll probably return an object of that
same type.

Again, it depends on the application which is better.

I think that bytes(s, ) is fine, especially for expressing a
new type, since it is unambiguous about the result type, and has no
backwards compatibility issues.

> Also, I think it would useful to introduce byte array literals at
> the same time as the bytes object.  That would allow people to use
> byte arrays without having to get involved with all the silly string
> encoding confusion.

You missed the part where I said that introducing the bytes type
*without* a literal seems to be a good first step. A new type, even
built-in, is much less drastic than a new literal (which requires
lexer and parser support in addition to everything else).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Barry Warsaw
On Feb 13, 2006, at 7:29 PM, Guido van Rossum wrote:

> There's one property that bytes, str and unicode all share: type(x[0])
> == type(x), at least as long as len(x) >= 1. This is perhaps the
> ultimate test for string-ness.

But not perfect, since of course other containers can contain objects  
of their own type too.  But it leads to an interesting issue...

> Or should b[0] be an int, if b is a bytes object? That would change
> things dramatically.

This makes me think I want an unsigned byte type, which b[0] would  
return.  In another thread I think someone mentioned something about  
fixed width integral types, such that you could have an object that  
was guaranteed to be 8-bits wide, 16-bits wide, etc.   Maybe you also  
want signed and unsigned versions of each.  This may seem like YAGNI  
to many people, but as I've been working on a tightly embedded/ 
extended application for the last few years, I've definitely had  
occasions where I wish I could more closely and more directly model  
my C values as Python objects (without using the standard workarounds  
or writing my own C extension types).

But anyway, without hyper-generalizing, it's still worth asking  
whether a bytes type is just a container of byte objects, where the  
contained objects would be distinct, fixed 8-bit unsigned integral  
types.

> There's also the consideration for APIs that, informally, accept
> either a string or a sequence of objects. Many of these exist, and
> they are probably all being converted to support unicode as well as
> str (if it makes sense at all). Should a bytes object be considered as
> a sequence of things, or as a single thing, from the POV of these
> types of APIs? Should we try to standardize how code tests for the
> difference? (Currently all sorts of shortcuts are being taken, from
> isinstance(x, (list, tuple)) to isinstance(x, basestring).)

I think bytes objects are very much like string objects today --  
they're the photons of Python since they can act like either  
sequences or scalars, depending on the context.  For example, we have  
code that needs to deal with situations where an API can return  
either a scalar or a sequence of those scalars.  So we have a utility  
function like this:

def thingiter(obj):
 try:
 it = iter(obj)
 except TypeError:
 yield obj
 else:
 for item in it:
 yield item

Maybe there's a better way to do this, but the most obvious problem  
is that (for our use cases), this fails for strings because in this  
context we want strings to act like scalars.  So we add a little test  
just before the "try:" like "if isinstance(obj, basestring): yield  
obj".  But that's yucky.

I don't know what the solution is -- if there /is/ a solution short  
of special case tests like above, but I think the key observation is  
that sometimes you want your string to act like a sequence and  
sometimes you want it to act like a scalar.  I suspect bytes objects  
will be the same way.

-Barry


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Phillip J. Eby
At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> > I didn't mean that it was the only purpose.  In Python 2.x, practical code
> > has to sometimes deal with "string-like" objects.  That is, code that takes
> > either strings or unicode.  If such code calls bytes(), it's going to want
> > to include an encoding so that unicode conversions won't fail.
>
>That sounds like a rather hypothetical example. Have you thought it
>through? Presumably code that accepts both str and unicode either
>doesn't care about encodings, but simply returns objects of the same
>type as the arguments -- and then it's unlikely to want to convert the
>arguments to bytes; or it *does* care about encodings, and then it
>probably already has to special-case str vs. unicode because it has to
>control how str objects are interpreted.

Actually, it's the other way around.  Code that wants to output 
uninterpreted bytes right now and accepts either strings or Unicode has to 
special-case *unicode* -- not str, because str is the only "bytes type" we 
currently have.

This creates an interesting issue in WSGI for Jython, which of course only 
has one (unicode-based) string type now.  Since there's no bytes type in 
Python in general, the only solution we could come up with was to treat 
such strings as latin-1:

 http://www.python.org/peps/pep-0333.html#unicode-issues

This is why I'm biased towards latin-1 encoding of unicode to bytes; it's 
"the same thing" as an uninterpreted string of bytes.

I think the difference in our viewpoints is that you're still thinking 
"string" thoughts, whereas I'm thinking "byte" thoughts.  Bytes are just 
bytes; they don't *have* an encoding.

So, if you think of "converting a string to bytes" as meaning "create an 
array of numerals corresponding to the characters in the string", then this 
leads to a uniform result whether the characters are in a str or a unicode 
object.  In other words, to me, bytes(str_or_unicode) should be treated as:

 bytes(map(ord, str_or_unicode))

In other words, without an encoding, bytes() should simply treat str and 
unicode objects *as if they were a sequence of integers*, and produce an 
error when an integer is out of range.  This is a logical and consistent 
interpretation in the absence of an encoding, because in that case you 
don't care about the encoding - it's just raw data.

If, however, you include an encoding, then you're stating that you want to 
encode the *meaning* of the string, not merely its integer values.


>What would bytes("abc\xf0", "latin-1") *mean*? Take the string
>"abc\xf0", interpret it as being encoded in XXX, and then encode from
>XXX to Latin-1. But what's XXX? As I showed in a previous post,
>"abc\xf0".encode("latin-1") *fails* because the source for the
>encoding is assumed to be ASCII.

I'm saying that XXX would be the same encoding as you specified.  i.e., 
including an encoding means you are encoding the *meaning* of the string.

However, I believe I mainly proposed this as an alternative to having 
bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I 
think is probably a saner default.


>Your argument for symmetry would be a lot stronger if we used Latin-1
>for the conversion between str and Unicode. But we don't.

But that's because we're dealing with its meaning *as a string*, not merely 
as ordinals in a sequence of bytes.


>  I like the
>other interpretation (which I thought was yours too?) much better: str
><--> bytes conversions don't use encodings by simply change the type
>without changing the bytes;

I like it better too.  The part you didn't like was where MAL and I believe 
this should be extended to Unicode characters in the 0-255 range also.  :)


>There's one property that bytes, str and unicode all share: type(x[0])
>== type(x), at least as long as len(x) >= 1. This is perhaps the
>ultimate test for string-ness.
>
>Or should b[0] be an int, if b is a bytes object? That would change
>things dramatically.

+1 for it being an int.  Heck, I'd want to at least consider the 
possibility of introducing a character type (chr?) in Python 3.0, and 
getting rid of the "iterating a string yields strings" 
characteristic.  I've found it to be a bit of a pain when dealing with 
heterogeneous nested sequences that contain strings.


>There's also the consideration for APIs that, informally, accept
>either a string or a sequence of objects. Many of these exist, and
>they are probably all being converted to support unicode as well as
>str (if it makes sense at all). Should a bytes object be considered as
>a sequence of things, or as a single thing, from the POV of these
>types of APIs? Should we try to standardize how code tests for the
>difference? (Currently all sorts of shortcuts are being taken, from
>isinstance(x, (list, tuple)) to isinstance(x, basestring).)

I'm inclined to think of certain features at least in terms of the buffer 
interface, but 

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Martin v. Löwis
M.-A. Lemburg wrote:
> We're talking about Py3k here: "abc" will be a Unicode string,
> so why restrict the conversion to 7 bits when you can have 8 bits
> without any conversion problems ?

YAGNI. If you have a need for byte string in source code, it will
typically be "random" bytes, which can be nicely used through

  bytes([0x73, 0x9f, 0x44, 0xd2, 0xfb, 0x49, 0xa3, 0x14,  0x8b, 0xee])

For larger blocks, people should use base64.string_to_bytes (which
can become a synonym for base64.decodestring in Py3k).

If you have bytes that are meaningful text for some application
(say, a wire protocol), it is typically ASCII-Text. No protocol
I know of uses non-ASCII characters for protocol information.

Of course, you need a way to get .encode output as bytes somehow,
both in 2.5, and in Py3k. I suggest writing

  bytes(s.encode(encoding))

In 2.5, bytes() can be constructed from strings, and will do a
conversion; in Py3k, .encode will already return a string, so
this will be a no-op.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Martin v. Löwis
Phillip J. Eby wrote:
> I was just pointing out that since byte strings are bytes by definition, 
> then simply putting those bytes in a bytes() object doesn't alter the 
> existing encoding.  So, using latin-1 when converting a string to bytes 
> actually seems like the the One Obvious Way to do it.

This is a misconception. In Python 2.x, the type str already *is* a
bytes type. So if S is an instance of 2.x str, bytes(S) does not need
to do any conversion. You don't need to assume it is latin-1: it's
already bytes.

> In fact, the 'encoding' argument seems useless in the case of str objects, 
> and it seems it should default to latin-1 for unicode objects.

I agree with the former, but not with the latter. There shouldn't be a
conversion of Unicode objects to bytes at all. If you want bytes from
a Unicode string U, write

  bytes(U.encode(encoding))

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Martin v. Löwis
Guido van Rossum wrote:
>>In py3k, when the str object is eliminated, then what do you have?
>>Perhaps
>>- bytes("\x80"), you get an error, encoding is required. There is no
>>such thing as "default encoding" anymore, as there's no str object.
>>- bytes("\x80", encoding="latin-1"), you get a bytestring with a
>>single byte of value 0x80.
> 
> 
> Yes to both again.

Please reconsider, and don't give bytes() an encoding= argument.
It doesn't need one. In Python 3, people should write

  "\x80".encode("latin-1")

if they absolutely want to, although they better write

  bytes([0x80])

Now, the first form isn't valid in 2.5, but

  bytes(u"\x80".encode("latin-1"))

could work in all versions.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Adam Olsen
On 2/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> M.-A. Lemburg wrote:
> > We're talking about Py3k here: "abc" will be a Unicode string,
> > so why restrict the conversion to 7 bits when you can have 8 bits
> > without any conversion problems ?
>
> YAGNI. If you have a need for byte string in source code, it will
> typically be "random" bytes, which can be nicely used through
>
>   bytes([0x73, 0x9f, 0x44, 0xd2, 0xfb, 0x49, 0xa3, 0x14,  0x8b, 0xee])
>
> For larger blocks, people should use base64.string_to_bytes (which
> can become a synonym for base64.decodestring in Py3k).
>
> If you have bytes that are meaningful text for some application
> (say, a wire protocol), it is typically ASCII-Text. No protocol
> I know of uses non-ASCII characters for protocol information.

What would that imply for repr()?  To support eval(repr(x)) it would
have to produce whatever format the source code includes to begin
with.

If I understand correctly there's three main candidates:
1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x
2. Direct copying to str/unicode if it's only ascii values, switching
to a list of hex literals if there's any non-ascii values
3. b"foo" literal with ascii for all ascii characters (other than \
and "), \xFF for individual characters that aren't ascii

Given the choice I prefer the third option, with the second option as
my runner up.  The first option just screams "silent errors" to me.


--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread James Y Knight
On Feb 14, 2006, at 12:20 AM, Phillip J. Eby wrote:
>  bytes(map(ord, str_or_unicode))
>
> In other words, without an encoding, bytes() should simply treat  
> str and
> unicode objects *as if they were a sequence of integers*, and  
> produce an
> error when an integer is out of range.  This is a logical and  
> consistent
> interpretation in the absence of an encoding, because in that case you
> don't care about the encoding - it's just raw data.


If you're talking about "raw data", then make bytes(unicodestring)  
produce what buffer(unicodestring) currently does -- something  
completely and utterly worthless. :) [it depends on how you compiled  
python and what endianness your system has.]

There really is no case where you don't care about the  
encoding...there is always a specific desired output encoding, and  
you have to think about what encoding that is. The argument that  
latin-1 is a sensible default just because you can convert to latin-1  
by chopping off the upper 3 bytes of a unicode character's ordinal  
position is not convincing; you're still doing an encoding operation,  
it just happens to be computationally easy. That Jython programs have  
to pretend that unicode strings are an appropriate way to store  
bytes, and thus often have to do fake "latin-1" conversions which are  
really no such thing, doesn't make a convincing argument either.  
Using unicode strings to store bytes read from or written to a socket  
is really just broken.

Actually having any default encoding at all is IMO a poor idea, but  
as python has one at the moment (ascii), might as well keep using it  
for consistency until it's eliminated (sys.setdefaultencoding 
('undefined') is my friend.)

James
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-13 Thread Martin v. Löwis
Adam Olsen wrote:
> What would that imply for repr()?  To support eval(repr(x))

I don't think eval(repr(x)) needs to be supported for the bytes
type. However, if that is desirable, it should return something
like

  bytes([1,2,3])

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Thomas Wouters
On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:

> But adding an encoding doesn't help. The str.encode() method always
> assumes that the string itself is ASCII-encoded, and that's not good
> enough:

> >>> "abc".encode("latin-1")
> 'abc'
> >>> "abc".decode("latin-1")
> u'abc'
> >>> "abc\xf0".decode("latin-1")
> u'abc\xf0'
> >>> "abc\xf0".encode("latin-1")
> Traceback (most recent call last):
>   File "", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> 3: ordinal not in range(128)

These comments disturb me. I never really understood why (byte) strings grew
the 'encode' method, since 8-bit strings *are already encoded*, by their
very nature. I mean, I understand it's useful because Python does
non-unicode encodings like 'hex', but I don't really understand *why*. The
benefits don't seem to outweigh the cost (but that's hindsight.)

Directly encoding a (byte) string into a unicode encoding is mostly useless,
as you've shown. The only use-case I can think of is translating ASCII in,
for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op,
unless the system encoding isn't 'ascii' (and that's pretty rare, and not
something a Python programmer should depend on.) On the other hand, the fact
that (byte) strings have an 'encode' method creates a lot of confusion in
unicode-newbies, and causes programs to break only when input is non-ASCII.
And non-ASCII input just happens too often and too unpredictably in
'real-world' code, and not enough in European programmers' tests ;P

Unicode objects and strings are not the same thing. We shouldn't treat them
as the same thing. They share an interface (like lists and tuples do), and
if you only use that interface, treating them as the same kind object is
mostly ok. They actually share *less* of an interface than lists and tuples,
though, as comparing strings to unicode objects can raise an exception,
whereas comparing lists to tuples is not expected to. For anything less
trivial than indexing, slicing and most of the string methods, and anything
what so ever involving non-ASCII (or, rather, non-system-encoding), unicode
objects and strings *must* be treated separately. For instance, there is no
correct way to do:

  s.split("\x80")

unless you know the type of 's'. If it's unicode, you want u"\x80" instead
of "\x80". If it's not unicode, splitting "\x80" may not even be sensible,
but you wouldn't know from looking at the code -- maybe it expects a
specific encoding (or encoding family), maybe not. As soon as you deal with
unicode, you need to really understand the concept, and too many programmers
don't. And it's very hard to tell from someone's comments whether they fail
to understand or just get some of the terminology wrong; that's why Guido's
comments about 'encoding a byte string' and 'what if the file encoding is
Unicode' scare me. The unicode/string mixup almost makes me wish Python
was statically typed.

So please, please, please don't make the mistake of 'doing something' with
the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string.
It wouldn't actually be usable except for the same things as 'str.encode':
to convert from ASCII to non-ASCII-supersets, or to convert to non-unicode
encodings (such as 'hex'.) You can achieve those two by doing, e.g.,
'bytes(s.encode('hex'))' if you really want to. Ignoring the encoding
(rather than raising an exception) would also allow code to be trivially
portable between Python 2.x and Py3K, when "" is actually a unicode object.

Not that I'm happy with ignoring anything, but not ignoring would be bigger
crime here.

Oh, and while on the subject, I'm not convinced going all-unicode in Py3K is
a good idea either, but maybe I should save that discussion for PyCon. I'm
not thinking "why do we need unicode" anymore (which I did two years ago ;)
but I *am* thinking it'll be a big step for 90% of the programmers if they
have to grasp unicode and encodings to be able to even do 'raw_input()'
sensibly. I know I spend an inordinate amount of time trying to explain the
basics on #python on irc.freenode.net already.

-- 
Thomas Wouters <[EMAIL PROTECTED]>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Guido van Rossum wrote:

> I also wonder if having a b"..." literal would just add more confusion
> -- bytes are not characters, but b"..." makes it appear as if they
> are.

I'm inclined to agree. Bytes objects are more likely to be used
for things which are *not* characters -- if they're characters,
they would be better kept in strings or char arrays.

+1 on any eventual bytes literal looking completely different
from a string literal.

Greg
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Guido van Rossum wrote:

> There's also the consideration for APIs that, informally, accept
> either a string or a sequence of objects.

My preference these days is not to design APIs that
way. It's never necessary and it avoids a lot of
problems.

Greg
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Barry Warsaw wrote:

> This makes me think I want an unsigned byte type, which b[0] would  
> return.

Come to think of it, this is something I don't
remember seeing discussed. I've been thinking
that bytes[i] would return an integer, but is
the intention that it would return another bytes
object?

Greg
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Nick Coghlan
Guido van Rossum wrote:
> In general I've come to appreciate that there are two ways of
> converting an object of type A to an object of type B: ask an A
> instance to convert itself to a B, or ask the type B to create a new
> instance from an A.

And the difference between the two isn't even always that clear cut. Sometimes 
you'll ask type B to create a new instance from an A, and then while you're 
not looking type B cheats and goes and asks the A instance to do it instead ;)

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://www.boredomandlaziness.org
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Adam Olsen
On 2/14/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Adam Olsen wrote:
> > What would that imply for repr()?  To support eval(repr(x))
>
> I don't think eval(repr(x)) needs to be supported for the bytes
> type. However, if that is desirable, it should return something
> like
>
>   bytes([1,2,3])

I'm starting to wonder, do we really need anything fancy?  Wouldn't it
be sufficient to have a way to compactly store 8-bit integers?

In 2.x we could convert unicode like this:
bytes(ord(c) for c in u"It's...".encode('utf-8'))
u"It's...".byteencode('utf-8')  # Shortcut for above

In 3.0 it changes to:
"It's...".encode('utf-8')
u"It's...".byteencode('utf-8')  # Same as above, kept for compatibility

Passing a str or unicode directly to bytes() would be an error. 
repr(bytes(...)) would produce bytes([1,2,3]).

Probably need a __bytes__() method that print can call, or even better
a __print__(file) method[0].  The write() methods would of course have
to support bytes objects.

I realize it would be odd for the interactive interpret to print them
as a list of ints by default:
>>> u"It's...".byteencode('utf-8')
[73, 116, 39, 115, 46, 46, 46]
But maybe it's time we stopped hiding the real nature of bytes from users?


[0] By this I mean calling objects recursively and telling them what
file to print to, rather than getting a temporary string from them and
printing that.  I always wondered why you could do that from C
extensions but not from Python code.

--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Michael Hudson
Greg Ewing <[EMAIL PROTECTED]> writes:

> Guido van Rossum wrote:
>
>> There's also the consideration for APIs that, informally, accept
>> either a string or a sequence of objects.
>
> My preference these days is not to design APIs that
> way. It's never necessary and it avoids a lot of
> problems.

Oh yes.

Cheers,
mwh

-- 
  ZAPHOD:  Listen three eyes, don't try to outweird me, I get stranger
   things than you free with my breakfast cereal.
-- The Hitch-Hikers Guide to the Galaxy, Episode 7
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Barry Warsaw
On Feb 14, 2006, at 6:35 AM, Greg Ewing wrote:

> Barry Warsaw wrote:
>
>> This makes me think I want an unsigned byte type, which b[0] would
>> return.
>
> Come to think of it, this is something I don't
> remember seeing discussed. I've been thinking
> that bytes[i] would return an integer, but is
> the intention that it would return another bytes
> object?

A related question: what would bytes([104, 101, 108, 108, 111, 8004])  
return?  An exception hopefully.  I also think you'd want bytes([x  
for x in some_bytes_object]) to return an object equal to the original.

-Barry

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread James Y Knight

On Feb 14, 2006, at 1:52 AM, Martin v. Löwis wrote:

> Phillip J. Eby wrote:
>> I was just pointing out that since byte strings are bytes by  
>> definition,
>> then simply putting those bytes in a bytes() object doesn't alter the
>> existing encoding.  So, using latin-1 when converting a string to  
>> bytes
>> actually seems like the the One Obvious Way to do it.
>
> This is a misconception. In Python 2.x, the type str already *is* a
> bytes type. So if S is an instance of 2.x str, bytes(S) does not need
> to do any conversion. You don't need to assume it is latin-1: it's
> already bytes.
>
>> In fact, the 'encoding' argument seems useless in the case of str  
>> objects,
>> and it seems it should default to latin-1 for unicode objects.
>
> I agree with the former, but not with the latter. There shouldn't be a
> conversion of Unicode objects to bytes at all. If you want bytes from
> a Unicode string U, write
>
>   bytes(U.encode(encoding))

I like it, it makes sense. Unicode strings are simply not allowed as  
arguments to the byte constructor. Thinking about it, why would it be  
otherwise? And if you're mixing str-strings and unicode-strings, that  
means the str-strings you're sometimes giving are actually not byte  
strings, but character strings anyhow, so you should be encoding  
those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

Kill the encoding argument, and you're left with:

Python2.X:
- bytes(bytes_object) -> copy constructor
- bytes(str_object) -> copy the bytes from the str to the bytes object
- bytes(sequence_of_ints) -> make bytes with the values of the ints,  
error on overflow

Python3.X removes str, and most APIs that did return str return bytes  
instead. Now all you have is:
- bytes(bytes_object) -> copy constructor
- bytes(sequence_of_ints) -> make bytes with the values of the ints,  
error on overflow

Nice and simple.

James
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Phillip J. Eby
At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:

>On Feb 14, 2006, at 1:52 AM, Martin v. Löwis wrote:
>
>>Phillip J. Eby wrote:
>>>I was just pointing out that since byte strings are bytes by
>>>definition,
>>>then simply putting those bytes in a bytes() object doesn't alter the
>>>existing encoding.  So, using latin-1 when converting a string to
>>>bytes
>>>actually seems like the the One Obvious Way to do it.
>>
>>This is a misconception. In Python 2.x, the type str already *is* a
>>bytes type. So if S is an instance of 2.x str, bytes(S) does not need
>>to do any conversion. You don't need to assume it is latin-1: it's
>>already bytes.
>>
>>>In fact, the 'encoding' argument seems useless in the case of str
>>>objects,
>>>and it seems it should default to latin-1 for unicode objects.
>>
>>I agree with the former, but not with the latter. There shouldn't be a
>>conversion of Unicode objects to bytes at all. If you want bytes from
>>a Unicode string U, write
>>
>>   bytes(U.encode(encoding))
>
>I like it, it makes sense. Unicode strings are simply not allowed as
>arguments to the byte constructor. Thinking about it, why would it be
>otherwise? And if you're mixing str-strings and unicode-strings, that
>means the str-strings you're sometimes giving are actually not byte
>strings, but character strings anyhow, so you should be encoding
>those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

Actually, I think you mean:

 if isinstance(s_or_U, str):
 s_or_U = s_or_U.decode('utf-8')

 b = bytes(s_or_U.encode('utf-8'))

Or maybe:

 if isinstance(s_or_U, unicode):
 s_or_U = s_or_U.encode('utf-8')

 b = bytes(s_or_U)

Which is why I proposed that the boilerplate logic get moved *into* the 
bytes constructor.  I think this use case is going to be common in today's 
Python, but in truth I'm not as sure what bytes() will get used *for* in 
today's Python.  I'm probably overprojecting based on the need to use str 
objects now, but bytes aren't going to be a replacement for str for a good 
while anyway.


>Kill the encoding argument, and you're left with:
>
>Python2.X:
>- bytes(bytes_object) -> copy constructor
>- bytes(str_object) -> copy the bytes from the str to the bytes object
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Python3.X removes str, and most APIs that did return str return bytes
>instead. Now all you have is:
>- bytes(bytes_object) -> copy constructor
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Nice and simple.

I could certainly live with that approach, and it certainly rules out all 
the "when does the encoding argument apply and when should it be an error 
to pass it" questions.  :)

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread M.-A. Lemburg
James Y Knight wrote:
> Kill the encoding argument, and you're left with:
> 
> Python2.X:
> - bytes(bytes_object) -> copy constructor
> - bytes(str_object) -> copy the bytes from the str to the bytes object
> - bytes(sequence_of_ints) -> make bytes with the values of the ints,  
> error on overflow
> 
> Python3.X removes str, and most APIs that did return str return bytes  
> instead. Now all you have is:
> - bytes(bytes_object) -> copy constructor
> - bytes(sequence_of_ints) -> make bytes with the values of the ints,  
> error on overflow
> 
> Nice and simple.

Albeit, too simple.

The above approach would basically remove the possibility to easily
create bytes() from literals in Py3k, since literals in Py3k create
Unicode objects, e.g. bytes("123") would not work in Py3k.

It's hard to imagine how you'd provide a decent upgrade path
for bytes() if you introduce the above semantics in Py2.x.

People would start writing bytes("123") in Py2.x and expect
it to also work in Py3k, which it wouldn't.

To prevent this, you'd have to outrule bytes() construction
from strings altogether, which doesn't look like a viable
option either.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 14 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Josiah Carlson

James Y Knight <[EMAIL PROTECTED]> wrote:
> I like it, it makes sense. Unicode strings are simply not allowed as  
> arguments to the byte constructor. Thinking about it, why would it be  
> otherwise? And if you're mixing str-strings and unicode-strings, that  
> means the str-strings you're sometimes giving are actually not byte  
> strings, but character strings anyhow, so you should be encoding  
> those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

I also like the removal of the encoding...

> Kill the encoding argument, and you're left with:
> 
> Python2.X:
> - bytes(bytes_object) -> copy constructor
> - bytes(str_object) -> copy the bytes from the str to the bytes object
> - bytes(sequence_of_ints) -> make bytes with the values of the ints,  
> error on overflow
> 
> Python3.X removes str, and most APIs that did return str return bytes  
> instead. Now all you have is:
> - bytes(bytes_object) -> copy constructor
> - bytes(sequence_of_ints) -> make bytes with the values of the ints,  
> error on overflow

What's great is that this already works:

>>> import array
>>> array.array('b', [1,2,3])
array('b', [1, 2, 3])
>>> array.array('b', "hello")
array('b', [104, 101, 108, 108, 111])
>>> array.array('b', u"hello")
Traceback (most recent call last):
  File "", line 1, in ?
TypeError: array initializer must be list or string
>>> array.array('b', [150])
Traceback (most recent call last):
  File "", line 1, in ?
OverflowError: signed char is greater than maximum
>>> array.array('B', [150])
array('B', [150])
>>> array.array('B', [350])
Traceback (most recent call last):
  File "", line 1, in ?
OverflowError: unsigned byte integer is greater than maximum


And out of the deal we can get both signed and unsigned ints.

Re: Adam Olsen
> I'm starting to wonder, do we really need anything fancy?  Wouldn't it
> be sufficient to have a way to compactly store 8-bit integers?

It already exists.  It could just use another interface.  The buffer
interface offers any array the ability to return strings.  That may have
to change to return bytes objects in Py3k.

 - Josiah

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread M.-A. Lemburg
Guido van Rossum wrote:
> On 2/13/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>>> It'd be cruel and unusual punishment though to have to write
>>>
>>>   bytes("abc", "Latin-1")
>>>
>>> I propose that the default encoding (for basestring instances) ought
>>> to be "ascii" just like everywhere else. (Meaning, it should really be
>>> the system default encoding, which defaults to "ascii" and is
>>> intentionally hard to change.)
>> We're talking about Py3k here: "abc" will be a Unicode string,
>> so why restrict the conversion to 7 bits when you can have 8 bits
>> without any conversion problems ?
> 
> As Phillip guessed, I was indeed thinking about introducing bytes()
> sooner than that, perhaps even in 2.5 (though I don't want anything
> rushed).

Hmm, that is probably going to be too early. As the thread shows
there are lots of things to take into account, esp. since if you
plan to introduce byte() in 2.x, the upgrade path to 3.x would
have to be carefully planned. Otherwise, we end up introducing
a feature which is meant to prepare for 3.x and then we end up
causing breakage when the move is finally implemented.

> Even in Py3k though, the encoding issue stands -- what if the file
> encoding is Unicode? Then using Latin-1 to encode bytes by default
> might not by what the user expected. Or what if the file encoding is
> something totally different? (Cyrillic, Greek, Japanese, Klingon.)
> Anything default but ASCII isn't going to work as expected. ASCII
> isn't going to work as expected either, but it will complain loudly
> (by throwing a UnicodeError) whenever you try it, rather than causing
> subtle bugs later.

I think there's a misunderstanding here: in Py3k, all "string"
literals will be converted from the source code encoding to
Unicode. There are no ambiguities - a Klingon character will still
map to the same ordinal used to create the byte content regardless
of whether the source file is encoded in UTF-8, UTF-16 or
some Klingon charset (are there any ?).

Furthermore, by restricting to ASCII you'd also outrule hex escapes
which seem to be the natural choice for presenting binary data in
literals - the Unicode representation would then only be an
implementation detail of the way Python treats "string" literals
and a user would certainly expect to find e.g. \x88 in the bytes object
if she writes bytes('\x88').

But maybe you have something different in mind... I'm talking
about ways to create bytes() in Py3k using "string" literals.

>> While we're at it: I'd suggest that we remove the auto-conversion
>> from bytes to Unicode in Py3k and the default encoding along with
>> it.
> 
> I'm not sure which auto-conversion you're talking about, since there
> is no bytes type yet. If you're talking about the auto-conversion from
> str to unicode: the bytes type should not be assumed to have *any*
> properties that the current str type has, and that includes
> auto-conversion.

I was talking about the automatic conversion of 8-bit strings to
Unicode - which was a key feature to make the introduction of
Unicode less painful, but will no longer be necessary in Py3k.

>> In Py3k the standard lib will have to be Unicode compatible
>> anyway and string parser markers like "s#" will have to go away
>> as well, so there's not much need for this anymore.
>>
>> (Maybe a bit radical, but I guess that's what Py3k is meant for.)
> 
> Right.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 14 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread James Y Knight
On Feb 14, 2006, at 11:47 AM, M.-A. Lemburg wrote:
> The above approach would basically remove the possibility to easily
> create bytes() from literals in Py3k, since literals in Py3k create
> Unicode objects, e.g. bytes("123") would not work in Py3k.

That is true. And I think that is correct. There should be b"string"  
syntax.

> It's hard to imagine how you'd provide a decent upgrade path
> for bytes() if you introduce the above semantics in Py2.x.
>
> People would start writing bytes("123") in Py2.x and expect
> it to also work in Py3k, which it wouldn't.

Agreed, it won't work.

> To prevent this, you'd have to outrule bytes() construction
> from strings altogether, which doesn't look like a viable
> option either.

I don't think you have to do that, you just have to provide b"string".

I'd like to point out that the previous proposal had the same issue:

On Feb 13, 2006, at 8:11 PM, Guido van Rossum wrote:
> On 2/13/06, James Y Knight <[EMAIL PROTECTED]> wrote:
>> In py3k, when the str object is eliminated, then what do you have?
>> Perhaps
>> - bytes("\x80"), you get an error, encoding is required. There is no
>> such thing as "default encoding" anymore, as there's no str object.
>> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
>> single byte of value 0x80.
>>
>
> Yes to both again.

James
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread James Y Knight

On Feb 14, 2006, at 11:25 AM, Phillip J. Eby wrote:
> At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:
>> I like it, it makes sense. Unicode strings are simply not allowed as
>> arguments to the byte constructor. Thinking about it, why would it be
>> otherwise? And if you're mixing str-strings and unicode-strings, that
>> means the str-strings you're sometimes giving are actually not byte
>> strings, but character strings anyhow, so you should be encoding
>> those too. bytes(s_or_U.encode('utf-8')) is a perfectly good  
>> spelling.
> Actually, I think you mean:
>
> if isinstance(s_or_U, str):
> s_or_U = s_or_U.decode('utf-8')
>
> b = bytes(s_or_U.encode('utf-8'))
>
> Or maybe:
>
> if isinstance(s_or_U, unicode):
> s_or_U = s_or_U.encode('utf-8')
>
> b = bytes(s_or_U)
>
> Which is why I proposed that the boilerplate logic get moved *into*  
> the bytes constructor.  I think this use case is going to be common  
> in today's Python, but in truth I'm not as sure what bytes() will  
> get used *for* in today's Python.  I'm probably overprojecting  
> based on the need to use str objects now, but bytes aren't going to  
> be a replacement for str for a good while anyway.


I most certainly *did not* mean that. If you are mixing together str  
and unicode instances, the str instances _must be_ in the default  
encoding (ascii). Otherwise, you are bound for failure anyhow, e.g.  
''.join(['\x95', u'1']). Str is used for two things right now: 1) a  
byte string. 2) a unicode string restricted to 7bit ASCII. These two  
uses are separate and you cannot mix them without causing disaster.

You've created an interface which can take either a utf8 byte-string,  
or unicode character string. But that's wrong and can only cause  
problems. It should take either an encoded bytestring, or a unicode  
character string. Not both. If it takes a unicode character string,  
there are two ways of spelling that in current python: a "str" object  
with only ASCII in it, or a "unicode" object with arbitrary  
characters in it. bytes(s_or_U.encode('utf-8')) works correctly with  
both.

James
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Neil Schemenauer
On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote:
> On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
> > "\x80".encode('latin-1')
> 
> But in 2.5 we can't change that to return a bytes object without
> creating HUGE incompatibilities.

People could spell it bytes(s.encode('latin-1')) in order to make it
work in 2.X.  That spelling would provide a way of ensuring the type
of the return value.

> You missed the part where I said that introducing the bytes type
> *without* a literal seems to be a good first step. A new type, even
> built-in, is much less drastic than a new literal (which requires
> lexer and parser support in addition to everything else).

Are you concerned about the implementation effort?  If so, I don't
think that's justified since adding a new string prefix should be
pretty straightforward (relative to rest of the effort involved).
Are you comfortable with the proposed syntax?

  Neil
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> >>In py3k, when the str object is eliminated, then what do you have?
> >>Perhaps
> >>- bytes("\x80"), you get an error, encoding is required. There is no
> >>such thing as "default encoding" anymore, as there's no str object.
> >>- bytes("\x80", encoding="latin-1"), you get a bytestring with a
> >>single byte of value 0x80.
> >
> > Yes to both again.
>
> Please reconsider, and don't give bytes() an encoding= argument.
> It doesn't need one. In Python 3, people should write
>
>   "\x80".encode("latin-1")
>
> if they absolutely want to, although they better write
>
>   bytes([0x80])
>
> Now, the first form isn't valid in 2.5, but
>
>   bytes(u"\x80".encode("latin-1"))
>
> could work in all versions.

In 3.0, I agree that .encode() should return a bytes object.

I'd almost be convinced that in 2.x bytes() doesn't need an encoding
argument, except it will require excessive copying.
bytes(u.encode("utf8")) will certainly use 2*len(u) bytes  space (plus
a constant); bytes(u, "utf8") only needs len(u) bytes. In 3.0,
bytes(s.encode(xxx)) would also create an extra copy, since the bytes
type is mutable (we all agree on that, don't we?).

I think that's a good enough argument for 2.x. We could keep the
extended API as an alternative form in 3.x, or automatically translate
calls to bytes(x, y) into x.encode(y).

BTW I think we'll need a new PEP instead of PEP 332. The latter has
almost no details relevant to this discussion, and it seems to treat
bytes as a near-synonym for str in 2.x. That's not the way this
discussion is going it seems.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, Thomas Wouters <[EMAIL PROTECTED]> wrote:
> On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:
>
> > But adding an encoding doesn't help. The str.encode() method always
> > assumes that the string itself is ASCII-encoded, and that's not good
> > enough:
>
> > >>> "abc".encode("latin-1")
> > 'abc'
> > >>> "abc".decode("latin-1")
> > u'abc'
> > >>> "abc\xf0".decode("latin-1")
> > u'abc\xf0'
> > >>> "abc\xf0".encode("latin-1")
> > Traceback (most recent call last):
> >   File "", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> > 3: ordinal not in range(128)

(Note that I've since been convinced that bytes(s) where type(s) ==
str should just return a bytes object containing the same bytes as s,
regardless of encoding. So basically you're preaching to the choir
now. The only remaining question is what if anything to do with an
encoding argment when the first argument is of type str...)

> These comments disturb me. I never really understood why (byte) strings grew
> the 'encode' method, since 8-bit strings *are already encoded*, by their
> very nature. I mean, I understand it's useful because Python does
> non-unicode encodings like 'hex', but I don't really understand *why*. The
> benefits don't seem to outweigh the cost (but that's hindsight.)

It may also have something to do with Jython compatibility (which has
str and unicode being the same thing) or 3.0 future-proofing.

> Directly encoding a (byte) string into a unicode encoding is mostly useless,
> as you've shown. The only use-case I can think of is translating ASCII in,
> for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op,
> unless the system encoding isn't 'ascii' (and that's pretty rare, and not
> something a Python programmer should depend on.) On the other hand, the fact
> that (byte) strings have an 'encode' method creates a lot of confusion in
> unicode-newbies, and causes programs to break only when input is non-ASCII.
> And non-ASCII input just happens too often and too unpredictably in
> 'real-world' code, and not enough in European programmers' tests ;P

Oh, there are lots of ways that non-ASCII input can break code, you
don't have to invoke encode() on str objects to get that effect. :/

> Unicode objects and strings are not the same thing. We shouldn't treat them
> as the same thing.

Well in 3.0 they *will* be the same thing, and in Jython they already are.

> They share an interface (like lists and tuples do), and
> if you only use that interface, treating them as the same kind object is
> mostly ok. They actually share *less* of an interface than lists and tuples,
> though, as comparing strings to unicode objects can raise an exception,
> whereas comparing lists to tuples is not expected to.

No, it causes silent surprises since [1,2,3] != (1,2,3).

> For anything less
> trivial than indexing, slicing and most of the string methods, and anything
> what so ever involving non-ASCII (or, rather, non-system-encoding), unicode
> objects and strings *must* be treated separately. For instance, there is no
> correct way to do:
>
>   s.split("\x80")
>
> unless you know the type of 's'. If it's unicode, you want u"\x80" instead
> of "\x80". If it's not unicode, splitting "\x80" may not even be sensible,
> but you wouldn't know from looking at the code -- maybe it expects a
> specific encoding (or encoding family), maybe not. As soon as you deal with
> unicode, you need to really understand the concept, and too many programmers
> don't. And it's very hard to tell from someone's comments whether they fail
> to understand or just get some of the terminology wrong; that's why Guido's
> comments about 'encoding a byte string' and 'what if the file encoding is
> Unicode' scare me. The unicode/string mixup almost makes me wish Python
> was statically typed.

I'm mostly trying to reflect various broken mental models that users
may have. Believe me, my own confusion is nothing compared to the
confusion that occurs in less gifted users. :-)

The only use case for mixing ASCII and Unicode that I *wanted* to work
right was the mixing of pure ASCII strings (typically literals) with
Unicode data. And that works.

Where things unfortunately fall flat is when you start reading data
from files or interactive input and it gives you some encoded str
object instead of a Unicode object. Our mistake was that we didn't
foresee this clearly enough. Perhaps open(filename).read(), where the
file contains non-ASCII bytes, should have been changed to either
return a Unicode string (if an encoding can somehow be guessed), or
raise an exception, rather than returning an str object in some
unknown (and usually unknowable) encoding.

I hope to fix that in 3.0 too, BTW.

> So please, please, please don't make the mistake of 'doing something' with
> the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string.
> It wouldn't actually be usable except for the same things as 'str

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> I'm starting to wonder, do we really need anything fancy?  Wouldn't it
> be sufficient to have a way to compactly store 8-bit integers?
>
> In 2.x we could convert unicode like this:
> bytes(ord(c) for c in u"It's...".encode('utf-8'))

Yuck.

> u"It's...".byteencode('utf-8')  # Shortcut for above

Yuck**2. I'd like to avoid adding new APIs to existing types to return
bytes instead of str. (It's okay to change existing APIs to *accept*
bytes as an alternative to str though.)

> In 3.0 it changes to:
> "It's...".encode('utf-8')
> u"It's...".byteencode('utf-8')  # Same as above, kept for compatibility

No. 3.0 won't have "backward compatibility" features. That's the whole
point of 3.0.

> Passing a str or unicode directly to bytes() would be an error.
> repr(bytes(...)) would produce bytes([1,2,3]).

I'm fine with that.

> Probably need a __bytes__() method that print can call, or even better
> a __print__(file) method[0].  The write() methods would of course have
> to support bytes objects.

Right on the latter.

> I realize it would be odd for the interactive interpret to print them
> as a list of ints by default:
> >>> u"It's...".byteencode('utf-8')
> [73, 116, 39, 115, 46, 46, 46]

No. This prints the repr() which should include the type. bytes([73,
116, 39, 115, 46, 46, 46]) is the right thing to print here.

> But maybe it's time we stopped hiding the real nature of bytes from users?

That's the whole point.

> [0] By this I mean calling objects recursively and telling them what
> file to print to, rather than getting a temporary string from them and
> printing that.  I always wondered why you could do that from C
> extensions but not from Python code.

I want to keep the Python-level API small.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/13/06, Barry Warsaw <[EMAIL PROTECTED]> wrote:
> This makes me think I want an unsigned byte type, which b[0] would
> return.  In another thread I think someone mentioned something about
> fixed width integral types, such that you could have an object that
> was guaranteed to be 8-bits wide, 16-bits wide, etc.   Maybe you also
> want signed and unsigned versions of each.  This may seem like YAGNI
> to many people, but as I've been working on a tightly embedded/
> extended application for the last few years, I've definitely had
> occasions where I wish I could more closely and more directly model
> my C values as Python objects (without using the standard workarounds
> or writing my own C extension types).

So I'm taking that the specific properties you want to model are the
overflow behavior, right? N-bit unsigned is defined as arithmethic mod
2**N; N-bit signed is a bit more tricky to define but similar. These
never overflow but instead just throw away bits in an exactly
specified manner (2's complement arithmetic).

While I personally am comfortable with writing (x+y) & 0x (for
16-bit unsigned), I can see that someone who spends a lot of time
doing arithmetic in this field might want specialized types.

But I'm not sure that that's what the Numeric folks want -- I believe
they're more interested in saving space, not in the mod 2**N
properties. So (here I'm to some extent guessing) they have different
array types whose elements are ints or floats of various widths; I'm
guessing they also have scalars of those widths for consistency or to
guide the creation of new arrays from scalars. I wouldn't be surprised
if, rather than requiring N-bit 2's complement, they would prefer more
flexible control over overflow -- e.g. ignore, warn, error, turn into
NaN, etc.

> But anyway, without hyper-generalizing, it's still worth asking
> whether a bytes type is just a container of byte objects, where the
> contained objects would be distinct, fixed 8-bit unsigned integral
> types.

There's certainly a point to treating bytes as ints; I don't know if
it's more compelling than to treating them as unit bytes. But if we
decide that the bytes types contains ints, b[0] should return a plain
int (whose value necessarily is in range(0, 256)), not some new
unsigned-8-bit type. And creating a bytes object from a list of ints
should accept any input values as long as their __index__ value is in
that same range.

I.e. bytes([1, 2L]) should be the same as bytes([1L, 2]); and
bytes([-1]) should raise a ValueError.

> > There's also the consideration for APIs that, informally, accept
> > either a string or a sequence of objects. Many of these exist, and
> > they are probably all being converted to support unicode as well as
> > str (if it makes sense at all). Should a bytes object be considered as
> > a sequence of things, or as a single thing, from the POV of these
> > types of APIs? Should we try to standardize how code tests for the
> > difference? (Currently all sorts of shortcuts are being taken, from
> > isinstance(x, (list, tuple)) to isinstance(x, basestring).)
>
> I think bytes objects are very much like string objects today --
> they're the photons of Python since they can act like either
> sequences or scalars, depending on the context.  For example, we have
> code that needs to deal with situations where an API can return
> either a scalar or a sequence of those scalars.  So we have a utility
> function like this:
>
> def thingiter(obj):
>  try:
>  it = iter(obj)
>  except TypeError:
>  yield obj
>  else:
>  for item in it:
>  yield item
>
> Maybe there's a better way to do this, but the most obvious problem
> is that (for our use cases), this fails for strings because in this
> context we want strings to act like scalars.  So we add a little test
> just before the "try:" like "if isinstance(obj, basestring): yield
> obj".  But that's yucky.
>
> I don't know what the solution is -- if there /is/ a solution short
> of special case tests like above, but I think the key observation is
> that sometimes you want your string to act like a sequence and
> sometimes you want it to act like a scalar.  I suspect bytes objects
> will be the same way.

I agree it's icky, and I'd rather not design APIs like that -- but I
can't help it that others continue to want to use that idiom. I also
agree that most likely we'll want to treat bytes the same as strings
here. But no basestring (bytes are mutable and don't behave like
sequences of characters).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/13/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> What would that imply for repr()?  To support eval(repr(x)) it would
> have to produce whatever format the source code includes to begin
> with.

I'm not sure that's a requirement. (I do think that in 2.x,
str(bytes(s)) == s should hold as long as type(s) == str.)

> If I understand correctly there's three main candidates:
> 1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x

I'm not sure what you mean, but I'm guessing you're thinking that the
repr() of a bytes object created from bytes('abc\xf0') would be

  bytes('abc\xf0')

under this rule. What's so bad about that?

> 2. Direct copying to str/unicode if it's only ascii values, switching
> to a list of hex literals if there's any non-ascii values

That works for me too. But why hex literals? As MvL stated, a list of
decimals would be just as useful.

> 3. b"foo" literal with ascii for all ascii characters (other than \
> and "), \xFF for individual characters that aren't ascii
>
> Given the choice I prefer the third option, with the second option as
> my runner up.  The first option just screams "silent errors" to me.

The 3rd is out of the running for many reasons.

I'm not sure I understand your "silent errors" fear; can you elaborate?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
> >On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> > > I didn't mean that it was the only purpose.  In Python 2.x, practical code
> > > has to sometimes deal with "string-like" objects.  That is, code that 
> > > takes
> > > either strings or unicode.  If such code calls bytes(), it's going to want
> > > to include an encoding so that unicode conversions won't fail.
> >
> >That sounds like a rather hypothetical example. Have you thought it
> >through? Presumably code that accepts both str and unicode either
> >doesn't care about encodings, but simply returns objects of the same
> >type as the arguments -- and then it's unlikely to want to convert the
> >arguments to bytes; or it *does* care about encodings, and then it
> >probably already has to special-case str vs. unicode because it has to
> >control how str objects are interpreted.
>
> Actually, it's the other way around.  Code that wants to output
> uninterpreted bytes right now and accepts either strings or Unicode has to
> special-case *unicode* -- not str, because str is the only "bytes type" we
> currently have.

But this is assuming that the str input is indeed uninterpreted bytes.
That may be a tacit assumption or agreement but it may be wrong. Also,
there are many ways to interpret "uninterpreted bytes" -- is it an
image, a sound file, or UTF-8 text? In 2 out of those 3, passing
unicode is more likely a bug than anything else (except in Jython).

> This creates an interesting issue in WSGI for Jython, which of course only
> has one (unicode-based) string type now.  Since there's no bytes type in
> Python in general, the only solution we could come up with was to treat
> such strings as latin-1:

I believe that's the general convention in Jython, as it matches the
default (albeit deprecated) conversion between bytes and characters in
Java itself.

>  http://www.python.org/peps/pep-0333.html#unicode-issues
>
> This is why I'm biased towards latin-1 encoding of unicode to bytes; it's
> "the same thing" as an uninterpreted string of bytes.

But in CPython this is not how this is generally done.

> I think the difference in our viewpoints is that you're still thinking
> "string" thoughts, whereas I'm thinking "byte" thoughts.  Bytes are just
> bytes; they don't *have* an encoding.

I think when one side of the equation is Unicode, in CPython, I can be
forgiven for thinking string thoughts, since Unicode is never used to
carry binary bytes in CPython.

You may have to craft some kind of different rule for Jython; it
doesn't have a default encoding used when str meets unicode.

> So, if you think of "converting a string to bytes" as meaning "create an
> array of numerals corresponding to the characters in the string", then this
> leads to a uniform result whether the characters are in a str or a unicode
> object.  In other words, to me, bytes(str_or_unicode) should be treated as:
>
>  bytes(map(ord, str_or_unicode))
>
> In other words, without an encoding, bytes() should simply treat str and
> unicode objects *as if they were a sequence of integers*, and produce an
> error when an integer is out of range.  This is a logical and consistent
> interpretation in the absence of an encoding, because in that case you
> don't care about the encoding - it's just raw data.

I see your point (now that you mentioned Jython). But I still don't
think that this is a good default for CPython.

> If, however, you include an encoding, then you're stating that you want to
> encode the *meaning* of the string, not merely its integer values.

Note that in Python 3000 we won't be using str/unicode to carry
integer values around, since we will have the bytes type. So there, it
makes sense to think of the conversion to always involve an encoding,
possibly a default one. (And I think the default might more usefully
be UTF-8 then.)

> >What would bytes("abc\xf0", "latin-1") *mean*? Take the string
> >"abc\xf0", interpret it as being encoded in XXX, and then encode from
> >XXX to Latin-1. But what's XXX? As I showed in a previous post,
> >"abc\xf0".encode("latin-1") *fails* because the source for the
> >encoding is assumed to be ASCII.
>
> I'm saying that XXX would be the same encoding as you specified.  i.e.,
> including an encoding means you are encoding the *meaning* of the string.

That would be the same as ignoring the encoding argument when the
input is str in CPython 2.x, right? I believe we started out saying we
didn't want to ignore the encoding. Perhaps we need to reconsider
that, given the Jython requirement? Then code that converts str to
bytes and needs to be portable between Jython and CPython could write

  b = bytes(s, "latin-1")

> However, I believe I mainly proposed this as an alternative to having
> bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I
> think is probably a saner default.

Sorry, i still don't buy that

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, Barry Warsaw <[EMAIL PROTECTED]> wrote:
> A related question: what would bytes([104, 101, 108, 108, 111, 8004])
> return?  An exception hopefully.

Absolutely.

> I also think you'd want bytes([x
> for x in some_bytes_object]) to return an object equal to the original.

You mean if types(some_bytes_object) is bytes? Yes. But that doesn't
constrain the API much.

Anyway, I'm now convinced that bytes should act as an array of ints,
where the ints are restricted to range(0, 256) but have type int.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> > As Phillip guessed, I was indeed thinking about introducing bytes()
> > sooner than that, perhaps even in 2.5 (though I don't want anything
> > rushed).
>
> Hmm, that is probably going to be too early. As the thread shows
> there are lots of things to take into account, esp. since if you
> plan to introduce bytes() in 2.x, the upgrade path to 3.x would
> have to be carefully planned. Otherwise, we end up introducing
> a feature which is meant to prepare for 3.x and then we end up
> causing breakage when the move is finally implemented.

You make a good point. Someone probably needs to write up a new PEP
summarizing this discussion (or rather, consolidating the agreement
that is slowly emerging, where there is agreement, and summarizing the
key open questions).

> > Even in Py3k though, the encoding issue stands -- what if the file
> > encoding is Unicode? Then using Latin-1 to encode bytes by default
> > might not by what the user expected. Or what if the file encoding is
> > something totally different? (Cyrillic, Greek, Japanese, Klingon.)
> > Anything default but ASCII isn't going to work as expected. ASCII
> > isn't going to work as expected either, but it will complain loudly
> > (by throwing a UnicodeError) whenever you try it, rather than causing
> > subtle bugs later.
>
> I think there's a misunderstanding here: in Py3k, all "string"
> literals will be converted from the source code encoding to
> Unicode. There are no ambiguities - a Klingon character will still
> map to the same ordinal used to create the byte content regardless
> of whether the source file is encoded in UTF-8, UTF-16 or
> some Klingon charset (are there any ?).

OK, so a string (literal or otherwise) containing a Klingon character
won't be acceptable to the bytes() constructor in 3.0. It shouldn't be
in 2.x either then.

I still think that someone who types a file in Latin-1 and enters
non-ASCII Latin-1 characters in a string literal and then passes it to
the bytes() constructor might expect to get bytes encoded in Latin-1,
and someone who types a file in UTF-8 and enters non-ASCII Unicode
characters might expect to get UTF-8-encoded bytes. Since they can't
both get what they want, we should disallow both, and only allow
ASCII.

> Furthermore, by restricting to ASCII you'd also outrule hex escapes
> which seem to be the natural choice for presenting binary data in
> literals - the Unicode representation would then only be an
> implementation detail of the way Python treats "string" literals
> and a user would certainly expect to find e.g. \x88 in the bytes object
> if she writes bytes('\x88').

I guess we'l just have to disappoint her. Too bad for the person who
wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to
write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO
and certainly easier than a *mixture* of hex and ASCII like
'\xabc\xdef'.

> But maybe you have something different in mind... I'm talking
> about ways to create bytes() in Py3k using "string" literals.

I'm not sure that's going to be common practive except for ASCII
characters used in network protocols.

> >> While we're at it: I'd suggest that we remove the auto-conversion
> >> from bytes to Unicode in Py3k and the default encoding along with
> >> it.
> >
> > I'm not sure which auto-conversion you're talking about, since there
> > is no bytes type yet. If you're talking about the auto-conversion from
> > str to unicode: the bytes type should not be assumed to have *any*
> > properties that the current str type has, and that includes
> > auto-conversion.
>
> I was talking about the automatic conversion of 8-bit strings to
> Unicode - which was a key feature to make the introduction of
> Unicode less painful, but will no longer be necessary in Py3k.

OK. The bytes type certainly won't have this property.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
> People could spell it bytes(s.encode('latin-1')) in order to make it
> work in 2.X.  That spelling would provide a way of ensuring the type
> of the return value.

At the cost of an extra copying step.

[Guido]
> > You missed the part where I said that introducing the bytes type
> > *without* a literal seems to be a good first step. A new type, even
> > built-in, is much less drastic than a new literal (which requires
> > lexer and parser support in addition to everything else).
>
> Are you concerned about the implementation effort?  If so, I don't
> think that's justified since adding a new string prefix should be
> pretty straightforward (relative to rest of the effort involved).

Not so much the implementation but also the documentation, updating
3rd party Python preprocessors, etc.

> Are you comfortable with the proposed syntax?

Not entirely, since I don't know what b"abcdef" would mean
(where  is a Unicode Euro character typed in whatever source
encoding was used).

Instead of b"abc" (only ASCII) you could write bytes("abc"). Instead
of b"\xf0\xff\xee" you could write bytes([0xf0, 0xff, 0xee]).

The key disconnect for me is that if bytes are not characters, we
shouldn't use a literal notation that resembles the literal notation
for characters. And there's growing consensus that a bytes type should
be considered as an array of (8-bit unsigned) ints.

Also, bytes objects are (in my mind anyway) mutable. We have no other
literal notation for mutable objects. What would the following code
print?

  for i in range(2):
b = b"abc"
print b
b[0] = ord("A")

Would the second output line print abc or Abc?

I guess the only answer that makes sense is that it should print abc
both times; but that means that b"abc" must be internally implemented
by creating a new bytes object each time. Perhaps the implementation
effort isn't so minimal after all...

(PS why is there a reply-to in your email the excludes you from the
list of recipients but includes me?)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Barry Warsaw
On Tue, 2006-02-14 at 15:13 -0800, Guido van Rossum wrote:

> So I'm taking that the specific properties you want to model are the
> overflow behavior, right? N-bit unsigned is defined as arithmethic mod
> 2**N; N-bit signed is a bit more tricky to define but similar. These
> never overflow but instead just throw away bits in an exactly
> specified manner (2's complement arithmetic).

That would be my use case, yep.

> While I personally am comfortable with writing (x+y) & 0x (for
> 16-bit unsigned), I can see that someone who spends a lot of time
> doing arithmetic in this field might want specialized types.

I'd put it in the "annoying, although there exists a workaround that
might confound newbies" category.  Which means it's definitely not
urgent enough to address for 2.5 -- if ever -- especially given your
current stance on bytes(bunch_of_ints)[0].  The two are of course
separate issues, but thinking about one lead to the other.

> But I'm not sure that that's what the Numeric folks want -- I believe
> they're more interested in saving space, not in the mod 2**N
> properties. 

Could be.  I don't care about space savings.  And I definitely have no
clue what the Numeric folks want. ;)

> There's certainly a point to treating bytes as ints; I don't know if
> it's more compelling than to treating them as unit bytes. But if we
> decide that the bytes types contains ints, b[0] should return a plain
> int (whose value necessarily is in range(0, 256)), not some new
> unsigned-8-bit type. And creating a bytes object from a list of ints
> should accept any input values as long as their __index__ value is in
> that same range.
> 
> I.e. bytes([1, 2L]) should be the same as bytes([1L, 2]); and
> bytes([-1]) should raise a ValueError.

That seems fine to me.

> I agree it's icky, and I'd rather not design APIs like that -- but I
> can't help it that others continue to want to use that idiom. I also
> agree that most likely we'll want to treat bytes the same as strings
> here. But no basestring (bytes are mutable and don't behave like
> sequences of characters).

That's interesting.  So bytes really behave a lot more like some weird
string/lists hybrid then? It makes some sense.  You read 801 bytes from
a binary file, twiddle bytes 223 and 741 and then write those bytes back
out to a different binary file.

If we don't inherit from basestring, what I'm worried about is that for
those who do continue to use the idiom described previously, we'll have
to extend our isinstance() to include both basestring and bytes.  Which
definitely gets ickier.  But if bytes are mutable, as make sense, then
it also makes sense that they don't inherit from basestring.

BTW, using that idiom is a bit of a hedge against such API (which you
may not control).  It allows us to say "okay, at /this/ point I don't
know whether I have a scalar or a sequence, but from this point forward,
I know I have something I can safely iterate over."

I wonder if it makes sense to add a more fundamental abstract base class
that can be used as a marker for "photonic behavior".  I don't know what
that class would be called, but you'd then have a hierarchy like this:

photonic
basestring
str
unicode
bytes

OTOH, it seems like a lot to add for a specialized (and some would say
dubious) use case.

-Barry



signature.asc
Description: This is a digitally signed message part
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Guido van Rossum
On 2/14/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 2/14/06, Neil Schemenauer  wrote:
> > People could spell it bytes(s.encode('latin-1')) in order to make it
> > work in 2.X.
>
> Guido wrote:
> > At the cost of an extra copying step.
>
> That sounds like an implementation issue.  If it is important
> enough to matter, then why not just add some smarts to the
> bytes constructor?

Short answer: you can't.

> If the argument is a str, and the constructor owns the only
> reference, then go ahead and use the argument's own
> underlying array; the string itself will be deallocated when
> (or before) the constructor returns, so no one else can use
> it expecting an immutable.

Hard to explain, but the VM usually keeps an extra reference on the
stack so the refcount is never 1. But you can't rely on that so
assuming that it's safe to reuse the storage if it's >1. Also, since
the str's underlying array is allocated inline with the str header,
this require str and bytes to have the same object layout. But since
bytes are mutable, they can't.

Summary: you don't understand the implementation well enough to
suggest these kinds of things.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Guido van Rossum wrote:

> The only remaining question is what if anything to do with an
> encoding argment when the first argument is of type str...)

 From what you said earlier about str in 2.x being
interpretable as a unicode string which contains
only ascii, it seems to me that if you say

   bytes(s, encoding)

where s is a str, then by the presence of the encoding
argument you're saying that you want s to be treated as
unicode and encoded using the specified encoding.
So the result should be the same as

   bytes(u, encoding)

where u is a unicode string containing the same code
points as s. This implies that it should be an error
if s contains non-ascii characters.

This interpretation would satisfy the requirement for
a single call signature covering both unicode and
str-used-as-ascii-characters, while providing a
different call signature (without encoding) for
str-used-as-bytes.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Guido van Rossum wrote:
> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> 
>>At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>>
>>>On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>>>
>>>What would bytes("abc\xf0", "latin-1") *mean*? 
>>
>>I'm saying that XXX would be the same encoding as you specified.  i.e.,
>>including an encoding means you are encoding the *meaning* of the string.

No, this is wrong. As I understand it, the encoding
argument to bytes() is meant to specify how to *encode*
characters into the bytes object. If you want to be able
to specify how to *decode* a str argument as well, you'd
need a third argument.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Ron Adam
Greg Ewing wrote:
> Guido van Rossum wrote:
>> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>>
>>> At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>>>
 On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:

 What would bytes("abc\xf0", "latin-1") *mean*? 
>>> I'm saying that XXX would be the same encoding as you specified.  i.e.,
>>> including an encoding means you are encoding the *meaning* of the string.
> 
> No, this is wrong. As I understand it, the encoding
> argument to bytes() is meant to specify how to *encode*
> characters into the bytes object. If you want to be able
> to specify how to *decode* a str argument as well, you'd
> need a third argument.

I'm not sure I understand why this would be needed?  But maybe it's 
still too early to pin anything down.

My first impression and thoughts were:  (and seems incorrect now)

 bytes(object) ->  byte sequence of objects value

Basically a "memory dump" of objects value.  And so...

 object(bytes) ->  copy of original object

This would reproduce a copy of the original object as long as the from 
and to object are the same type with no encoding needed.  If they are 
different then you would get garbage, or an error. But that would be a 
programming error and not a language issue. It would be up to the 
programmer to not do that.

Of course this is one of those easier to say than do concepts I'm sure.


And I was thinking a bytes argument of more than one item would indicate 
a byte sequence.

 bytes(1,2,3)  ->  bytes([1,2,3])

Where any values above 255 would give an error,  but it seems an 
explicit list is preferred.  And that's fine because it creates a way 
for bytes to know how to handle everything else. (I think)

bytes([1,2,3]]  -> bytes[(1,2,3)]

Which is fine... so ???

b = bytes(0L) ->  bytes([0,0,0,0])

long(b) ->  0Lconvert it back to 0L

And ...

b = bytes([0L])  ->  bytes([0])  # a single byte

int(b) ->  0convert it back to 0
long(b) ->  0L

It's up to the programmer to know if it's safe. Working with raw data is 
always a programmer needs to be aware of what's going on thing.

But would it be any different with strings?  You wouldn't ever want to 
encode one type's bytes into a different type directly. It would be 
better to just encode it back to the original type, then use *it's* 
encoding method to change it.

so...

   b = bytes(s)  ->  bytes( raw sequence of bytes )

Weather or not you get a single byte per char or multiple bytes per 
character would depend on the strings encoding.

   s = str(bytes, encoding)  ->  original string

You need to specify it here, because there is more than one sting 
encoding. To avoid encodings entirely we would need a type for each 
encoding. (which isn't really avoiding anything) And it's the "raw data 
so programmer needs to be aware" situation again. Don't decode to 
something other than what it is.

If someone needs automatic encoding/decoding, then they probably should 
write a class to do what they want.  Something roughly like...

   class bytekeeper(object):
  b = None
  t = None
  e = None
  def __init__(self, obj, enc='bytes')   # or whatever encoding
 self.e = enc
 self.t = type(obj)
 self.b = bytes(obj)
  def decode(self):
 ...

Would we be able to subclass bytes?

 class bytekeeper(bytes):   ?
...


Ok.. enough rambling... I wonder how much of this is way out in left 
field.  ;)

cheers,
  Ronald Adam
































And as fa




In this case the encoding argument would only be needed not to













___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Adam Olsen
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 2/14/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > In 3.0 it changes to:
> > "It's...".encode('utf-8')
> > u"It's...".byteencode('utf-8')  # Same as above, kept for compatibility
>
> No. 3.0 won't have "backward compatibility" features. That's the whole
> point of 3.0.

Conceded.


> > I realize it would be odd for the interactive interpret to print them
> > as a list of ints by default:
> > >>> u"It's...".byteencode('utf-8')
> > [73, 116, 39, 115, 46, 46, 46]
>
> No. This prints the repr() which should include the type. bytes([73,
> 116, 39, 115, 46, 46, 46]) is the right thing to print here.

Typo, sorry :)


--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Adam Olsen
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 2/13/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > If I understand correctly there's three main candidates:
> > 1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x
>
> I'm not sure what you mean, but I'm guessing you're thinking that the
> repr() of a bytes object created from bytes('abc\xf0') would be
>
>   bytes('abc\xf0')
>
> under this rule. What's so bad about that?

See below.


> > 2. Direct copying to str/unicode if it's only ascii values, switching
> > to a list of hex literals if there's any non-ascii values
>
> That works for me too. But why hex literals? As MvL stated, a list of
> decimals would be just as useful.

PEBKAC.  Yeah, decimals are simpler and shorter even.


> > 3. b"foo" literal with ascii for all ascii characters (other than \
> > and "), \xFF for individual characters that aren't ascii
> >
> > Given the choice I prefer the third option, with the second option as
> > my runner up.  The first option just screams "silent errors" to me.
>
> The 3rd is out of the running for many reasons.
>
> I'm not sure I understand your "silent errors" fear; can you elaborate?

I think it's that someone will create a unicode object with real
latin-1 characters and it'll get passed through without errors, the
code assuming it's 8bit-as-latin-1.  If they had put other unicode
characters in they would have gotten an exception instead.

However, at this point all the posts on latin-1 encoding/decoding have
become so muddled in my mind that I don't know what they're
suggesting.  I think I'll wait for the pep to clear that up.

--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Adam Olsen
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Not entirely, since I don't know what b"abcdef" would mean
> (where  is a Unicode Euro character typed in whatever source
> encoding was used).

SyntaxError I would hope.  Ascii and hex escapes only please. :)

Although I'm not arguing for or against byte literals.  They do make
for a much terser form, but they're not strictly necessary.


--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-14 Thread Greg Ewing
Ron Adam wrote:

> My first impression and thoughts were:  (and seems incorrect now)
> 
>  bytes(object) ->  byte sequence of objects value
> 
> Basically a "memory dump" of objects value.

As I understand the current intentions, this is correct.
The bytes constructor would have two different signatures:

(1)   bytes(seq) --> interprets seq as a sequence of
 integers in the range 0..255,
 exception otherwise

(2a)  bytes(str, encoding) --> encodes the characters of
(2b)  bytes(unicode, encoding) the string using the specified
   encoding

In (2a) the string would be interpreted as containing
ascii characters, with an exception otherwise. In 3.0,
(2a) will disappear leaving only (1) and (2b).

> And I was thinking a bytes argument of more than one item would indicate 
> a byte sequence.
> 
>  bytes(1,2,3)  ->  bytes([1,2,3])

But then you have to test the argument in the one-argument
case and try to guess whether it should be interpreted as
a sequence or an integer. Best to avoid having to do that.

> Which is fine... so ???
> 
> b = bytes(0L) ->  bytes([0,0,0,0])

No, bytes(0L) --> TypeError because 0L doesn't implement
the iterator protocol or the buffer interface.

I suppose long integers might be enhanced to support the
buffer interface in 3.0, but that doesn't seem like a good
idea, because the bytes you got that way would depend on
the internal representation of long integers. In particular,

   bytes(0x12345678L)

via the buffer interface would most likely *not* give you
bytes[0x12, 0x34, 0x56, 0x78]).

Maybe types should grow a __bytes__ method?

Greg
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Stephen J. Turnbull
> "M" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

M> James Y Knight wrote:

>> Nice and simple.

M> Albeit, too simple.

M> The above approach would basically remove the possibility to
M> easily create bytes() from literals in Py3k, since literals in
M> Py3k create Unicode objects, e.g. bytes("123") would not work
M> in Py3k.

No, it just rules out a builtin easy way to create bytes() from
literals.

But who needs to do that?  codec writers and people implementing wire
protocols with bytes() that look like character strings but aren't.
OK, so this makes life hard on codec writers.  But those implementing
wire protocols can use existing codecs, presumably 'ascii' will do 99%
of the time:

def make_wire_token (unicode_string, encoding='ascii'):
return bytes(unicode_string.encode(encoding))

Everybody else is just asking for trouble by using bytes() for
character strings.  It would really be desirable to have "string" be a
Unicode literal in Py3k, and u"string" a syntax error.

M> To prevent [people from learning to write "bytes('string')" in
M> 2.x and expecting that to work in Py3k], you'd have to outrule
M> bytes() construction from strings altogether, which doesn't
M> look like a viable option either.

Why not?  Either bytes() are the same as strings, in which case why
change the name? or they're not, in which case we ask people to jump
through the required hoops to create them.  Maybe I'm missing some
huge use case, of course, but it looks to me like the use cases are
pretty specialized, and are likely to involve explicit coding anyway.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Just van Rossum
Guido van Rossum wrote:

> If bytes support the buffer interface, we get another interesting
> issue -- regular expressions over bytes. Brr.

We already have that:

  >>> import re, array
  >>> re.search('\2', array.array('B', [1, 2, 3, 4])).group()
  array('B', [2])
  >>> 

Not sure whether to blame array or re, though...

Just
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Bengt Richter
On Tue, 14 Feb 2006 12:31:07 -0700, Neil Schemenauer <[EMAIL PROTECTED]> wrote:

>On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote:
>> On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
>> > "\x80".encode('latin-1')
>> 
>> But in 2.5 we can't change that to return a bytes object without
>> creating HUGE incompatibilities.
>
>People could spell it bytes(s.encode('latin-1')) in order to make it
>work in 2.X.  That spelling would provide a way of ensuring the type
>of the return value.
UIAM spelling it
bytes(map(ord, s))
or
bytes(s)  # (bytes would do above internally)

would work for str or unicode and would be forward compatible.
or
bytes(s, encoding_name) # if standard mapping is not desired

BTW, ord(u'x') has the effect of u'x'.encode('latin-1')
Note:
 >>> s256 = ''.join(chr(i) for i in xrange(256))
 >>> assert s256.decode('latin-1') == u''.join(unichr(ord(c)) for c in s256)
 >>> assert map(ord, s256.decode('latin-1')) == map(ord, s256) == range(256)

But this does *not* mean bytes has an implicit encoding!! It just means
there is a useful 1:1 mapping between the possible bytes values and the
first 256 unicode *characters*, remembering that the latter are *characters*
quite apart from whatever encoding the code source may have.

This is a nice safe 1:1 abstract correspondence ISTM.
>
>> You missed the part where I said that introducing the bytes type
>> *without* a literal seems to be a good first step. A new type, even
>> built-in, is much less drastic than a new literal (which requires
>> lexer and parser support in addition to everything else).
>
>Are you concerned about the implementation effort?  If so, I don't
>think that's justified since adding a new string prefix should be
>pretty straightforward (relative to rest of the effort involved).
>Are you comfortable with the proposed syntax?
>

I'm -1 on special literal at this point. I think a special text-like literal
would be misleading, because it suggests that bytes is somehow in the
string family of types, which IMO it really isn't.
IMO it's semantically more of a builtin array.array('B').

If we adopt the ord/unichr mappings for strings to/from bytes, and
of course init also from a suitable integer sequence, we AGNI, I think.

Using non-ascii non-escaped characters in string literals for specifying
str ord values (as opposed to characters) is bad practice, but escaped
ascii-in-whatever-source-encoding and 
native_literal_in_source_encoding.decode(source_encoding)
seem to work:

 >>> for enc in 'cp437 latin-1 utf-8'.split():
 ... print '\n< %s >'%enc
 ... print mkretesc(enc, 0xf6)[1].decode(enc)
 ... print repr(mkretesc(enc, 0xf6)[1])
 ... print mkretesc(enc, 0xf6)[0]()
 ... t = mkretesc(enc, 0xf6)[0]()
 ... print t[0], t[1], t[2]
 ... print
 ...
 
 < cp437 >
 # -*- coding: cp437 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('cp437')
 
 "# -*- coding: cp437 -*-\ndef foof6(): return '\\xf6', '\x94', 
'\x94'.decode('cp437')\n"
 ('\xf6', '\x94', u'\xf6')
 ÷ ö ö
 
 
 < latin-1 >
 # -*- coding: latin-1 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('latin-1')
 
 "# -*- coding: latin-1 -*-\ndef foof6(): return '\\xf6', '\xf6', 
'\xf6'.decode('latin-1')\n"
 ('\xf6', '\xf6', u'\xf6')
 ÷ ÷ ö
 
 
 < utf-8 >
 # -*- coding: utf-8 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('utf-8')
 
 "# -*- coding: utf-8 -*-\ndef foof6(): return '\\xf6', '\xc3\xb6', 
'\xc3\xb6'.decode('utf-8')\n"
 
 ('\xf6', '\xc3\xb6', u'\xf6')
 ÷ +¦ ö
 
The source looks the same viewed as characters, but you can see the differences 
in the repr values.
But the consequence of source-encoding ord values determining str values is 
that if e.g. you imported
this foo function from variously encoded sources, only the escaped and unicode 
have the proper ord value.
The middle one comes from the native literal source encoding.

So until str becomes unicode, ascii or ascii escapes are a must for 
ord-specifying. Afer str becomes unicode,
escapes will still work, but the unichr/ord symmetry will allow using the full 
first 256 unicode characters
to specify byte type values if desired. (This happens to correspond to latin-1, 
but don't mention it ;-)

It would make possible a round-trippable repr as bytes('...')
using ascii+escaped ascii, and full-256 unicode string literals 
backwards-compatibly after py3k.
Have I missed a pitfall? Hope the output got through to your screen. The first 
and last in the 3-character
lines should always be division sign and umlaut o. The problematical middle 
ones should be cp437 translations
of the middle hex values, since that is the screen I copied from (umluat o, 
division sign, and plus, vertical_bar
for the translation of the utf-8 encoding pair. That one illustrates the 
problem of returning a "character"
encoded in utf-8 thinking single-byte ord value.).

BTW, should bytes be freezable?

Regards,
Bengt Richter

_

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Ron Adam
Greg Ewing wrote:
> Ron Adam wrote:
> 
>> My first impression and thoughts were:  (and seems incorrect now)
>>
>>  bytes(object) ->  byte sequence of objects value
>>
>> Basically a "memory dump" of objects value.
> 
> As I understand the current intentions, this is correct.
> The bytes constructor would have two different signatures:
> 
> (1)   bytes(seq) --> interprets seq as a sequence of
>  integers in the range 0..255,
>  exception otherwise
> 
> (2a)  bytes(str, encoding) --> encodes the characters of
> (2b)  bytes(unicode, encoding) the string using the specified
>encoding
> 
> In (2a) the string would be interpreted as containing
> ascii characters, with an exception otherwise. In 3.0,
> (2a) will disappear leaving only (1) and (2b).

I was presuming it would be done in C code and it will just need a 
pointer to the first byte, memchr(), and then read n bytes directly into 
a new memory range via  memcpy(). But I don't know if that's possible 
with Pythons object model.  (My C skills are a bit rusty as well)

However, if it's done with a Python iterator and then each item is 
translated to bytes in a sequence, (much slower), an encoding will need 
to be known for it to work correctly.  Unfortunately Unicode strings 
don't set an attribute to indicate it's own encoding. So bytes() can't 
just do encoding = s.encoding to find out, it would need to be specified 
in this case.

And that should give you a byte object that is equivalent to the bytes 
in memory, providing Python doesn't compress data internally to save 
space. (?, I don't think it does)

I'd prefer the first version *if possible* because of the performance.

>> And I was thinking a bytes argument of more than one item would indicate 
>> a byte sequence.
>>
>>  bytes(1,2,3)  ->  bytes([1,2,3])
> 
> But then you have to test the argument in the one-argument
> case and try to guess whether it should be interpreted as
> a sequence or an integer. Best to avoid having to do that.

Yes, I agree.

>> Which is fine... so ???
>>
>> b = bytes(0L) ->  bytes([0,0,0,0])
> 
> No, bytes(0L) --> TypeError because 0L doesn't implement
> the iterator protocol or the buffer interface.

It wouldn't need it if it was a direct C memory copy.

> I suppose long integers might be enhanced to support the
> buffer interface in 3.0, but that doesn't seem like a good
> idea, because the bytes you got that way would depend on
> the internal representation of long integers. In particular,

Since some longs will be of different length, yes a bytes(0L) could give 
differing results on different platforms, but it will always give the 
same result on the platform it is run on. I actually think this is a 
plus and not a problem. If you are using Python to implement a byte 
interface you need to *know* it is different, not have it hidden.

 bytesize = len(bytes(0L))  # find how long a long is


Cheers,
   Ronald Adam


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Adam Olsen
On 2/15/06, Ron Adam <[EMAIL PROTECTED]> wrote:
> Greg Ewing wrote:
> > Ron Adam wrote:
> >> b = bytes(0L) ->  bytes([0,0,0,0])
> >
> > No, bytes(0L) --> TypeError because 0L doesn't implement
> > the iterator protocol or the buffer interface.
>
> It wouldn't need it if it was a direct C memory copy.
>
> > I suppose long integers might be enhanced to support the
> > buffer interface in 3.0, but that doesn't seem like a good
> > idea, because the bytes you got that way would depend on
> > the internal representation of long integers. In particular,
>
> Since some longs will be of different length, yes a bytes(0L) could give
> differing results on different platforms, but it will always give the
> same result on the platform it is run on. I actually think this is a
> plus and not a problem. If you are using Python to implement a byte
> interface you need to *know* it is different, not have it hidden.
>
>  bytesize = len(bytes(0L))  # find how long a long is

I believe you're confusing a C long with a Python long.  A Python long
is implemented as an array and has variable size.

In any case we already have the struct module:

>>> import struct
>>> struct.calcsize('l')
4

--
Adam Olsen, aka Rhamphoryncus
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Bengt Richter
On Tue, 14 Feb 2006 15:14:07 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote:

>On 2/14/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>> > As Phillip guessed, I was indeed thinking about introducing bytes()
>> > sooner than that, perhaps even in 2.5 (though I don't want anything
>> > rushed).
>>
>> Hmm, that is probably going to be too early. As the thread shows
>> there are lots of things to take into account, esp. since if you
>> plan to introduce bytes() in 2.x, the upgrade path to 3.x would
>> have to be carefully planned. Otherwise, we end up introducing
>> a feature which is meant to prepare for 3.x and then we end up
>> causing breakage when the move is finally implemented.
>
>You make a good point. Someone probably needs to write up a new PEP
>summarizing this discussion (or rather, consolidating the agreement
>that is slowly emerging, where there is agreement, and summarizing the
>key open questions).
>
>> > Even in Py3k though, the encoding issue stands -- what if the file
>> > encoding is Unicode? Then using Latin-1 to encode bytes by default
>> > might not by what the user expected. Or what if the file encoding is
>> > something totally different? (Cyrillic, Greek, Japanese, Klingon.)
>> > Anything default but ASCII isn't going to work as expected. ASCII
>> > isn't going to work as expected either, but it will complain loudly
>> > (by throwing a UnicodeError) whenever you try it, rather than causing
>> > subtle bugs later.
>>
>> I think there's a misunderstanding here: in Py3k, all "string"
>> literals will be converted from the source code encoding to
>> Unicode. There are no ambiguities - a Klingon character will still
>> map to the same ordinal used to create the byte content regardless
>> of whether the source file is encoded in UTF-8, UTF-16 or
>> some Klingon charset (are there any ?).
>
>OK, so a string (literal or otherwise) containing a Klingon character
>won't be acceptable to the bytes() constructor in 3.0. It shouldn't be
>in 2.x either then.
>
>I still think that someone who types a file in Latin-1 and enters
>non-ASCII Latin-1 characters in a string literal and then passes it to
>the bytes() constructor might expect to get bytes encoded in Latin-1,
>and someone who types a file in UTF-8 and enters non-ASCII Unicode
>characters might expect to get UTF-8-encoded bytes. Since they can't
>both get what they want, we should disallow both, and only allow
>ASCII.
ISTM this is a good rule for backwards compatibility for the
'...' => u'...' py3k transition. I don't know if you saw my other post,
but I was suggesting that bytes(s_or_u) should be mapped to the integer
values by the current definition of ord for either str or unicode.
UIAM this works when you convert ASCII and will work if you convert
the ASCII string to unicode.

It will also let you use unicode _currently_ to get past the ASCII restriction,
since ord(u) works for all of the first 256 unicode characters.
Using those characters in bytes(u'...') works even if your source encoding is 
utf-8
and contains ascii escapes, e.g.

 >>> utfsrc = """\
 ... # -*- coding: utf-8 -*-
 ... umlaut_os, values = u'\xf6\\xf6', map(ord, u'\xf6\\xf6')
 ... """.decode('latin-1').encode('utf-8')

Hopefully showing on your screen properly:

 >>> print utfsrc.decode('utf-8')
 # -*- coding: utf-8 -*-
 umlaut_os, values = u'ö\xf6', map(ord, u'ö\xf6')

And the repr, where you can see the utf-8 double chars for utf-8 and the \\xf6 
ascii escape:

 >>> print repr(utfsrc)
 "# -*- coding: utf-8 -*-\numlaut_os, values = u'\xc3\xb6\\xf6', map(ord, 
u'\xc3\xb6\\xf6')\n"

compiling the utf-8 source and executing it:

 >>> exec compile(utfsrc,'','exec')

Good results:

 >>> umlaut_os, map(hex, values)
 (u'\xf6\xf6', ['0xf6', '0xf6'])
 >>> print umlaut_os
 öö

So map(s_or_u) works predictably now, and will not break after py3k
unless you use non-ascii in _plain_ str strings now. But in unicode it
should be ok even now.

I think ord is a consistent and handy mapping of characters to bytes,
and the fact that it works for unicode for all 256 characters seems to me
a boon. (So long as no one gets upset that ord(u) _happens_
to match ord(u.encode('latin-1')) ;-)

I didn't see yet where you had ruled against ord mapping of unicode to bytes,
so I am hopeful that you will consider it.

>> Furthermore, by restricting to ASCII you'd also outrule hex escapes
>> which seem to be the natural choice for presenting binary data in
>> literals - the Unicode representation would then only be an
>> implementation detail of the way Python treats "string" literals
>> and a user would certainly expect to find e.g. \x88 in the bytes object
>> if she writes bytes('\x88').
>
>I guess we'l just have to disappoint her. Too bad for the person who
>wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to
>write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO
>and certainly easier than a *mixture* of hex and ASCII like
>'\xabc\xdef'.
>
>> 

Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Jim Jewett
On 2/14/06, Neil Schemenauer wrote:
> People could spell it bytes(s.encode('latin-1'))

Guido wrote:
> At the cost of an extra copying step.

I asked:
> ... why not just add some smarts to the bytes constructor?

Guido wrote:

> ... the VM usually keeps an extra reference
> on the stack so the refcount is never 1. But
> you can't rely on that

I did miss this, but _PyString_Resize seems to
work around it, and I'm not sure that the bytes
object can't be just as intimate.

Even if that is insurmountable, bytes objects
could recognize two states -- one normal, and
one for "I'm delegating to a string, and have to
copy to my own buffer before I actually mutate
anything."

Then a new bytes object would still need its
own header, but the data copying could often
be avoided.

But back to the possibility of not creating
even a new object header...
> the str's underlying array is allocated inline
> with the str header, this require str and
> bytes to have the same object layout. But
> since bytes are mutable, they can't.

Looking at the arraymodule, the only extra
fields in an array are weakrefs, description
(which will no longer be needed) and tracking
for the indirection.  There are even a few extra
bytes leftover that could be used to indicate
that ob_item was redirected later, the way
tables do with small_table.

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Josiah Carlson

Ron Adam <[EMAIL PROTECTED]> wrote:
> Greg Ewing wrote:
> > Ron Adam wrote:
> >> b = bytes(0L) ->  bytes([0,0,0,0])
> > 
> > No, bytes(0L) --> TypeError because 0L doesn't implement
> > the iterator protocol or the buffer interface.
> 
> It wouldn't need it if it was a direct C memory copy.

Yes it would.  Python long integers are stored as arrays of signed
16-bit short ints.  See longintrepr.h from the source.


 - Josiah

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Thomas Wouters
On Wed, Feb 15, 2006 at 01:38:41PM -0500, Jim Jewett wrote:
> On 2/14/06, Neil Schemenauer wrote:
> > People could spell it bytes(s.encode('latin-1'))
> 
> Guido wrote:
> > At the cost of an extra copying step.
> 
> I asked:
> > ... why not just add some smarts to the bytes constructor?
> 
> Guido wrote:
> 
> > ... the VM usually keeps an extra reference
> > on the stack so the refcount is never 1. But
> > you can't rely on that
> 
> I did miss this, but _PyString_Resize seems to
> work around it, and I'm not sure that the bytes
> object can't be just as intimate.

No, _PyString_Resize doesn't work around it. _PyString_Resize only works if
the refcount is exactly one: only the caller has a reference. And by
'caller', I mean 'the calling C function'. Besides that, the caller takes
care to only use _PyString_Resize on strings it created itself.
Theoretically it could 'steal' a reference from someplace else, but I
haven't seen _PyString_Resize-using code do that, and it would be a recipe
for disaster.

-- 
Thomas Wouters <[EMAIL PROTECTED]>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Greg Ewing
Ron Adam wrote:

> I was presuming it would be done in C code and it will just need a 
> pointer to the first byte, memchr(), and then read n bytes directly into 
> a new memory range via  memcpy().

If the object supports the buffer interface, it can be
done that way. But if not, it would seem to make sense to
fall back on the iterator protocol.

> However, if it's done with a Python iterator and then each item is 
> translated to bytes in a sequence, (much slower), an encoding will need 
> to be known for it to work correctly.

No, it won't. When using the bytes(x) form, encoding has
nothing to do with it. It's purely a conversion from one
representation of an array of 0..255 to another.

When you *do* want to perform encoding, you use
bytes(u, encoding) and say what encoding you want
to use.

> Unfortunately Unicode strings 
> don't set an attribute to indicate it's own encoding.

I think you don't understand what an encoding is. Unicode
strings don't *have* an encoding, because theyre not encoded!
Encoding is what happens when you go from a unicode string
to something else.

> Since some longs will be of different length, yes a bytes(0L) could give 
> differing results on different platforms,

It's not just a matter of length. I'm not sure of the
details, but I believe longs are currently stored as an
array of 16-bit chunks, of which only 15 bits are used.
I'm having trouble imagining a use for low-level access
to that format, other than just treating it as an opaque
lump of data for turning back into a long later -- in
which case why not just leave it as a long in the first
place.

Greg

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Ron Adam
Greg Ewing wrote:

> I think you don't understand what an encoding is. Unicode
> strings don't *have* an encoding, because theyre not encoded!
> Encoding is what happens when you go from a unicode string
> to something else.

Ah.. ok, my mental picture was a bit off.  I had this reversed somewhat.


> It's not just a matter of length. I'm not sure of the
> details, but I believe longs are currently stored as an
> array of 16-bit chunks, of which only 15 bits are used.
> I'm having trouble imagining a use for low-level access
> to that format, other than just treating it as an opaque
> lump of data for turning back into a long later -- in
> which case why not just leave it as a long in the first
> place.

I had laps thinking Pythons longs are the same as c longs. I know 
Pythons longs can get much much bigger.

The idea was to be able to show the byte data as is in what ever form it 
takes and not try to change it, weather it's longs, floats, strings, etc.

Cheers,
 Ron




___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Aahz
On Tue, Feb 14, 2006, Guido van Rossum wrote:
>
> Anyway, I'm now convinced that bytes should act as an array of ints,
> where the ints are restricted to range(0, 256) but have type int.

range(0, 255)?
-- 
Aahz ([EMAIL PROTECTED])   <*> http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming,
is not worth knowing."  --Alan Perlis
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Bob Ippolito

On Feb 15, 2006, at 6:35 PM, Aahz wrote:

> On Tue, Feb 14, 2006, Guido van Rossum wrote:
>>
>> Anyway, I'm now convinced that bytes should act as an array of ints,
>> where the ints are restricted to range(0, 256) but have type int.
>
> range(0, 255)?

No, Guido was correct.  range(0, 256) is [0, 1, 2, ..., 255].

-bob

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

2006-02-15 Thread Aahz
On Wed, Feb 15, 2006, Bob Ippolito wrote:
> On Feb 15, 2006, at 6:35 PM, Aahz wrote:
>> On Tue, Feb 14, 2006, Guido van Rossum wrote:
>>>
>>> Anyway, I'm now convinced that bytes should act as an array of ints,
>>> where the ints are restricted to range(0, 256) but have type int.
>>
>> range(0, 255)?
> 
> No, Guido was correct.  range(0, 256) is [0, 1, 2, ..., 255].

My mistake -- I wasn't thinking of the literal Python function.
-- 
Aahz ([EMAIL PROTECTED])   <*> http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming,
is not worth knowing."  --Alan Perlis
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [Was:Re: release plan for 2.5 ?]

2006-02-27 Thread Fredrik Lundh
Just van Rossum wrote:

> > If bytes support the buffer interface, we get another interesting
> > issue -- regular expressions over bytes. Brr.
>
> We already have that:
>
>   >>> import re, array
>   >>> re.search('\2', array.array('B', [1, 2, 3, 4])).group()
>   array('B', [2])
>   >>>
>
> Not sure whether to blame array or re, though...

SRE.  iirc, the design rationale was to support RE over mmap'ed regions.





___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com