[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Fri, 10 Feb 2006 21:35:26 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote:

>> On Sat, 11 Feb 2006 05:08:09 + (UTC), Neil Schemenauer <[EMAIL PROTECTED]>
>> >The backwards compatibility problems *seem* to be relatively minor.
>> >I only found one instance of breakage in the standard library. Note
>> >that my patch does not change PyObject_Str(); that would break
>> >massive amounts of code. Instead, I introduce a new function:
>> >PyString_New(). I'm not crazy about the name but I couldn't think
>> >of anything better.
>
>On 2/10/06, Bengt Richter <[EMAIL PROTECTED]> wrote:
>> Should this not be coordinated with PEP 332?
>
>Probably.. But that PEP is rather incomplete. Wanna work on fixing that?
>

I'd be glad to add my thoughts, but first of course it's Skip's PEP, and Martin casts a long shadow when it comes to character coding issues that I suspect will have to be considered.

(E.g., if there is a b'...' literal for bytes, the actual characters of the source code itself that the literal is being expressed in could be ascii or latin-1 or utf-8 or utf16le a la Microsoft, etc. UIAM, I read that the source is at least temporarily normalized to Unicode, and then re-encoded (except now for string literals?) per coding cookie or other encoding inference. (I may be out of date, gotta catch up).

If one way or the other a string literal is in Unicode, then presumably so is a byte string b'...' literal -- i.e. internally u"b'...'" just before being turned into bytes.

Should that then be an internal straight u"b'...'".encode('byte') with default ascii + escapes for non-ascii and non-printables, to define the full 8 bits without encoding error?

Should unicode be encodable into byte via a specific encoding? E.g., u'abc'.encode('byte','latin1'), to distinguish producing a mutable byte string vs an immutable str type as with u'abc'.encode('latin1'). (but how does this play with str being able to produce unicode? And when do these changes happen?)

I guess I'm getting ahead of myself ;-)

So I would first ask Skip what he'd like to do, and Martin for some hints on reading, to avoid going down paths he already knows lead to brick walls ;-) And I need to think more about PEP 349.

I would propose to do the reading they suggest, and edit up a new version of pep-0332.txt that anyone could then improve further. I don't know about an early deadline. I don't want to over-commit, as time and energies vary. OTOH, as you've noticed, I could be spending my time more effectively ;-)

I changed the thread title, and will wait for some signs from you, Skip, Martin, Neil, and I don't know who else might be interested...

Regards,
Bengt Richter

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Neil Schemenauer wrote:
> People could spell it bytes(s.encode('latin-1')) in order to make it
> work in 2.X.

Guido wrote:
> At the cost of an extra copying step.

That sounds like an implementation issue. If it is important enough to matter, then why not just add some smarts to the bytes constructor?

If the argument is a str, and the constructor owns the only reference, then go ahead and use the argument's own underlying array; the string itself will be deallocated when (or before) the constructor returns, so no one else can use it expecting an immutable.

-jJ
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
One recommendation: for starters, I'd much rather see the bytes type standardized without a literal notation. There should be lots of ways to create bytes objects from string objects, with specific explicit encodings, and those should suffice, at least initially.

I also wonder if having a b"..." literal would just add more confusion -- bytes are not characters, but b"..." makes it appear as if they are.

--Guido

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> One recommendation: for starters, I'd much rather see the bytes type
> standardized without a literal notation. There should be lots of
> ways to create bytes objects from string objects, with specific
> explicit encodings, and those should suffice, at least initially.
>
> I also wonder if having a b"..." literal would just add more confusion
> -- bytes are not characters, but b"..." makes it appear as if they
> are.

Agreed. Given that we have a source code encoding which would need to be honored, b"..." doesn't really make all that much sense (unless you always use hex escapes).

Note that if we drop the string type, all codecs which currently return strings will have to return bytes. This gives you a pretty exhaustive way of defining your binary literals in Python :-)

Here's one:

    data = "abc".encode("latin-1")

To simplify things we might want to have bytes("abc") do the above encoding per default.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Feb 13 2006)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
>One recommendation: for starters, I'd much rather see the bytes type
>standardized without a literal notation. There should be lots of
>ways to create bytes objects from string objects, with specific
>explicit encodings, and those should suffice, at least initially.
>
>I also wonder if having a b"..." literal would just add more confusion
>-- bytes are not characters, but b"..." makes it appear as if they
>are.

Why not just have the constructor be:

     bytes(initializer [,encoding])

Where initializer must be either an iterable of suitable integers, or a unicode/string object. If the latter (i.e., it's a basestring), the encoding argument would then be required. Then, there's no need for special codec support for the bytes type, since you call bytes on the thing to be encoded. And of course, no need for a 'b' literal.
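As a rough illustration of the constructor rule proposed above, here is a minimal sketch written in modern Python 3, which did not exist when this thread was written. The helper name make_bytes is hypothetical and is not part of any proposal in the thread; it only mimics the idea that an iterable of integers needs no encoding while a text string requires one.

```python
def make_bytes(initializer, encoding=None):
    """Hypothetical helper mimicking the proposed bytes(initializer [,encoding])."""
    if isinstance(initializer, str):
        # Text input: the proposal makes the encoding argument mandatory.
        if encoding is None:
            raise TypeError("an encoding is required for string input")
        return initializer.encode(encoding)
    if encoding is not None:
        # An encoding is meaningless for an iterable of integers.
        raise TypeError("encoding is only valid for string input")
    # Iterable of integers in range(256); bytes() raises ValueError otherwise.
    return bytes(initializer)


print(make_bytes([97, 98, 99]))      # b'abc'
print(make_bytes("abc", "latin-1"))  # b'abc'
```

Calling make_bytes("abc") without an encoding raises TypeError in this sketch, which is the behaviour the proposal asks for.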
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
> >One recommendation: for starters, I'd much rather see the bytes type
> >standardized without a literal notation. There should be lots of
> >ways to create bytes objects from string objects, with specific
> >explicit encodings, and those should suffice, at least initially.
> >
> >I also wonder if having a b"..." literal would just add more confusion
> >-- bytes are not characters, but b"..." makes it appear as if they
> >are.
>
> Why not just have the constructor be:
>
>      bytes(initializer [,encoding])
>
> Where initializer must be either an iterable of suitable integers, or a
> unicode/string object. If the latter (i.e., it's a basestring), the
> encoding argument would then be required. Then, there's no need for
> special codec support for the bytes type, since you call bytes on the thing
> to be encoded. And of course, no need for a 'b' literal.

It'd be cruel and unusual punishment though to have to write

    bytes("abc", "Latin-1")

I propose that the default encoding (for basestring instances) ought to be "ascii" just like everywhere else. (Meaning, it should really be the system default encoding, which defaults to "ascii" and is intentionally hard to change.)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
>>> One recommendation: for starters, I'd much rather see the bytes type
>>> standardized without a literal notation. There should be lots of
>>> ways to create bytes objects from string objects, with specific
>>> explicit encodings, and those should suffice, at least initially.
>>>
>>> I also wonder if having a b"..." literal would just add more confusion
>>> -- bytes are not characters, but b"..." makes it appear as if they
>>> are.
>> Why not just have the constructor be:
>>
>>      bytes(initializer [,encoding])
>>
>> Where initializer must be either an iterable of suitable integers, or a
>> unicode/string object. If the latter (i.e., it's a basestring), the
>> encoding argument would then be required. Then, there's no need for
>> special codec support for the bytes type, since you call bytes on the thing
>> to be encoded. And of course, no need for a 'b' literal.
>
> It'd be cruel and unusual punishment though to have to write
>
>     bytes("abc", "Latin-1")
>
> I propose that the default encoding (for basestring instances) ought
> to be "ascii" just like everywhere else. (Meaning, it should really be
> the system default encoding, which defaults to "ascii" and is
> intentionally hard to change.)

We're talking about Py3k here: "abc" will be a Unicode string, so why restrict the conversion to 7 bits when you can have 8 bits without any conversion problems?

While we're at it: I'd suggest that we remove the auto-conversion from bytes to Unicode in Py3k and the default encoding along with it.

In Py3k the standard lib will have to be Unicode compatible anyway and string parser markers like "s#" will have to go away as well, so there's not much need for this anymore.

(Maybe a bit radical, but I guess that's what Py3k is meant for.)

--
Marc-Andre Lemburg
eGenix.com
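To make the 7-bit versus 8-bit point concrete, here is a small sketch in modern Python 3, where "abc" is already a Unicode string as anticipated above. It uses only the built-in str.encode() method; the sample text is an arbitrary choice made for illustration.

```python
text = "caf\u00e9"  # 'café'; U+00E9 lies above the 7-bit ASCII range

# Latin-1 maps code points U+0000..U+00FF directly onto byte values 0..255.
latin1_bytes = text.encode("latin-1")
print(latin1_bytes)  # b'caf\xe9'

# ASCII only covers U+0000..U+007F, so the same text is rejected.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```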
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 10:55 PM 2/13/2006 +0100, M.-A. Lemburg wrote:
>Guido van Rossum wrote:
> > On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> >> At 09:55 AM 2/13/2006 -0800, Guido van Rossum wrote:
> >>> One recommendation: for starters, I'd much rather see the bytes type
> >>> standardized without a literal notation. There should be lots of
> >>> ways to create bytes objects from string objects, with specific
> >>> explicit encodings, and those should suffice, at least initially.
> >>>
> >>> I also wonder if having a b"..." literal would just add more confusion
> >>> -- bytes are not characters, but b"..." makes it appear as if they
> >>> are.
> >> Why not just have the constructor be:
> >>
> >>      bytes(initializer [,encoding])
> >>
> >> Where initializer must be either an iterable of suitable integers, or a
> >> unicode/string object. If the latter (i.e., it's a basestring), the
> >> encoding argument would then be required. Then, there's no need for
> >> special codec support for the bytes type, since you call bytes on the thing
> >> to be encoded. And of course, no need for a 'b' literal.
> >
> > It'd be cruel and unusual punishment though to have to write
> >
> >     bytes("abc", "Latin-1")
> >
> > I propose that the default encoding (for basestring instances) ought
> > to be "ascii" just like everywhere else. (Meaning, it should really be
> > the system default encoding, which defaults to "ascii" and is
> > intentionally hard to change.)
>
>We're talking about Py3k here: "abc" will be a Unicode string,
>so why restrict the conversion to 7 bits when you can have 8 bits
>without any conversion problems ?

Actually, I thought we were talking about adding bytes() in 2.5.

However, now that you've brought this up, it actually makes perfect sense to just use latin-1 as the effective encoding for both strings and unicode. In Python 2.x, strings are byte strings by definition, so it's only in 3.0 that an encoding would be required. And again, latin1 is a reasonable, roundtrippable default encoding.

So, it sounds like making the encoding default to latin-1 would be a reasonably safe approach in both 2.x and 3.x.

>While we're at it: I'd suggest that we remove the auto-conversion
>from bytes to Unicode in Py3k and the default encoding along with
>it. In Py3k the standard lib will have to be Unicode compatible
>anyway and string parser markers like "s#" will have to go away
>as well, so there's not much need for this anymore.

I thought all this was already in the plan for 3.0, but maybe I assume too much. :)
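The claim that latin-1 is roundtrippable can be checked directly. The following is a minimal sketch in modern Python 3 (an assumption; the thread predates it) that pushes every possible byte value through a decode/encode cycle. The variable names are chosen only for this illustration.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding with latin-1 never fails: each byte value becomes the code
# point with the same number.
as_text = all_bytes.decode("latin-1")

# Encoding the result with latin-1 gives back the original bytes.
round_tripped = as_text.encode("latin-1")
print(round_tripped == all_bytes)  # True

# ASCII cannot do this: byte values 0x80..0xFF are not valid ASCII.
try:
    all_bytes.decode("ascii")
    ascii_round_trips = True
except UnicodeDecodeError:
    ascii_round_trips = False
print(ascii_round_trips)  # False
```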
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Phillip J. Eby wrote:
>>>> Why not just have the constructor be:
>>>>
>>>>      bytes(initializer [,encoding])
>>>>
>>>> Where initializer must be either an iterable of suitable integers, or a
>>>> unicode/string object. If the latter (i.e., it's a basestring), the
>>>> encoding argument would then be required. Then, there's no need for
>>>> special codec support for the bytes type, since you call bytes on the
>>>> thing to be encoded. And of course, no need for a 'b' literal.
>>> It'd be cruel and unusual punishment though to have to write
>>>
>>>     bytes("abc", "Latin-1")
>>>
>>> I propose that the default encoding (for basestring instances) ought
>>> to be "ascii" just like everywhere else. (Meaning, it should really be
>>> the system default encoding, which defaults to "ascii" and is
>>> intentionally hard to change.)
>> We're talking about Py3k here: "abc" will be a Unicode string,
>> so why restrict the conversion to 7 bits when you can have 8 bits
>> without any conversion problems ?
>
> Actually, I thought we were talking about adding bytes() in 2.5.

Then we'd need to make the "ascii" encoding assumption again, just like Guido proposed.

> However, now that you've brought this up, it actually makes perfect sense
> to just use latin-1 as the effective encoding for both strings and
> unicode. In Python 2.x, strings are byte strings by definition, so it's
> only in 3.0 that an encoding would be required. And again, latin1 is a
> reasonable, roundtrippable default encoding.

It is. However, it's not a reasonable assumption of the default encoding since there are many encodings out there that special case the characters 0x80-0xFF, hence the choice of using ASCII as default encoding in Python.

The conversion from Unicode to bytes is different in this respect, since you are converting from a "bigger" type to a "smaller" one. Choosing latin-1 as default for this conversion would give you all 8 bits, instead of just 7 bits that ASCII provides.

> So, it sounds like making the encoding default to latin-1 would be a
> reasonably safe approach in both 2.x and 3.x.

Reasonable for bytes(): yes. In general: no.

>> While we're at it: I'd suggest that we remove the auto-conversion
>> from bytes to Unicode in Py3k and the default encoding along with
>> it. In Py3k the standard lib will have to be Unicode compatible
>> anyway and string parser markers like "s#" will have to go away
>> as well, so there's not much need for this anymore.
>
> I thought all this was already in the plan for 3.0, but maybe I assume too
> much. :)

Wouldn't want to wait for Py4D :-)

--
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> > It'd be cruel and unusual punishment though to have to write
> >
> >     bytes("abc", "Latin-1")
> >
> > I propose that the default encoding (for basestring instances) ought
> > to be "ascii" just like everywhere else. (Meaning, it should really be
> > the system default encoding, which defaults to "ascii" and is
> > intentionally hard to change.)
>
> We're talking about Py3k here: "abc" will be a Unicode string,
> so why restrict the conversion to 7 bits when you can have 8 bits
> without any conversion problems ?

As Phillip guessed, I was indeed thinking about introducing bytes() sooner than that, perhaps even in 2.5 (though I don't want anything rushed).

Even in Py3k though, the encoding issue stands -- what if the file encoding is Unicode? Then using Latin-1 to encode bytes by default might not be what the user expected. Or what if the file encoding is something totally different? (Cyrillic, Greek, Japanese, Klingon.) Any default other than ASCII isn't going to work as expected. ASCII isn't going to work as expected either, but it will complain loudly (by throwing a UnicodeError) whenever you try it, rather than causing subtle bugs later.

> While we're at it: I'd suggest that we remove the auto-conversion
> from bytes to Unicode in Py3k and the default encoding along with
> it.

I'm not sure which auto-conversion you're talking about, since there is no bytes type yet. If you're talking about the auto-conversion from str to unicode: the bytes type should not be assumed to have *any* properties that the current str type has, and that includes auto-conversion.

> In Py3k the standard lib will have to be Unicode compatible
> anyway and string parser markers like "s#" will have to go away
> as well, so there's not much need for this anymore.
>
> (Maybe a bit radical, but I guess that's what Py3k is meant for.)

Right.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
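The difference between a default that fails loudly and one that quietly produces unexpected bytes can be shown with a short sketch. This uses modern Python 3 (an assumption; the thread predates it) and an arbitrary sample character; the scenario of a user who expects UTF-8 output is likewise chosen only for illustration.

```python
text = "\u00e9"  # 'é'

# What a user working in a UTF-8 environment would expect to get.
expected_utf8 = text.encode("utf-8")
# What a Latin-1 default would silently produce instead.
silent_latin1 = text.encode("latin-1")
print(expected_utf8)                   # b'\xc3\xa9'
print(silent_latin1)                   # b'\xe9'
print(expected_utf8 == silent_latin1)  # False: no error, just different bytes

# An ASCII default refuses the conversion instead of guessing.
try:
    text.encode("ascii")
    failed_loudly = False
except UnicodeEncodeError:  # a subclass of UnicodeError
    failed_loudly = True
print(failed_loudly)  # True
```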
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> Actually, I thought we were talking about adding bytes() in 2.5.

I was.

> However, now that you've brought this up, it actually makes perfect sense
> to just use latin-1 as the effective encoding for both strings and
> unicode. In Python 2.x, strings are byte strings by definition, so it's
> only in 3.0 that an encoding would be required. And again, latin1 is a
> reasonable, roundtrippable default encoding.
>
> So, it sounds like making the encoding default to latin-1 would be a
> reasonably safe approach in both 2.x and 3.x.

I disagree. IMO the same reasons why we don't do this now for the conversion between str and unicode stand for bytes.

> >While we're at it: I'd suggest that we remove the auto-conversion
> >from bytes to Unicode in Py3k and the default encoding along with
> >it. In Py3k the standard lib will have to be Unicode compatible
> >anyway and string parser markers like "s#" will have to go away
> >as well, so there's not much need for this anymore.

I don't know yet what the C API will look like in 3.0. But it may well have to support auto-conversion from Unicode to char* using some system default encoding (e.g. the Windows default code page?) in order to be able to conveniently wrap OS APIs that use char* instead of some sort of Unicode (and each OS has its own way of interpreting char* as Unicode -- I believe Apple uses UTF-8?).

> I thought all this was already in the plan for 3.0, but maybe I assume too
> much. :)

In Py3k, I can see two reasonable approaches to conversion between strings (Unicode) and bytes: always require an explicit encoding, or assume ASCII. Anything else is asking for trouble IMO.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 12:03 AM 2/14/2006 +0100, M.-A. Lemburg wrote:
>The conversion from Unicode to bytes is different in this
>respect, since you are converting from a "bigger" type to
>a "smaller" one. Choosing latin-1 as default for this
>conversion would give you all 8 bits, instead of just 7
>bits that ASCII provides.

I was just pointing out that since byte strings are bytes by definition, then simply putting those bytes in a bytes() object doesn't alter the existing encoding. So, using latin-1 when converting a string to bytes actually seems like the One Obvious Way to do it.

I'm so accustomed to being wary of encoding issues that the idea doesn't *feel* right at first - I keep going, "but you can't know what encoding those bytes are". Then I go, Duh, that's the point. If you convert str->bytes, there's no conversion and no interpretation - neither the str nor the bytes object knows its encoding, and that's okay. So str(bytes_object) (in 2.x) should also just turn it back to a normal bytestring.

In fact, the 'encoding' argument seems useless in the case of str objects, and it seems it should default to latin-1 for unicode objects. The only use I see for having an encoding for a 'str' would be to allow confirming that the input string in fact is valid for that encoding. So, "bytes(some_str,'ascii')" would be an assertion that some_str must be valid ASCII.

> > So, it sounds like making the encoding default to latin-1 would be a
> > reasonably safe approach in both 2.x and 3.x.
>
>Reasonable for bytes(): yes. In general: no.

Right, I was only talking about bytes(). For 3.0, the type formerly known as "str" won't exist, so only the Unicode part will be relevant then.
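The idea of using an encoding argument purely as a validity check on byte-oriented data can be sketched as follows. This is written in modern Python 3 (an assumption), where the old byte-oriented str is spelled bytes; the helper name assert_ascii is hypothetical and only stands in for the bytes(some_str, 'ascii') call discussed above.

```python
def assert_ascii(data):
    """Hypothetical helper: return data unchanged if it is valid ASCII.

    No conversion takes place; the decode is done only for its side
    effect of raising UnicodeDecodeError on any byte >= 0x80.
    """
    data.decode("ascii")
    return data


print(assert_ascii(b"abc"))  # b'abc'

try:
    assert_ascii(b"abc\xf0")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```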
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 12:03 AM 2/14/2006 +0100, M.-A. Lemburg wrote:
> >The conversion from Unicode to bytes is different in this
> >respect, since you are converting from a "bigger" type to
> >a "smaller" one. Choosing latin-1 as default for this
> >conversion would give you all 8 bits, instead of just 7
> >bits that ASCII provides.
>
> I was just pointing out that since byte strings are bytes by definition,
> then simply putting those bytes in a bytes() object doesn't alter the
> existing encoding. So, using latin-1 when converting a string to bytes
> actually seems like the One Obvious Way to do it.

This actually makes some sense -- bytes(s) where isinstance(s, str) should just copy the data, since we can't know what encoding the user believes it is in anyway. (With the exception of string literals, where it makes sense to assume that the user believes it is in the same encoding as the source code -- but I believe non-ASCII characters in string literals are disallowed anyway, or at least known to cause undefined results in rats.)

> I'm so accustomed to being wary of encoding issues that the idea doesn't
> *feel* right at first - I keep going, "but you can't know what encoding
> those bytes are". Then I go, Duh, that's the point. If you convert
> str->bytes, there's no conversion and no interpretation - neither the str
> nor the bytes object knows its encoding, and that's okay. So
> str(bytes_object) (in 2.x) should also just turn it back to a normal
> bytestring.

You've got me convinced. Scrap my previous responses in this thread.

> In fact, the 'encoding' argument seems useless in the case of str objects,

Right.

> and it seems it should default to latin-1 for unicode objects.

But here I disagree.

> The only
> use I see for having an encoding for a 'str' would be to allow confirming
> that the input string in fact is valid for that encoding. So,
> "bytes(some_str,'ascii')" would be an assertion that some_str must be valid
> ASCII.

We already have ways to assert that a string is ASCII.

> For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> part will be relevant then.

And I think then the encoding should be required or default to ASCII.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Phillip J. Eby wrote:
[snip..]
> In fact, the 'encoding' argument seems useless in the case of str objects,
> and it seems it should default to latin-1 for unicode objects. The only

-1 for having an implicit encode that behaves differently to other implicit encodes/decodes that happen in Python. Life is confusing enough already.

Michael Foord
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
> Phillip J. Eby wrote:
> [snip..]
> > In fact, the 'encoding' argument seems useless in the case of str objects,
> > and it seems it should default to latin-1 for unicode objects. The only
>
> -1 for having an implicit encode that behaves differently to other
> implicit encodes/decodes that happen in Python. Life is confusing enough
> already.

But adding an encoding doesn't help. The str.encode() method always assumes that the string itself is ASCII-encoded, and that's not good enough:

>>> "abc".encode("latin-1")
'abc'
>>> "abc".decode("latin-1")
u'abc'
>>> "abc\xf0".decode("latin-1")
u'abc\xf0'
>>> "abc\xf0".encode("latin-1")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)
>>>

The right way to look at this is, as Phillip says, to consider conversion between str and bytes as not an encoding but a data type change *only*.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
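The failure in the session above comes from a hidden decode step: 2.x str.encode() first decodes the byte string with the default codec (ASCII) and only then encodes. A minimal sketch of that step, written for Python 3, where bytes and text are separate types. The helper names py2_style_encode and plain_type_change are hypothetical and only emulate the behaviour described in the message; they are not part of any library.

```python
def py2_style_encode(raw: bytes, target: str) -> bytes:
    """Emulate 2.x str.encode(): implicit ASCII decode, then encode."""
    text = raw.decode("ascii")  # the hidden step that rejects byte 0xf0
    return text.encode(target)


def plain_type_change(raw: bytes) -> bytes:
    """The 'data type change only' view: copy the bytes, interpret nothing."""
    return bytes(raw)


if __name__ == "__main__":
    print(py2_style_encode(b"abc", "latin-1"))
    try:
        py2_style_encode(b"abc\xf0", "latin-1")
    except UnicodeDecodeError as exc:
        print("implicit ASCII decode failed:", exc.reason)
    print(plain_type_change(b"abc\xf0"))
```

The second helper never touches a codec, which is why it cannot fail on non-ASCII bytes.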
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Mon, 2006-02-13 at 15:44 -0800, Guido van Rossum wrote:
> The right way to look at this is, as Phillip says, to consider
> conversion between str and bytes as not an encoding but a data type
> change *only*.

That sounds right to me too.

-Barry
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
>
>> Phillip J. Eby wrote:
>> [snip..]
>>
>>> In fact, the 'encoding' argument seems useless in the case of str objects,
>>> and it seems it should default to latin-1 for unicode objects. The only
>>>
>> -1 for having an implicit encode that behaves differently to other
>> implicit encodes/decodes that happen in Python. Life is confusing enough
>> already.
>>
>
> But adding an encoding doesn't help. The str.encode() method always
> assumes that the string itself is ASCII-encoded, and that's not good
> enough:
>

Sorry - I meant for the unicode to bytes case. A default encoding that behaves differently to the current implicit encodes/decodes would be confusing IMHO.

I agree that string to bytes shouldn't change the value of the bytes.

The least confusing description of a non-unicode string is 'byte-string'.

Michael Foord

> >>> "abc".encode("latin-1")
> 'abc'
> >>> "abc".decode("latin-1")
> u'abc'
> >>> "abc\xf0".decode("latin-1")
> u'abc\xf0'
> >>> "abc\xf0".encode("latin-1")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> 3: ordinal not in range(128)
> >>>
>
> The right way to look at this is, as Phillip says, to consider
> conversion between str and bytes as not an encoding but a data type
> change *only*.
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 03:23 PM 2/13/2006 -0800, Guido van Rossum wrote: >On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote: > > The only > > use I see for having an encoding for a 'str' would be to allow confirming > > that the input string in fact is valid for that encoding. So, > > "bytes(some_str,'ascii')" would be an assertion that some_str must be valid > > ASCII. > >We already have ways to assert that a string is ASCII. I didn't mean that it was the only purpose. In Python 2.x, practical code has to sometimes deal with "string-like" objects. That is, code that takes either strings or unicode. If such code calls bytes(), it's going to want to include an encoding so that unicode conversions won't fail. But silently ignoring the encoding argument in that case isn't a good idea. Ergo, I propose to permit the encoding to be specified when passing in a (2.x) str object, to allow code that handles both str and unicode to be "str-stable" in 2.x. I'm fine with rejecting an encoding argument if the initializer is not a str or unicode; I just don't want the call signature to vary based on a runtime distinction between str and unicode. And, I don't want the encoding argument to be silently ignored when you pass in a string. If I assert that I'm encoding ASCII (or utf-8 or whatever), then the string should be required to be valid. If I don't pass in an encoding, then I'm good to go. (This is orthogonal to the issue of what encoding is used as a default for conversions from the unicode type, btw.) > > For 3.0, the type formerly known as "str" won't exist, so only the Unicode > > part will be relevant then. > >And I think then the encoding should be required or default to ASCII. The reason I'm arguing for latin-1 is symmetry in 2.x versions only. (In 3.x, there's no str vs. unicode, and thus nothing to be symmetrical.) So, if you invoke bytes() without an encoding on a 2.x basestring, you should get the same result. 
Latin-1 produces "the same result" when viewed in terms of the resulting byte string. If we don't go with latin-1, I'd argue for requiring an encoding for unicode objects in 2.x, because that seems like the only reasonable way to break the symmetry between str and unicode, even though it forces "str-stable" code to specify an encoding. The key is that at least *one* of the signatures needs to be stable in meaning across both str and unicode in 2.x in order to allow unicode-safe, str-stable code to be written. (Again, for 3.x, this issue doesn't come into play because there's only one string type to worry about; what the default is or whether there's a default is therefore entirely up to you.) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
> Sorry - I meant for the unicode to bytes case. A default encoding that
> behaves differently to the current implicit encodes/decodes would be
> confusing IMHO.

And I am in agreement with you there (I think only Phillip argued otherwise).

> I agree that string to bytes shouldn't change the value of the bytes.

It's a deal then.

Can the owner of PEP 332 update the PEP to record these decisions?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote: > I didn't mean that it was the only purpose. In Python 2.x, practical code > has to sometimes deal with "string-like" objects. That is, code that takes > either strings or unicode. If such code calls bytes(), it's going to want > to include an encoding so that unicode conversions won't fail. That sounds like a rather hypothetical example. Have you thought it through? Presumably code that accepts both str and unicode either doesn't care about encodings, but simply returns objects of the same type as the arguments -- and then it's unlikely to want to convert the arguments to bytes; or it *does* care about encodings, and then it probably already has to special-case str vs. unicode because it has to control how str objects are interpreted. > But > silently ignoring the encoding argument in that case isn't a good idea. > > Ergo, I propose to permit the encoding to be specified when passing in a > (2.x) str object, to allow code that handles both str and unicode to be > "str-stable" in 2.x. Again, have you thought this through? What would bytes("abc\xf0", "latin-1") *mean*? Take the string "abc\xf0", interpret it as being encoded in XXX, and then encode from XXX to Latin-1. But what's XXX? As I showed in a previous post, "abc\xf0".encode("latin-1") *fails* because the source for the encoding is assumed to be ASCII. I think we can make this work only when the string in fact only contains ASCII and the encoding maps ASCII to itself (which most encodings do -- but e.g. EBCDIC does not). But I'm not sure how useful that is. > I'm fine with rejecting an encoding argument if the initializer is not a > str or unicode; I just don't want the call signature to vary based on a > runtime distinction between str and unicode. I'm still not sure that this will actually help anyone. > And, I don't want the > encoding argument to be silently ignored when you pass in a string. Agreed. 
> If I
> assert that I'm encoding ASCII (or utf-8 or whatever), then the string
> should be required to be valid.

Defined how? That the string is already in that encoding?

> If I don't pass in an encoding, then I'm
> good to go.
>
> (This is orthogonal to the issue of what encoding is used as a default for
> conversions from the unicode type, btw.)

Right. The issues are completely different!

> > > For 3.0, the type formerly known as "str" won't exist, so only the Unicode
> > > part will be relevant then.
> >
> >And I think then the encoding should be required or default to ASCII.
>
> The reason I'm arguing for latin-1 is symmetry in 2.x versions only. (In
> 3.x, there's no str vs. unicode, and thus nothing to be symmetrical.) So,
> if you invoke bytes() without an encoding on a 2.x basestring, you should
> get the same result. Latin-1 produces "the same result" when viewed in
> terms of the resulting byte string.

Only if you assume the str object is encoded in Latin-1.

Your argument for symmetry would be a lot stronger if we used Latin-1 for the conversion between str and Unicode. But we don't. I like the other interpretation (which I thought was yours too?) much better: str <--> bytes conversions don't use encodings but simply change the type without changing the bytes; conversion between either and unicode works exactly the same, and requires an encoding unless all the characters involved are pure ASCII.

> If we don't go with latin-1, I'd argue for requiring an encoding for
> unicode objects in 2.x, because that seems like the only reasonable way to
> break the symmetry between str and unicode, even though it forces
> "str-stable" code to specify an encoding. The key is that at least *one*
> of the signatures needs to be stable in meaning across both str and unicode
> in 2.x in order to allow unicode-safe, str-stable code to be written.

Using ASCII as the default encoding has the same property -- it can remain stable across the 2.x / 3.0 boundary.

> (Again, for 3.x, this issue doesn't come into play because there's only one > string type to worry about; what the default is or whether there's a > default is therefore entirely up to you.) A nice-to-have property would be that it might be possible to write code that today deals with Unicode and str, but in 3.0 will deal with Unicode and bytes instead. But I'm not sure how likely that is since bytes objects won't have most methods that str and Unicode objects have (like lower(), find(), etc.). There's one property that bytes, str and unicode all share: type(x[0]) == type(x), at least as long as len(x) >= 1. This is perhaps the ultimate test for string-ness. Or should b[0] be an int, if b is a bytes object? That would change things dramatically. There's also the consideration for APIs that, informally, accept either a string or a sequence of objects. Many of these exist, and they are probably all being converted to support unicode as well as str (if it makes sense at all). Should a bytes object be considered as a sequen
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 13, 2006, at 7:09 PM, Guido van Rossum wrote:
> On 2/13/06, Michael Foord <[EMAIL PROTECTED]> wrote:
>> Sorry - I meant for the unicode to bytes case. A default encoding that
>> behaves differently to the current implicit encodes/decodes would be
>> confusing IMHO.
>
> And I am in agreement with you there (I think only Phillip argued
> otherwise).
>
>> I agree that string to bytes shouldn't change the value of the bytes.
>
> It's a deal then.
>
> Can the owner of PEP 332 update the PEP to record these decisions?

So, in python2.X, you have:
- bytes("\x80"), you get a bytestring with a single byte of value 0x80 (when no encoding is specified, and the object is a str, it doesn't try to encode it at all).
- bytes("\x80", encoding="latin-1"), you get an error, because encoding "\x80" into latin-1 implicitly decodes it into a unicode object first, via the system-wide default: ascii.
- bytes(u"\x80"), you get an error, because the default encoding for a unicode string is ascii.
- bytes(u"\x80", encoding="latin-1"), you get a bytestring with a single byte of value 0x80.

In py3k, when the str object is eliminated, then what do you have? Perhaps
- bytes("\x80"), you get an error, encoding is required. There is no such thing as "default encoding" anymore, as there's no str object.
- bytes("\x80", encoding="latin-1"), you get a bytestring with a single byte of value 0x80.

James
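The two py3k cases in the list above can be checked against Python 3 as it was eventually released. This is a sketch that assumes a current CPython 3 interpreter, which postdates this thread; try_bytes is a hypothetical helper introduced only for illustration.

```python
def try_bytes(*args, **kwargs):
    """Return the bytes result, or the name of the exception that was raised."""
    try:
        return bytes(*args, **kwargs)
    except (TypeError, UnicodeError) as exc:
        return type(exc).__name__


# No encoding given for a text string: rejected outright.
no_encoding = try_bytes("\x80")

# Explicit latin-1: U+0080 maps to the single byte 0x80.
latin1 = try_bytes("\x80", encoding="latin-1")

# Explicit ASCII: U+0080 is outside the 7-bit range.
ascii_only = try_bytes("\x80", encoding="ascii")

if __name__ == "__main__":
    print(no_encoding, latin1, ascii_only)
```

The 2.x cases are not shown because the proposed 2.x bytes() constructor discussed here cannot be run on a Python 3 interpreter.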
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, James Y Knight <[EMAIL PROTECTED]> wrote:
> So, in python2.X, you have:
> - bytes("\x80"), you get a bytestring with a single byte of value
> 0x80 (when no encoding is specified, and the object is a str, it
> doesn't try to encode it at all).
> - bytes("\x80", encoding="latin-1"), you get an error, because
> encoding "\x80" into latin-1 implicitly decodes it into a unicode
> object first, via the system-wide default: ascii.
> - bytes(u"\x80"), you get an error, because the default encoding for
> a unicode string is ascii.
> - bytes(u"\x80", encoding="latin-1"), you get a bytestring with a
> single byte of value 0x80.

Yes to all.

> In py3k, when the str object is eliminated, then what do you have?
> Perhaps
> - bytes("\x80"), you get an error, encoding is required. There is no
> such thing as "default encoding" anymore, as there's no str object.
> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
> single byte of value 0x80.

Yes to both again.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> In py3k, when the str object is eliminated, then what do you have?
>> Perhaps
>> - bytes("\x80"), you get an error, encoding is required. There is no
>> such thing as "default encoding" anymore, as there's no str object.
>> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
>> single byte of value 0x80.
>
> Yes to both again.

I haven't been following this discussion about bytes() real closely but I don't think that bytes() should do the encoding. We already have a way to spell that:

    "\x80".encode('latin-1')

Also, I think it would be useful to introduce byte array literals at the same time as the bytes object. That would allow people to use byte arrays without having to get involved with all the silly string encoding confusion.

Neil
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Monday 13 February 2006 21:52, Neil Schemenauer wrote:
> Also, I think it would be useful to introduce byte array literals at
> the same time as the bytes object. That would allow people to use
> byte arrays without having to get involved with all the silly string
> encoding confusion.

    bytes([0, 1, 2, 3])

-Fred

--
Fred L. Drake, Jr.
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
> Guido van Rossum <[EMAIL PROTECTED]> wrote:
> >> In py3k, when the str object is eliminated, then what do you have?
> >> Perhaps
> >> - bytes("\x80"), you get an error, encoding is required. There is no
> >> such thing as "default encoding" anymore, as there's no str object.
> >> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
> >> single byte of value 0x80.
> >
> > Yes to both again.
>
> I haven't been following this discussion about bytes() real closely
> but I don't think that bytes() should do the encoding. We already
> have a way to spell that:
>
> "\x80".encode('latin-1')

But in 2.5 we can't change that to return a bytes object without creating HUGE incompatibilities.

In general I've come to appreciate that there are two ways of converting an object of type A to an object of type B: ask an A instance to convert itself to a B, or ask the type B to create a new instance from an A. Depending on what A and B are, both APIs make sense; sometimes reasons of decoupling require that A can't know about B, in which case you have to use the latter approach; sometimes B can't know about A, in which case you have to use the former.

Even when A == B we sometimes support both APIs: to create a new list from a list a, you can write a[:] or list(a); to create a new dict from a dict d, you can write d.copy() or dict(d). An advantage of the latter API is that there's no confusion about the resulting type -- dict(d) is definitely a dict, and list(a) is definitely a list. Not so for d.copy() or a[:] -- if the input type is another mapping or sequence, it'll probably return an object of that same type. Again, it depends on the application which is better.

I think that bytes(s, <encoding>) is fine, especially for expressing a new type, since it is unambiguous about the result type, and has no backwards compatibility issues.

> Also, I think it would be useful to introduce byte array literals at
> the same time as the bytes object. That would allow people to use
> byte arrays without having to get involved with all the silly string
> encoding confusion.

You missed the part where I said that introducing the bytes type *without* a literal seems to be a good first step. A new type, even built-in, is much less drastic than a new literal (which requires lexer and parser support in addition to everything else).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
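The point about the two conversion styles, and about which one pins down the result type, can be seen with standard containers. A small sketch assuming Python 3; the choice of tuple and collections.OrderedDict as the "other sequence" and "other mapping" is mine, not from the message.

```python
from collections import OrderedDict

t = (1, 2, 3)
od = OrderedDict(a=1, b=2)

# "Ask the instance": the result follows the input's own type.
sliced = t[:]        # still a tuple
copied = od.copy()   # still an OrderedDict

# "Ask the target type": the result type is fixed by the constructor.
built = list(t)      # definitely a list
rebuilt = dict(od)   # definitely a dict

if __name__ == "__main__":
    for value in (sliced, copied, built, rebuilt):
        print(type(value).__name__, value)
```

A constructor-style bytes(...) call would sit in the second group: whatever is passed in, the caller knows the result is a bytes object.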
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 13, 2006, at 7:29 PM, Guido van Rossum wrote: > There's one property that bytes, str and unicode all share: type(x[0]) > == type(x), at least as long as len(x) >= 1. This is perhaps the > ultimate test for string-ness. But not perfect, since of course other containers can contain objects of their own type too. But it leads to an interesting issue... > Or should b[0] be an int, if b is a bytes object? That would change > things dramatically. This makes me think I want an unsigned byte type, which b[0] would return. In another thread I think someone mentioned something about fixed width integral types, such that you could have an object that was guaranteed to be 8-bits wide, 16-bits wide, etc. Maybe you also want signed and unsigned versions of each. This may seem like YAGNI to many people, but as I've been working on a tightly embedded/ extended application for the last few years, I've definitely had occasions where I wish I could more closely and more directly model my C values as Python objects (without using the standard workarounds or writing my own C extension types). But anyway, without hyper-generalizing, it's still worth asking whether a bytes type is just a container of byte objects, where the contained objects would be distinct, fixed 8-bit unsigned integral types. > There's also the consideration for APIs that, informally, accept > either a string or a sequence of objects. Many of these exist, and > they are probably all being converted to support unicode as well as > str (if it makes sense at all). Should a bytes object be considered as > a sequence of things, or as a single thing, from the POV of these > types of APIs? Should we try to standardize how code tests for the > difference? (Currently all sorts of shortcuts are being taken, from > isinstance(x, (list, tuple)) to isinstance(x, basestring).) 
I think bytes objects are very much like string objects today -- they're the photons of Python since they can act like either sequences or scalars, depending on the context. For example, we have code that needs to deal with situations where an API can return either a scalar or a sequence of those scalars. So we have a utility function like this:

    def thingiter(obj):
        try:
            it = iter(obj)
        except TypeError:
            yield obj
        else:
            for item in it:
                yield item

Maybe there's a better way to do this, but the most obvious problem is that (for our use cases), this fails for strings because in this context we want strings to act like scalars. So we add a little test just before the "try:" like "if isinstance(obj, basestring): yield obj". But that's yucky.

I don't know what the solution is -- if there /is/ a solution short of special case tests like above, but I think the key observation is that sometimes you want your string to act like a sequence and sometimes you want it to act like a scalar. I suspect bytes objects will be the same way.

-Barry
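For illustration, here is the utility function with the special-case test folded in, as a sketch for Python 3. The scalar_types parameter and its default are assumptions added here. The early return is also an addition: the one-line test quoted in the message, taken literally, would yield the string and then go on to iterate over it as well.

```python
def thingiter(obj, scalar_types=(str, bytes)):
    """Yield obj itself if it is a scalar, otherwise yield its items."""
    if isinstance(obj, scalar_types):
        # Strings and bytes are iterable, but here they count as scalars.
        yield obj
        return
    try:
        it = iter(obj)
    except TypeError:
        # Not iterable at all: a plain scalar.
        yield obj
    else:
        yield from it


if __name__ == "__main__":
    print(list(thingiter("abc")))
    print(list(thingiter([1, 2])))
    print(list(thingiter(5)))
```

In Python 3 as released, indexing a bytes object gives an int (b"ab"[0] is 97), so bytes lost the type(x[0]) == type(x) property while str kept it. Both still need the special case here because both remain iterable.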
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote: >On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote: > > I didn't mean that it was the only purpose. In Python 2.x, practical code > > has to sometimes deal with "string-like" objects. That is, code that takes > > either strings or unicode. If such code calls bytes(), it's going to want > > to include an encoding so that unicode conversions won't fail. > >That sounds like a rather hypothetical example. Have you thought it >through? Presumably code that accepts both str and unicode either >doesn't care about encodings, but simply returns objects of the same >type as the arguments -- and then it's unlikely to want to convert the >arguments to bytes; or it *does* care about encodings, and then it >probably already has to special-case str vs. unicode because it has to >control how str objects are interpreted. Actually, it's the other way around. Code that wants to output uninterpreted bytes right now and accepts either strings or Unicode has to special-case *unicode* -- not str, because str is the only "bytes type" we currently have. This creates an interesting issue in WSGI for Jython, which of course only has one (unicode-based) string type now. Since there's no bytes type in Python in general, the only solution we could come up with was to treat such strings as latin-1: http://www.python.org/peps/pep-0333.html#unicode-issues This is why I'm biased towards latin-1 encoding of unicode to bytes; it's "the same thing" as an uninterpreted string of bytes. I think the difference in our viewpoints is that you're still thinking "string" thoughts, whereas I'm thinking "byte" thoughts. Bytes are just bytes; they don't *have* an encoding. So, if you think of "converting a string to bytes" as meaning "create an array of numerals corresponding to the characters in the string", then this leads to a uniform result whether the characters are in a str or a unicode object. 
In other words, to me, bytes(str_or_unicode) should be treated as: bytes(map(ord, str_or_unicode)) In other words, without an encoding, bytes() should simply treat str and unicode objects *as if they were a sequence of integers*, and produce an error when an integer is out of range. This is a logical and consistent interpretation in the absence of an encoding, because in that case you don't care about the encoding - it's just raw data. If, however, you include an encoding, then you're stating that you want to encode the *meaning* of the string, not merely its integer values. >What would bytes("abc\xf0", "latin-1") *mean*? Take the string >"abc\xf0", interpret it as being encoded in XXX, and then encode from >XXX to Latin-1. But what's XXX? As I showed in a previous post, >"abc\xf0".encode("latin-1") *fails* because the source for the >encoding is assumed to be ASCII. I'm saying that XXX would be the same encoding as you specified. i.e., including an encoding means you are encoding the *meaning* of the string. However, I believe I mainly proposed this as an alternative to having bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I think is probably a saner default. >Your argument for symmetry would be a lot stronger if we used Latin-1 >for the conversion between str and Unicode. But we don't. But that's because we're dealing with its meaning *as a string*, not merely as ordinals in a sequence of bytes. > I like the >other interpretation (which I thought was yours too?) much better: str ><--> bytes conversions don't use encodings by simply change the type >without changing the bytes; I like it better too. The part you didn't like was where MAL and I believe this should be extended to Unicode characters in the 0-255 range also. :) >There's one property that bytes, str and unicode all share: type(x[0]) >== type(x), at least as long as len(x) >= 1. This is perhaps the >ultimate test for string-ness. > >Or should b[0] be an int, if b is a bytes object? 
That would change >things dramatically. +1 for it being an int. Heck, I'd want to at least consider the possibility of introducing a character type (chr?) in Python 3.0, and getting rid of the "iterating a string yields strings" characteristic. I've found it to be a bit of a pain when dealing with heterogeneous nested sequences that contain strings. >There's also the consideration for APIs that, informally, accept >either a string or a sequence of objects. Many of these exist, and >they are probably all being converted to support unicode as well as >str (if it makes sense at all). Should a bytes object be considered as >a sequence of things, or as a single thing, from the POV of these >types of APIs? Should we try to standardize how code tests for the >difference? (Currently all sorts of shortcuts are being taken, from >isinstance(x, (list, tuple)) to isinstance(x, basestring).) I'm inclined to think of certain features at least in terms of the buffer interface, but
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
M.-A. Lemburg wrote:
> We're talking about Py3k here: "abc" will be a Unicode string,
> so why restrict the conversion to 7 bits when you can have 8 bits
> without any conversion problems ?

YAGNI. If you have a need for byte string in source code, it will typically be "random" bytes, which can be nicely used through

    bytes([0x73, 0x9f, 0x44, 0xd2, 0xfb, 0x49, 0xa3, 0x14, 0x8b, 0xee])

For larger blocks, people should use base64.string_to_bytes (which can become a synonym for base64.decodestring in Py3k).

If you have bytes that are meaningful text for some application (say, a wire protocol), it is typically ASCII-Text. No protocol I know of uses non-ASCII characters for protocol information.

Of course, you need a way to get .encode output as bytes somehow, both in 2.5, and in Py3k. I suggest writing

    bytes(s.encode(encoding))

In 2.5, bytes() can be constructed from strings, and will do a conversion; in Py3k, .encode will already return a string, so this will be a no-op.

Regards,
Martin
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Phillip J. Eby wrote:
> I was just pointing out that since byte strings are bytes by definition,
> then simply putting those bytes in a bytes() object doesn't alter the
> existing encoding. So, using latin-1 when converting a string to bytes
> actually seems like the One Obvious Way to do it.

This is a misconception. In Python 2.x, the type str already *is* a bytes type. So if S is an instance of 2.x str, bytes(S) does not need to do any conversion. You don't need to assume it is latin-1: it's already bytes.

> In fact, the 'encoding' argument seems useless in the case of str objects,
> and it seems it should default to latin-1 for unicode objects.

I agree with the former, but not with the latter. There shouldn't be a conversion of Unicode objects to bytes at all. If you want bytes from a Unicode string U, write

    bytes(U.encode(encoding))

Regards,
Martin
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
>>In py3k, when the str object is eliminated, then what do you have?
>>Perhaps
>>- bytes("\x80"), you get an error, encoding is required. There is no
>>such thing as "default encoding" anymore, as there's no str object.
>>- bytes("\x80", encoding="latin-1"), you get a bytestring with a
>>single byte of value 0x80.
>
> Yes to both again.

Please reconsider, and don't give bytes() an encoding= argument. It doesn't need one. In Python 3, people should write

    "\x80".encode("latin-1")

if they absolutely want to, although they better write

    bytes([0x80])

Now, the first form isn't valid in 2.5, but

    bytes(u"\x80".encode("latin-1"))

could work in all versions.

Regards,
Martin
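The spellings suggested in this message can be compared directly. A sketch assuming a current Python 3 interpreter, which postdates the thread; the variable names are illustrative only.

```python
# Encode on the text side, as the message recommends.
encoded = "\x80".encode("latin-1")

# Wrapping the already-encoded result in bytes() just copies it.
wrapped = bytes(encoded)

# Building the value from integers involves no codec at all.
from_ints = bytes([0x80])

if __name__ == "__main__":
    print(encoded, wrapped, from_ints)
```

All three produce the same single-byte value, so the disagreement in the thread is about where the encoding name should be written, not about the resulting bytes.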
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> M.-A. Lemburg wrote:
> > We're talking about Py3k here: "abc" will be a Unicode string,
> > so why restrict the conversion to 7 bits when you can have 8 bits
> > without any conversion problems ?
>
> YAGNI. If you have a need for byte string in source code, it will
> typically be "random" bytes, which can be nicely used through
>
> bytes([0x73, 0x9f, 0x44, 0xd2, 0xfb, 0x49, 0xa3, 0x14, 0x8b, 0xee])
>
> For larger blocks, people should use base64.string_to_bytes (which
> can become a synonym for base64.decodestring in Py3k).
>
> If you have bytes that are meaningful text for some application
> (say, a wire protocol), it is typically ASCII-Text. No protocol
> I know of uses non-ASCII characters for protocol information.

What would that imply for repr()? To support eval(repr(x)) it would have to produce whatever format the source code includes to begin with. If I understand correctly there are three main candidates:

1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x
2. Direct copying to str/unicode if it's only ascii values, switching to a list of hex literals if there are any non-ascii values
3. b"foo" literal with ascii for all ascii characters (other than \ and "), \xFF for individual characters that aren't ascii

Given the choice I prefer the third option, with the second option as my runner-up. The first option just screams "silent errors" to me.

--
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 14, 2006, at 12:20 AM, Phillip J. Eby wrote: > bytes(map(ord, str_or_unicode)) > > In other words, without an encoding, bytes() should simply treat > str and > unicode objects *as if they were a sequence of integers*, and > produce an > error when an integer is out of range. This is a logical and > consistent > interpretation in the absence of an encoding, because in that case you > don't care about the encoding - it's just raw data. If you're talking about "raw data", then make bytes(unicodestring) produce what buffer(unicodestring) currently does -- something completely and utterly worthless. :) [it depends on how you compiled python and what endianness your system has.] There really is no case where you don't care about the encoding...there is always a specific desired output encoding, and you have to think about what encoding that is. The argument that latin-1 is a sensible default just because you can convert to latin-1 by chopping off the upper 3 bytes of a unicode character's ordinal position is not convincing; you're still doing an encoding operation, it just happens to be computationally easy. That Jython programs have to pretend that unicode strings are an appropriate way to store bytes, and thus often have to do fake "latin-1" conversions which are really no such thing, doesn't make a convincing argument either. Using unicode strings to store bytes read from or written to a socket is really just broken. Actually having any default encoding at all is IMO a poor idea, but as python has one at the moment (ascii), might as well keep using it for consistency until it's eliminated (sys.setdefaultencoding ('undefined') is my friend.) James ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
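The claim that "no encoding" is really latin-1 in disguise can be checked directly. This is a minimal sketch assuming a Python 3 interpreter (which postdates this thread); `raw_ordinals` is a hypothetical helper name standing in for the bytes(map(ord, ...)) proposal quoted above.

```python
# Assumption: Python 3. raw_ordinals is a hypothetical name for the
# "treat the string as a sequence of integers" proposal.

def raw_ordinals(text):
    """Build bytes from each character's ordinal, with no named encoding."""
    return bytes(map(ord, text))

# For ordinals below 256 the result is exactly what latin-1 produces.
same_as_latin1 = raw_ordinals("abc\xf0") == "abc\xf0".encode("latin-1")

# Anything above 255 cannot fit in a byte and is rejected.
try:
    raw_ordinals("\u20ac")
    out_of_range = False
except ValueError:
    out_of_range = True

# A different desired output encoding gives different bytes for the same text.
utf8 = "abc\xf0".encode("utf-8")

print(same_as_latin1, out_of_range, utf8)
```

The last line is the point being argued: the same text yields four bytes under one encoding and five under another, so the caller has to pick one either way.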
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Adam Olsen wrote:
> What would that imply for repr()? To support eval(repr(x))

I don't think eval(repr(x)) needs to be supported for the bytes type. However, if that is desirable, it should return something like

    bytes([1,2,3])

Regards,
Martin
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote: > But adding an encoding doesn't help. The str.encode() method always > assumes that the string itself is ASCII-encoded, and that's not good > enough: > >>> "abc".encode("latin-1") > 'abc' > >>> "abc".decode("latin-1") > u'abc' > >>> "abc\xf0".decode("latin-1") > u'abc\xf0' > >>> "abc\xf0".encode("latin-1") > Traceback (most recent call last): > File "", line 1, in ? > UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position > 3: ordinal not in range(128) These comments disturb me. I never really understood why (byte) strings grew the 'encode' method, since 8-bit strings *are already encoded*, by their very nature. I mean, I understand it's useful because Python does non-unicode encodings like 'hex', but I don't really understand *why*. The benefits don't seem to outweigh the cost (but that's hindsight.) Directly encoding a (byte) string into a unicode encoding is mostly useless, as you've shown. The only use-case I can think of is translating ASCII in, for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op, unless the system encoding isn't 'ascii' (and that's pretty rare, and not something a Python programmer should depend on.) On the other hand, the fact that (byte) strings have an 'encode' method creates a lot of confusion in unicode-newbies, and causes programs to break only when input is non-ASCII. And non-ASCII input just happens too often and too unpredictably in 'real-world' code, and not enough in European programmers' tests ;P Unicode objects and strings are not the same thing. We shouldn't treat them as the same thing. They share an interface (like lists and tuples do), and if you only use that interface, treating them as the same kind object is mostly ok. They actually share *less* of an interface than lists and tuples, though, as comparing strings to unicode objects can raise an exception, whereas comparing lists to tuples is not expected to. 
For anything less trivial than indexing, slicing and most of the string methods, and anything whatsoever involving non-ASCII (or, rather, non-system-encoding), unicode objects and strings *must* be treated separately. For instance, there is no correct way to do:

    s.split("\x80")

unless you know the type of 's'. If it's unicode, you want u"\x80" instead of "\x80". If it's not unicode, splitting "\x80" may not even be sensible, but you wouldn't know from looking at the code -- maybe it expects a specific encoding (or encoding family), maybe not. As soon as you deal with unicode, you need to really understand the concept, and too many programmers don't. And it's very hard to tell from someone's comments whether they fail to understand or just get some of the terminology wrong; that's why Guido's comments about 'encoding a byte string' and 'what if the file encoding is Unicode' scare me. The unicode/string mixup almost makes me wish Python was statically typed.

So please, please, please don't make the mistake of 'doing something' with the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string. It wouldn't actually be usable except for the same things as 'str.encode': to convert from ASCII to non-ASCII-supersets, or to convert to non-unicode encodings (such as 'hex'.) You can achieve those two by doing, e.g., 'bytes(s.encode('hex'))' if you really want to. Ignoring the encoding (rather than raising an exception) would also allow code to be trivially portable between Python 2.x and Py3K, when "" is actually a unicode object. Not that I'm happy with ignoring anything, but not ignoring would be a bigger crime here.

Oh, and while on the subject, I'm not convinced going all-unicode in Py3K is a good idea either, but maybe I should save that discussion for PyCon.
I'm not thinking "why do we need unicode" anymore (which I did two years ago ;) but I *am* thinking it'll be a big step for 90% of the programmers if they have to grasp unicode and encodings to be able to even do 'raw_input()' sensibly. I know I spend an inordinate amount of time trying to explain the basics on #python on irc.freenode.net already. -- Thomas Wouters <[EMAIL PROTECTED]> Hi! I'm a .signature virus! copy me into your .signature file to help me spread! ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
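The s.split("\x80") problem described in this message can be illustrated with separate byte and text types. The sketch below assumes a Python 3 interpreter, which postdates this thread; the sample values are made up for illustration.

```python
# Assumption: Python 3, where byte strings and text strings are distinct
# types. The sample data is invented for illustration.

data = b"left\x80right"   # a byte string
text = "left\x80right"    # a text (unicode) string

# Each type splits fine on a separator of its own kind.
parts_bytes = data.split(b"\x80")
parts_text = text.split("\x80")

# Mixing the two kinds is an error rather than a silent guess.
try:
    data.split("\x80")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(parts_bytes, parts_text, mixed_ok)
```

The separator has to match the type of the thing being split, which is the "you must know the type of 's'" requirement made explicit.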
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> I also wonder if having a b"..." literal would just add more confusion
> -- bytes are not characters, but b"..." makes it appear as if they
> are.

I'm inclined to agree. Bytes objects are more likely to be used for things which are *not* characters -- if they're characters, they would be better kept in strings or char arrays.

+1 on any eventual bytes literal looking completely different from a string literal.

Greg
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> There's also the consideration for APIs that, informally, accept
> either a string or a sequence of objects.

My preference these days is not to design APIs that way. It's never necessary and it avoids a lot of problems.

Greg
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Barry Warsaw wrote:
> This makes me think I want an unsigned byte type, which b[0] would
> return.

Come to think of it, this is something I don't remember seeing discussed. I've been thinking that bytes[i] would return an integer, but is the intention that it would return another bytes object?

Greg
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> In general I've come to appreciate that there are two ways of
> converting an object of type A to an object of type B: ask an A
> instance to convert itself to a B, or ask the type B to create a new
> instance from an A.

And the difference between the two isn't even always that clear cut. Sometimes you'll ask type B to create a new instance from an A, and then while you're not looking type B cheats and goes and asks the A instance to do it instead ;)

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
--- http://www.boredomandlaziness.org
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Adam Olsen wrote:
> > What would that imply for repr()? To support eval(repr(x))
>
> I don't think eval(repr(x)) needs to be supported for the bytes
> type. However, if that is desirable, it should return something
> like
>
> bytes([1,2,3])

I'm starting to wonder, do we really need anything fancy? Wouldn't it be sufficient to have a way to compactly store 8-bit integers?

In 2.x we could convert unicode like this:

    bytes(ord(c) for c in u"It's...".encode('utf-8'))
    u"It's...".byteencode('utf-8') # Shortcut for above

In 3.0 it changes to:

    "It's...".encode('utf-8')
    u"It's...".byteencode('utf-8') # Same as above, kept for compatibility

Passing a str or unicode directly to bytes() would be an error. repr(bytes(...)) would produce bytes([1,2,3]).

Probably need a __bytes__() method that print can call, or even better a __print__(file) method[0]. The write() methods would of course have to support bytes objects.

I realize it would be odd for the interactive interpreter to print them as a list of ints by default:

    >>> u"It's...".byteencode('utf-8')
    [73, 116, 39, 115, 46, 46, 46]

But maybe it's time we stopped hiding the real nature of bytes from users?

[0] By this I mean calling objects recursively and telling them what file to print to, rather than getting a temporary string from them and printing that. I always wondered why you could do that from C extensions but not from Python code.

--
Adam Olsen, aka Rhamphoryncus
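The "compact store of 8-bit integers" view can be made concrete. This sketch assumes a Python 3 interpreter, which postdates this thread; the byteencode() method proposed above is hypothetical and is not used here, only the standard encode().

```python
# Assumption: Python 3. The proposed byteencode() method never existed,
# so the standard str.encode() is used instead.

payload = "It's...".encode("utf-8")

# Viewed as a sequence, the object is just small integers.
as_ints = list(payload)

# The integer view converts back to an equal bytes object.
rebuilt = bytes(as_ints)

# repr() of the object evaluates back to an equal object.
round_trip = eval(repr(payload)) == payload

print(as_ints, rebuilt == payload, round_trip)
```

The integer list matches the one shown in the interactive example in the message, so the list-of-ints display and the string-like display describe the same data.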
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Greg Ewing <[EMAIL PROTECTED]> writes:
> Guido van Rossum wrote:
>
>> There's also the consideration for APIs that, informally, accept
>> either a string or a sequence of objects.
>
> My preference these days is not to design APIs that
> way. It's never necessary and it avoids a lot of
> problems.

Oh yes.

Cheers,
mwh

--
ZAPHOD: Listen three eyes, don't try to outweird me, I get stranger things than you free with my breakfast cereal.
    -- The Hitch-Hikers Guide to the Galaxy, Episode 7
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 14, 2006, at 6:35 AM, Greg Ewing wrote:
> Barry Warsaw wrote:
>
>> This makes me think I want an unsigned byte type, which b[0] would
>> return.
>
> Come to think of it, this is something I don't
> remember seeing discussed. I've been thinking
> that bytes[i] would return an integer, but is
> the intention that it would return another bytes
> object?

A related question: what would bytes([104, 101, 108, 108, 111, 8004]) return? An exception hopefully. I also think you'd want bytes([x for x in some_bytes_object]) to return an object equal to the original.

-Barry
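Both questions in this message (what indexing returns, and what happens with an out-of-range integer) can be tried out. This is a sketch under the assumption of a Python 3 interpreter, which postdates this thread and reflects one possible answer, not a decision reached in the discussion.

```python
# Assumption: Python 3 behaviour, shown as one possible answer to the
# questions raised; it was not settled in this thread.

b = bytes([104, 101, 108, 108, 111])

first = b[0]     # indexing yields an integer
one = b[0:1]     # slicing yields another bytes object

# Rebuilding from the iterated integers gives an equal object.
round_trip = bytes([x for x in b]) == b

# An integer that does not fit in a byte is rejected.
try:
    bytes([104, 101, 108, 108, 111, 8004])
    rejected = False
except ValueError:
    rejected = True

print(first, one, round_trip, rejected)
```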
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 14, 2006, at 1:52 AM, Martin v. Löwis wrote:
> Phillip J. Eby wrote:
>> I was just pointing out that since byte strings are bytes by
>> definition,
>> then simply putting those bytes in a bytes() object doesn't alter the
>> existing encoding. So, using latin-1 when converting a string to
>> bytes
>> actually seems like the One Obvious Way to do it.
>
> This is a misconception. In Python 2.x, the type str already *is* a
> bytes type. So if S is an instance of 2.x str, bytes(S) does not need
> to do any conversion. You don't need to assume it is latin-1: it's
> already bytes.
>
>> In fact, the 'encoding' argument seems useless in the case of str
>> objects,
>> and it seems it should default to latin-1 for unicode objects.
>
> I agree with the former, but not with the latter. There shouldn't be a
> conversion of Unicode objects to bytes at all. If you want bytes from
> a Unicode string U, write
>
> bytes(U.encode(encoding))

I like it, it makes sense. Unicode strings are simply not allowed as arguments to the byte constructor. Thinking about it, why would it be otherwise? And if you're mixing str-strings and unicode-strings, that means the str-strings you're sometimes giving are actually not byte strings, but character strings anyhow, so you should be encoding those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

Kill the encoding argument, and you're left with:

Python2.X:
- bytes(bytes_object) -> copy constructor
- bytes(str_object) -> copy the bytes from the str to the bytes object
- bytes(sequence_of_ints) -> make bytes with the values of the ints, error on overflow

Python3.X removes str, and most APIs that did return str return bytes instead. Now all you have is:
- bytes(bytes_object) -> copy constructor
- bytes(sequence_of_ints) -> make bytes with the values of the ints, error on overflow

Nice and simple.

James
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
At 11:08 AM 2/14/2006 -0500, James Y Knight wrote:
>On Feb 14, 2006, at 1:52 AM, Martin v. Löwis wrote:
>
>>Phillip J. Eby wrote:
>>>I was just pointing out that since byte strings are bytes by
>>>definition,
>>>then simply putting those bytes in a bytes() object doesn't alter the
>>>existing encoding. So, using latin-1 when converting a string to
>>>bytes
>>>actually seems like the One Obvious Way to do it.
>>
>>This is a misconception. In Python 2.x, the type str already *is* a
>>bytes type. So if S is an instance of 2.x str, bytes(S) does not need
>>to do any conversion. You don't need to assume it is latin-1: it's
>>already bytes.
>>
>>>In fact, the 'encoding' argument seems useless in the case of str
>>>objects,
>>>and it seems it should default to latin-1 for unicode objects.
>>
>>I agree with the former, but not with the latter. There shouldn't be a
>>conversion of Unicode objects to bytes at all. If you want bytes from
>>a Unicode string U, write
>>
>> bytes(U.encode(encoding))
>
>I like it, it makes sense. Unicode strings are simply not allowed as
>arguments to the byte constructor. Thinking about it, why would it be
>otherwise? And if you're mixing str-strings and unicode-strings, that
>means the str-strings you're sometimes giving are actually not byte
>strings, but character strings anyhow, so you should be encoding
>those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling.

Actually, I think you mean:

    if isinstance(s_or_U, str):
        s_or_U = s_or_U.decode('utf-8')

    b = bytes(s_or_U.encode('utf-8'))

Or maybe:

    if isinstance(s_or_U, unicode):
        s_or_U = s_or_U.encode('utf-8')

    b = bytes(s_or_U)

Which is why I proposed that the boilerplate logic get moved *into* the bytes constructor. I think this use case is going to be common in today's Python, but in truth I'm not as sure what bytes() will get used *for* in today's Python. I'm probably overprojecting based on the need to use str objects now, but bytes aren't going to be a replacement for str for a good while anyway.

>Kill the encoding argument, and you're left with:
>
>Python2.X:
>- bytes(bytes_object) -> copy constructor
>- bytes(str_object) -> copy the bytes from the str to the bytes object
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Python3.X removes str, and most APIs that did return str return bytes
>instead. Now all you have is:
>- bytes(bytes_object) -> copy constructor
>- bytes(sequence_of_ints) -> make bytes with the values of the ints,
>error on overflow
>
>Nice and simple.

I could certainly live with that approach, and it certainly rules out all the "when does the encoding argument apply and when should it be an error to pass it" questions. :)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
James Y Knight wrote: > Kill the encoding argument, and you're left with: > > Python2.X: > - bytes(bytes_object) -> copy constructor > - bytes(str_object) -> copy the bytes from the str to the bytes object > - bytes(sequence_of_ints) -> make bytes with the values of the ints, > error on overflow > > Python3.X removes str, and most APIs that did return str return bytes > instead. Now all you have is: > - bytes(bytes_object) -> copy constructor > - bytes(sequence_of_ints) -> make bytes with the values of the ints, > error on overflow > > Nice and simple. Albeit, too simple. The above approach would basically remove the possibility to easily create bytes() from literals in Py3k, since literals in Py3k create Unicode objects, e.g. bytes("123") would not work in Py3k. It's hard to imagine how you'd provide a decent upgrade path for bytes() if you introduce the above semantics in Py2.x. People would start writing bytes("123") in Py2.x and expect it to also work in Py3k, which it wouldn't. To prevent this, you'd have to outrule bytes() construction from strings altogether, which doesn't look like a viable option either. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 14 2006) >>> Python/Zope Consulting and Support ...http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
James Y Knight <[EMAIL PROTECTED]> wrote: > I like it, it makes sense. Unicode strings are simply not allowed as > arguments to the byte constructor. Thinking about it, why would it be > otherwise? And if you're mixing str-strings and unicode-strings, that > means the str-strings you're sometimes giving are actually not byte > strings, but character strings anyhow, so you should be encoding > those too. bytes(s_or_U.encode('utf-8')) is a perfectly good spelling. I also like the removal of the encoding... > Kill the encoding argument, and you're left with: > > Python2.X: > - bytes(bytes_object) -> copy constructor > - bytes(str_object) -> copy the bytes from the str to the bytes object > - bytes(sequence_of_ints) -> make bytes with the values of the ints, > error on overflow > > Python3.X removes str, and most APIs that did return str return bytes > instead. Now all you have is: > - bytes(bytes_object) -> copy constructor > - bytes(sequence_of_ints) -> make bytes with the values of the ints, > error on overflow What's great is that this already works: >>> import array >>> array.array('b', [1,2,3]) array('b', [1, 2, 3]) >>> array.array('b', "hello") array('b', [104, 101, 108, 108, 111]) >>> array.array('b', u"hello") Traceback (most recent call last): File "", line 1, in ? TypeError: array initializer must be list or string >>> array.array('b', [150]) Traceback (most recent call last): File "", line 1, in ? OverflowError: signed char is greater than maximum >>> array.array('B', [150]) array('B', [150]) >>> array.array('B', [350]) Traceback (most recent call last): File "", line 1, in ? OverflowError: unsigned byte integer is greater than maximum And out of the deal we can get both signed and unsigned ints. Re: Adam Olsen > I'm starting to wonder, do we really need anything fancy? Wouldn't it > be sufficient to have a way to compactly store 8-bit integers? It already exists. It could just use another interface. 
The buffer interface offers any array the ability to return strings. That may have to change to return bytes objects in Py3k. - Josiah ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
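The array session quoted in this message was run on Python 2. A rough present-day equivalent is sketched below, assuming a Python 3 interpreter (which postdates this thread); there the text-string initializer is spelled as a bytes literal, and `overflows` is a hypothetical helper added only to keep the checks short.

```python
# Assumption: Python 3's array module. overflows() is a hypothetical
# helper, not part of the module.
import array

signed = array.array('b', [1, 2, 3])       # signed 8-bit values
unsigned = array.array('B', [150])         # unsigned 8-bit values
from_bytes = array.array('b', b"hello")    # initialized from raw bytes

def overflows(typecode, value):
    """Report whether value is out of range for the given typecode."""
    try:
        array.array(typecode, [value])
        return False
    except OverflowError:
        return True

# A text string is not accepted as an initializer for a byte array.
try:
    array.array('b', "hello")
    text_ok = True
except TypeError:
    text_ok = False

print(from_bytes.tolist(), overflows('b', 150), overflows('B', 350), text_ok)
```

The range checks behave as in the quoted session: 150 fits only the unsigned typecode, and 350 fits neither.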
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote: > On 2/13/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: >> Guido van Rossum wrote: >>> It'd be cruel and unusual punishment though to have to write >>> >>> bytes("abc", "Latin-1") >>> >>> I propose that the default encoding (for basestring instances) ought >>> to be "ascii" just like everywhere else. (Meaning, it should really be >>> the system default encoding, which defaults to "ascii" and is >>> intentionally hard to change.) >> We're talking about Py3k here: "abc" will be a Unicode string, >> so why restrict the conversion to 7 bits when you can have 8 bits >> without any conversion problems ? > > As Phillip guessed, I was indeed thinking about introducing bytes() > sooner than that, perhaps even in 2.5 (though I don't want anything > rushed). Hmm, that is probably going to be too early. As the thread shows there are lots of things to take into account, esp. since if you plan to introduce byte() in 2.x, the upgrade path to 3.x would have to be carefully planned. Otherwise, we end up introducing a feature which is meant to prepare for 3.x and then we end up causing breakage when the move is finally implemented. > Even in Py3k though, the encoding issue stands -- what if the file > encoding is Unicode? Then using Latin-1 to encode bytes by default > might not by what the user expected. Or what if the file encoding is > something totally different? (Cyrillic, Greek, Japanese, Klingon.) > Anything default but ASCII isn't going to work as expected. ASCII > isn't going to work as expected either, but it will complain loudly > (by throwing a UnicodeError) whenever you try it, rather than causing > subtle bugs later. I think there's a misunderstanding here: in Py3k, all "string" literals will be converted from the source code encoding to Unicode. 
There are no ambiguities - a Klingon character will still map to the same ordinal used to create the byte content regardless of whether the source file is encoded in UTF-8, UTF-16 or some Klingon charset (are there any ?). Furthermore, by restricting to ASCII you'd also outrule hex escapes which seem to be the natural choice for presenting binary data in literals - the Unicode representation would then only be an implementation detail of the way Python treats "string" literals and a user would certainly expect to find e.g. \x88 in the bytes object if she writes bytes('\x88'). But maybe you have something different in mind... I'm talking about ways to create bytes() in Py3k using "string" literals. >> While we're at it: I'd suggest that we remove the auto-conversion >> from bytes to Unicode in Py3k and the default encoding along with >> it. > > I'm not sure which auto-conversion you're talking about, since there > is no bytes type yet. If you're talking about the auto-conversion from > str to unicode: the bytes type should not be assumed to have *any* > properties that the current str type has, and that includes > auto-conversion. I was talking about the automatic conversion of 8-bit strings to Unicode - which was a key feature to make the introduction of Unicode less painful, but will no longer be necessary in Py3k. >> In Py3k the standard lib will have to be Unicode compatible >> anyway and string parser markers like "s#" will have to go away >> as well, so there's not much need for this anymore. >> >> (Maybe a bit radical, but I guess that's what Py3k is meant for.) > > Right. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 14 2006) >>> Python/Zope Consulting and Support ...http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 14, 2006, at 11:47 AM, M.-A. Lemburg wrote:
> The above approach would basically remove the possibility to easily
> create bytes() from literals in Py3k, since literals in Py3k create
> Unicode objects, e.g. bytes("123") would not work in Py3k.

That is true. And I think that is correct. There should be b"string" syntax.

> It's hard to imagine how you'd provide a decent upgrade path
> for bytes() if you introduce the above semantics in Py2.x.
>
> People would start writing bytes("123") in Py2.x and expect
> it to also work in Py3k, which it wouldn't.

Agreed, it won't work.

> To prevent this, you'd have to outrule bytes() construction
> from strings altogether, which doesn't look like a viable
> option either.

I don't think you have to do that, you just have to provide b"string".

I'd like to point out that the previous proposal had the same issue:

On Feb 13, 2006, at 8:11 PM, Guido van Rossum wrote:
> On 2/13/06, James Y Knight <[EMAIL PROTECTED]> wrote:
>> In py3k, when the str object is eliminated, then what do you have?
>> Perhaps
>> - bytes("\x80"), you get an error, encoding is required. There is no
>> such thing as "default encoding" anymore, as there's no str object.
>> - bytes("\x80", encoding="latin-1"), you get a bytestring with a
>> single byte of value 0x80.
>>
>
> Yes to both again.

James
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 14, 2006, at 11:25 AM, Phillip J. Eby wrote: > At 11:08 AM 2/14/2006 -0500, James Y Knight wrote: >> I like it, it makes sense. Unicode strings are simply not allowed as >> arguments to the byte constructor. Thinking about it, why would it be >> otherwise? And if you're mixing str-strings and unicode-strings, that >> means the str-strings you're sometimes giving are actually not byte >> strings, but character strings anyhow, so you should be encoding >> those too. bytes(s_or_U.encode('utf-8')) is a perfectly good >> spelling. > Actually, I think you mean: > > if isinstance(s_or_U, str): > s_or_U = s_or_U.decode('utf-8') > > b = bytes(s_or_U.encode('utf-8')) > > Or maybe: > > if isinstance(s_or_U, unicode): > s_or_U = s_or_U.encode('utf-8') > > b = bytes(s_or_U) > > Which is why I proposed that the boilerplate logic get moved *into* > the bytes constructor. I think this use case is going to be common > in today's Python, but in truth I'm not as sure what bytes() will > get used *for* in today's Python. I'm probably overprojecting > based on the need to use str objects now, but bytes aren't going to > be a replacement for str for a good while anyway. I most certainly *did not* mean that. If you are mixing together str and unicode instances, the str instances _must be_ in the default encoding (ascii). Otherwise, you are bound for failure anyhow, e.g. ''.join(['\x95', u'1']). Str is used for two things right now: 1) a byte string. 2) a unicode string restricted to 7bit ASCII. These two uses are separate and you cannot mix them without causing disaster. You've created an interface which can take either a utf8 byte-string, or unicode character string. But that's wrong and can only cause problems. It should take either an encoded bytestring, or a unicode character string. Not both. 
If it takes a unicode character string, there are two ways of spelling that in current python: a "str" object with only ASCII in it, or a "unicode" object with arbitrary characters in it. bytes(s_or_U.encode('utf-8')) works correctly with both. James ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote: > On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote: > > "\x80".encode('latin-1') > > But in 2.5 we can't change that to return a bytes object without > creating HUGE incompatibilities. People could spell it bytes(s.encode('latin-1')) in order to make it work in 2.X. That spelling would provide a way of ensuring the type of the return value. > You missed the part where I said that introducing the bytes type > *without* a literal seems to be a good first step. A new type, even > built-in, is much less drastic than a new literal (which requires > lexer and parser support in addition to everything else). Are you concerned about the implementation effort? If so, I don't think that's justified since adding a new string prefix should be pretty straightforward (relative to rest of the effort involved). Are you comfortable with the proposed syntax? Neil ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Guido van Rossum wrote: > >>In py3k, when the str object is eliminated, then what do you have? > >>Perhaps > >>- bytes("\x80"), you get an error, encoding is required. There is no > >>such thing as "default encoding" anymore, as there's no str object. > >>- bytes("\x80", encoding="latin-1"), you get a bytestring with a > >>single byte of value 0x80. > > > > Yes to both again. > > Please reconsider, and don't give bytes() an encoding= argument. > It doesn't need one. In Python 3, people should write > > "\x80".encode("latin-1") > > if they absolutely want to, although they better write > > bytes([0x80]) > > Now, the first form isn't valid in 2.5, but > > bytes(u"\x80".encode("latin-1")) > > could work in all versions. In 3.0, I agree that .encode() should return a bytes object. I'd almost be convinced that in 2.x bytes() doesn't need an encoding argument, except it will require excessive copying. bytes(u.encode("utf8")) will certainly use 2*len(u) bytes space (plus a constant); bytes(u, "utf8") only needs len(u) bytes. In 3.0, bytes(s.encode(xxx)) would also create an extra copy, since the bytes type is mutable (we all agree on that, don't we?). I think that's a good enough argument for 2.x. We could keep the extended API as an alternative form in 3.x, or automatically translate calls to bytes(x, y) into x.encode(y). BTW I think we'll need a new PEP instead of PEP 332. The latter has almost no details relevant to this discussion, and it seems to treat bytes as a near-synonym for str in 2.x. That's not the way this discussion is going it seems. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
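For readers following along in a current interpreter, the spellings compared in this message can be checked directly. This is a minimal sketch assuming Python 3 as eventually released, where .encode() returns bytes and the constructor accepts an encoding argument; it does not describe what any 2.x release did, and the variable names are only illustrative.

```python
# Python 3 sketch: the three spellings discussed above build equal objects.
text = "\x80"  # a single character with code point 0x80

via_encode = text.encode("latin-1")        # the spelling Martin prefers for 3.x
via_ints = bytes([0x80])                   # an explicit list of integers
via_constructor = bytes(text, "latin-1")   # constructor with an encoding argument

print(via_encode == via_ints == via_constructor)

# Without an encoding, a text argument is rejected rather than guessed at.
try:
    bytes(text)
except TypeError as exc:
    print("TypeError:", exc)
```

Whether the constructor form actually avoids an intermediate copy is an implementation detail that this sketch does not measure.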
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Thomas Wouters <[EMAIL PROTECTED]> wrote:
> On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:
>
> > But adding an encoding doesn't help. The str.encode() method always
> > assumes that the string itself is ASCII-encoded, and that's not good
> > enough:
> >
> > >>> "abc".encode("latin-1")
> > 'abc'
> > >>> "abc".decode("latin-1")
> > u'abc'
> > >>> "abc\xf0".decode("latin-1")
> > u'abc\xf0'
> > >>> "abc\xf0".encode("latin-1")
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> > 3: ordinal not in range(128)

(Note that I've since been convinced that bytes(s) where type(s) == str should just return a bytes object containing the same bytes as s, regardless of encoding. So basically you're preaching to the choir now. The only remaining question is what if anything to do with an encoding argument when the first argument is of type str...)

> These comments disturb me. I never really understood why (byte) strings grew
> the 'encode' method, since 8-bit strings *are already encoded*, by their
> very nature. I mean, I understand it's useful because Python does
> non-unicode encodings like 'hex', but I don't really understand *why*. The
> benefits don't seem to outweigh the cost (but that's hindsight.)

It may also have something to do with Jython compatibility (which has str and unicode being the same thing) or 3.0 future-proofing.

> Directly encoding a (byte) string into a unicode encoding is mostly useless,
> as you've shown. The only use-case I can think of is translating ASCII in,
> for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op,
> unless the system encoding isn't 'ascii' (and that's pretty rare, and not
> something a Python programmer should depend on.) On the other hand, the fact
> that (byte) strings have an 'encode' method creates a lot of confusion in
> unicode-newbies, and causes programs to break only when input is non-ASCII.
> And non-ASCII input just happens too often and too unpredictably in > 'real-world' code, and not enough in European programmers' tests ;P Oh, there are lots of ways that non-ASCII input can break code, you don't have to invoke encode() on str objects to get that effect. :/ > Unicode objects and strings are not the same thing. We shouldn't treat them > as the same thing. Well in 3.0 they *will* be the same thing, and in Jython they already are. > They share an interface (like lists and tuples do), and > if you only use that interface, treating them as the same kind object is > mostly ok. They actually share *less* of an interface than lists and tuples, > though, as comparing strings to unicode objects can raise an exception, > whereas comparing lists to tuples is not expected to. No, it causes silent surprises since [1,2,3] != (1,2,3). > For anything less > trivial than indexing, slicing and most of the string methods, and anything > what so ever involving non-ASCII (or, rather, non-system-encoding), unicode > objects and strings *must* be treated separately. For instance, there is no > correct way to do: > > s.split("\x80") > > unless you know the type of 's'. If it's unicode, you want u"\x80" instead > of "\x80". If it's not unicode, splitting "\x80" may not even be sensible, > but you wouldn't know from looking at the code -- maybe it expects a > specific encoding (or encoding family), maybe not. As soon as you deal with > unicode, you need to really understand the concept, and too many programmers > don't. And it's very hard to tell from someone's comments whether they fail > to understand or just get some of the terminology wrong; that's why Guido's > comments about 'encoding a byte string' and 'what if the file encoding is > Unicode' scare me. The unicode/string mixup almost makes me wish Python > was statically typed. I'm mostly trying to reflect various broken mental models that users may have. 
Believe me, my own confusion is nothing compared to the confusion that occurs in less gifted users. :-) The only use case for mixing ASCII and Unicode that I *wanted* to work right was the mixing of pure ASCII strings (typically literals) with Unicode data. And that works. Where things unfortunately fall flat is when you start reading data from files or interactive input and it gives you some encoded str object instead of a Unicode object. Our mistake was that we didn't foresee this clearly enough. Perhaps open(filename).read(), where the file contains non-ASCII bytes, should have been changed to either return a Unicode string (if an encoding can somehow be guessed), or raise an exception, rather than returning an str object in some unknown (and usually unknowable) encoding. I hope to fix that in 3.0 too, BTW. > So please, please, please don't make the mistake of 'doing something' with > the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string. > It wouldn't actually be usable except for the same things as 'str
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> I'm starting to wonder, do we really need anything fancy? Wouldn't it
> be sufficient to have a way to compactly store 8-bit integers?
>
> In 2.x we could convert unicode like this:
> bytes(ord(c) for c in u"It's...".encode('utf-8'))

Yuck.

> u"It's...".byteencode('utf-8') # Shortcut for above

Yuck**2. I'd like to avoid adding new APIs to existing types to return bytes instead of str. (It's okay to change existing APIs to *accept* bytes as an alternative to str though.)

> In 3.0 it changes to:
> "It's...".encode('utf-8')
> u"It's...".byteencode('utf-8') # Same as above, kept for compatibility

No. 3.0 won't have "backward compatibility" features. That's the whole point of 3.0.

> Passing a str or unicode directly to bytes() would be an error.
> repr(bytes(...)) would produce bytes([1,2,3]).

I'm fine with that.

> Probably need a __bytes__() method that print can call, or even better
> a __print__(file) method[0]. The write() methods would of course have
> to support bytes objects.

Right on the latter.

> I realize it would be odd for the interactive interpreter to print them
> as a list of ints by default:
> >>> u"It's...".byteencode('utf-8')
> [73, 116, 39, 115, 46, 46, 46]

No. This prints the repr() which should include the type. bytes([73, 116, 39, 115, 46, 46, 46]) is the right thing to print here.

> But maybe it's time we stopped hiding the real nature of bytes from users?

That's the whole point.

> [0] By this I mean calling objects recursively and telling them what
> file to print to, rather than getting a temporary string from them and
> printing that. I always wondered why you could do that from C
> extensions but not from Python code.

I want to keep the Python-level API small.
--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Barry Warsaw <[EMAIL PROTECTED]> wrote:
> This makes me think I want an unsigned byte type, which b[0] would
> return. In another thread I think someone mentioned something about
> fixed width integral types, such that you could have an object that
> was guaranteed to be 8-bits wide, 16-bits wide, etc. Maybe you also
> want signed and unsigned versions of each. This may seem like YAGNI
> to many people, but as I've been working on a tightly embedded/
> extended application for the last few years, I've definitely had
> occasions where I wish I could more closely and more directly model
> my C values as Python objects (without using the standard workarounds
> or writing my own C extension types).

So I'm taking that the specific properties you want to model are the overflow behavior, right? N-bit unsigned is defined as arithmetic mod 2**N; N-bit signed is a bit more tricky to define but similar. These never overflow but instead just throw away bits in an exactly specified manner (2's complement arithmetic). While I personally am comfortable with writing (x+y) & 0xFFFF (for 16-bit unsigned), I can see that someone who spends a lot of time doing arithmetic in this field might want specialized types.

But I'm not sure that that's what the Numeric folks want -- I believe they're more interested in saving space, not in the mod 2**N properties. So (here I'm to some extent guessing) they have different array types whose elements are ints or floats of various widths; I'm guessing they also have scalars of those widths for consistency or to guide the creation of new arrays from scalars. I wouldn't be surprised if, rather than requiring N-bit 2's complement, they would prefer more flexible control over overflow -- e.g. ignore, warn, error, turn into NaN, etc.

> But anyway, without hyper-generalizing, it's still worth asking
> whether a bytes type is just a container of byte objects, where the
> contained objects would be distinct, fixed 8-bit unsigned integral
> types.
There's certainly a point to treating bytes as ints; I don't know if it's more compelling than to treating them as unit bytes. But if we decide that the bytes types contains ints, b[0] should return a plain int (whose value necessarily is in range(0, 256)), not some new unsigned-8-bit type. And creating a bytes object from a list of ints should accept any input values as long as their __index__ value is in that same range. I.e. bytes([1, 2L]) should be the same as bytes([1L, 2]); and bytes([-1]) should raise a ValueError. > > There's also the consideration for APIs that, informally, accept > > either a string or a sequence of objects. Many of these exist, and > > they are probably all being converted to support unicode as well as > > str (if it makes sense at all). Should a bytes object be considered as > > a sequence of things, or as a single thing, from the POV of these > > types of APIs? Should we try to standardize how code tests for the > > difference? (Currently all sorts of shortcuts are being taken, from > > isinstance(x, (list, tuple)) to isinstance(x, basestring).) > > I think bytes objects are very much like string objects today -- > they're the photons of Python since they can act like either > sequences or scalars, depending on the context. For example, we have > code that needs to deal with situations where an API can return > either a scalar or a sequence of those scalars. So we have a utility > function like this: > > def thingiter(obj): > try: > it = iter(obj) > except TypeError: > yield obj > else: > for item in it: > yield item > > Maybe there's a better way to do this, but the most obvious problem > is that (for our use cases), this fails for strings because in this > context we want strings to act like scalars. So we add a little test > just before the "try:" like "if isinstance(obj, basestring): yield > obj". But that's yucky. 
> > I don't know what the solution is -- if there /is/ a solution short > of special case tests like above, but I think the key observation is > that sometimes you want your string to act like a sequence and > sometimes you want it to act like a scalar. I suspect bytes objects > will be the same way. I agree it's icky, and I'd rather not design APIs like that -- but I can't help it that others continue to want to use that idiom. I also agree that most likely we'll want to treat bytes the same as strings here. But no basestring (bytes are mutable and don't behave like sequences of characters). -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
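The utility function quoted in this message, together with the special case Barry describes, can be sketched as follows. This is a hedged Python 3 adaptation, not the code from the original application: basestring no longer exists there, so the tuple (str, bytes, bytearray) is an assumption standing in for "things that should act like scalars".

```python
def thingiter(obj):
    """Yield obj itself if it is scalar-like, otherwise yield its items."""
    # Assumption: text and byte containers are treated as scalars here,
    # which is the special case described in the message above.
    if isinstance(obj, (str, bytes, bytearray)):
        yield obj
        return
    try:
        it = iter(obj)
    except TypeError:
        yield obj          # not iterable at all: a plain scalar
    else:
        yield from it      # a real sequence: hand out its items


print(list(thingiter("abc")))
print(list(thingiter(["abc", "def"])))
print(list(thingiter(42)))
```

Note that the mutable byte container has to be listed explicitly in the isinstance() test, which is the extra ickiness the thread anticipates if bytes does not share a base class with strings.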
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Adam Olsen <[EMAIL PROTECTED]> wrote: > What would that imply for repr()? To support eval(repr(x)) it would > have to produce whatever format the source code includes to begin > with. I'm not sure that's a requirement. (I do think that in 2.x, str(bytes(s)) == s should hold as long as type(s) == str.) > If I understand correctly there's three main candidates: > 1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x I'm not sure what you mean, but I'm guessing you're thinking that the repr() of a bytes object created from bytes('abc\xf0') would be bytes('abc\xf0') under this rule. What's so bad about that? > 2. Direct copying to str/unicode if it's only ascii values, switching > to a list of hex literals if there's any non-ascii values That works for me too. But why hex literals? As MvL stated, a list of decimals would be just as useful. > 3. b"foo" literal with ascii for all ascii characters (other than \ > and "), \xFF for individual characters that aren't ascii > > Given the choice I prefer the third option, with the second option as > my runner up. The first option just screams "silent errors" to me. The 3rd is out of the running for many reasons. I'm not sure I understand your "silent errors" fear; can you elaborate? -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
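To make the second repr() candidate concrete, here is a small sketch of what it might produce, using a list of decimals as Guido suggests. The function name repr_option_2 is hypothetical, the output format is only one reading of the proposal, and the code runs on Python 3 where a bytes object already yields ints when iterated.

```python
def repr_option_2(data):
    """Hypothetical repr: a string if all values are ASCII, else a list of ints."""
    values = list(data)
    if all(v < 128 for v in values):
        # Only ASCII values: show them as a string argument.
        return "bytes(%r)" % bytes(values).decode("ascii")
    # At least one non-ASCII value: fall back to a list of decimals.
    return "bytes(%r)" % values


print(repr_option_2(b"abc"))
print(repr_option_2(b"abc\xf0"))
```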
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote: > At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote: > >On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote: > > > I didn't mean that it was the only purpose. In Python 2.x, practical code > > > has to sometimes deal with "string-like" objects. That is, code that > > > takes > > > either strings or unicode. If such code calls bytes(), it's going to want > > > to include an encoding so that unicode conversions won't fail. > > > >That sounds like a rather hypothetical example. Have you thought it > >through? Presumably code that accepts both str and unicode either > >doesn't care about encodings, but simply returns objects of the same > >type as the arguments -- and then it's unlikely to want to convert the > >arguments to bytes; or it *does* care about encodings, and then it > >probably already has to special-case str vs. unicode because it has to > >control how str objects are interpreted. > > Actually, it's the other way around. Code that wants to output > uninterpreted bytes right now and accepts either strings or Unicode has to > special-case *unicode* -- not str, because str is the only "bytes type" we > currently have. But this is assuming that the str input is indeed uninterpreted bytes. That may be a tacit assumption or agreement but it may be wrong. Also, there are many ways to interpret "uninterpreted bytes" -- is it an image, a sound file, or UTF-8 text? In 2 out of those 3, passing unicode is more likely a bug than anything else (except in Jython). > This creates an interesting issue in WSGI for Jython, which of course only > has one (unicode-based) string type now. Since there's no bytes type in > Python in general, the only solution we could come up with was to treat > such strings as latin-1: I believe that's the general convention in Jython, as it matches the default (albeit deprecated) conversion between bytes and characters in Java itself. 
> http://www.python.org/peps/pep-0333.html#unicode-issues > > This is why I'm biased towards latin-1 encoding of unicode to bytes; it's > "the same thing" as an uninterpreted string of bytes. But in CPython this is not how this is generally done. > I think the difference in our viewpoints is that you're still thinking > "string" thoughts, whereas I'm thinking "byte" thoughts. Bytes are just > bytes; they don't *have* an encoding. I think when one side of the equation is Unicode, in CPython, I can be forgiven for thinking string thoughts, since Unicode is never used to carry binary bytes in CPython. You may have to craft some kind of different rule for Jython; it doesn't have a default encoding used when str meets unicode. > So, if you think of "converting a string to bytes" as meaning "create an > array of numerals corresponding to the characters in the string", then this > leads to a uniform result whether the characters are in a str or a unicode > object. In other words, to me, bytes(str_or_unicode) should be treated as: > > bytes(map(ord, str_or_unicode)) > > In other words, without an encoding, bytes() should simply treat str and > unicode objects *as if they were a sequence of integers*, and produce an > error when an integer is out of range. This is a logical and consistent > interpretation in the absence of an encoding, because in that case you > don't care about the encoding - it's just raw data. I see your point (now that you mentioned Jython). But I still don't think that this is a good default for CPython. > If, however, you include an encoding, then you're stating that you want to > encode the *meaning* of the string, not merely its integer values. Note that in Python 3000 we won't be using str/unicode to carry integer values around, since we will have the bytes type. So there, it makes sense to think of the conversion to always involve an encoding, possibly a default one. (And I think the default might more usefully be UTF-8 then.) 
> >What would bytes("abc\xf0", "latin-1") *mean*? Take the string > >"abc\xf0", interpret it as being encoded in XXX, and then encode from > >XXX to Latin-1. But what's XXX? As I showed in a previous post, > >"abc\xf0".encode("latin-1") *fails* because the source for the > >encoding is assumed to be ASCII. > > I'm saying that XXX would be the same encoding as you specified. i.e., > including an encoding means you are encoding the *meaning* of the string. That would be the same as ignoring the encoding argument when the input is str in CPython 2.x, right? I believe we started out saying we didn't want to ignore the encoding. Perhaps we need to reconsider that, given the Jython requirement? Then code that converts str to bytes and needs to be portable between Jython and CPython could write b = bytes(s, "latin-1") > However, I believe I mainly proposed this as an alternative to having > bytes(str_or_unicode) work like bytes(map(ord,str_or_unicode)), which I > think is probably a saner default. Sorry, i still don't buy that
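The bytes(map(ord, str_or_unicode)) reading proposed in the quoted text can be tried out in a current interpreter. This is a sketch assuming Python 3, where the constructor accepts any iterable of integers; the helper name bytes_from_code_points is hypothetical.

```python
def bytes_from_code_points(text):
    """Treat each character as a bare integer, with no encoding step."""
    return bytes(map(ord, text))


latin = bytes_from_code_points("abc\xf0")
# For code points below 256 this coincides with Latin-1 encoding.
print(latin == "abc\xf0".encode("latin-1"))

# A code point outside range(0, 256) is an error, not a silent truncation.
try:
    bytes_from_code_points("\u20ac")   # Euro sign, code point 8364
except ValueError as exc:
    print("ValueError:", exc)
```

This shows why the proposal is equivalent to a Latin-1 default for in-range characters, which is the point under dispute in the message.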
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Barry Warsaw <[EMAIL PROTECTED]> wrote:
> A related question: what would bytes([104, 101, 108, 108, 111, 8004])
> return? An exception hopefully.

Absolutely.

> I also think you'd want bytes([x
> for x in some_bytes_object]) to return an object equal to the original.

You mean if type(some_bytes_object) is bytes? Yes. But that doesn't constrain the API much.

Anyway, I'm now convinced that bytes should act as an array of ints, where the ints are restricted to range(0, 256) but have type int.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
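The "array of ints restricted to range(0, 256)" behaviour described in this message can be observed in Python 3 as eventually released. The following is a short sketch under that assumption, with illustrative variable names.

```python
data = bytes([104, 101, 108, 108, 111])

# Indexing gives a plain int, not a one-element bytes object.
first = data[0]
print(first, type(first).__name__)

# Rebuilding from the items gives an object equal to the original.
print(bytes([x for x in data]) == data)

# Values outside range(0, 256) are rejected.
for bad in ([104, 8004], [-1]):
    try:
        bytes(bad)
    except ValueError as exc:
        print(bad, "->", exc)
```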
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > Guido van Rossum wrote: > > As Phillip guessed, I was indeed thinking about introducing bytes() > > sooner than that, perhaps even in 2.5 (though I don't want anything > > rushed). > > Hmm, that is probably going to be too early. As the thread shows > there are lots of things to take into account, esp. since if you > plan to introduce bytes() in 2.x, the upgrade path to 3.x would > have to be carefully planned. Otherwise, we end up introducing > a feature which is meant to prepare for 3.x and then we end up > causing breakage when the move is finally implemented. You make a good point. Someone probably needs to write up a new PEP summarizing this discussion (or rather, consolidating the agreement that is slowly emerging, where there is agreement, and summarizing the key open questions). > > Even in Py3k though, the encoding issue stands -- what if the file > > encoding is Unicode? Then using Latin-1 to encode bytes by default > > might not by what the user expected. Or what if the file encoding is > > something totally different? (Cyrillic, Greek, Japanese, Klingon.) > > Anything default but ASCII isn't going to work as expected. ASCII > > isn't going to work as expected either, but it will complain loudly > > (by throwing a UnicodeError) whenever you try it, rather than causing > > subtle bugs later. > > I think there's a misunderstanding here: in Py3k, all "string" > literals will be converted from the source code encoding to > Unicode. There are no ambiguities - a Klingon character will still > map to the same ordinal used to create the byte content regardless > of whether the source file is encoded in UTF-8, UTF-16 or > some Klingon charset (are there any ?). OK, so a string (literal or otherwise) containing a Klingon character won't be acceptable to the bytes() constructor in 3.0. It shouldn't be in 2.x either then. 
I still think that someone who types a file in Latin-1 and enters non-ASCII Latin-1 characters in a string literal and then passes it to the bytes() constructor might expect to get bytes encoded in Latin-1, and someone who types a file in UTF-8 and enters non-ASCII Unicode characters might expect to get UTF-8-encoded bytes. Since they can't both get what they want, we should disallow both, and only allow ASCII.

> Furthermore, by restricting to ASCII you'd also outrule hex escapes
> which seem to be the natural choice for presenting binary data in
> literals - the Unicode representation would then only be an
> implementation detail of the way Python treats "string" literals
> and a user would certainly expect to find e.g. \x88 in the bytes object
> if she writes bytes('\x88').

I guess we'll just have to disappoint her. Too bad for the person who wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO and certainly easier than a *mixture* of hex and ASCII like '\xabc\xdef'.

> But maybe you have something different in mind... I'm talking
> about ways to create bytes() in Py3k using "string" literals.

I'm not sure that's going to be common practice except for ASCII characters used in network protocols.

> >> While we're at it: I'd suggest that we remove the auto-conversion
> >> from bytes to Unicode in Py3k and the default encoding along with
> >> it.
> >
> > I'm not sure which auto-conversion you're talking about, since there
> > is no bytes type yet. If you're talking about the auto-conversion from
> > str to unicode: the bytes type should not be assumed to have *any*
> > properties that the current str type has, and that includes
> > auto-conversion.
>
> I was talking about the automatic conversion of 8-bit strings to
> Unicode - which was a key feature to make the introduction of
> Unicode less painful, but will no longer be necessary in Py3k.

OK.
The bytes type certainly won't have this property.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
> People could spell it bytes(s.encode('latin-1')) in order to make it
> work in 2.X. That spelling would provide a way of ensuring the type
> of the return value.

At the cost of an extra copying step.

[Guido]
> > You missed the part where I said that introducing the bytes type
> > *without* a literal seems to be a good first step. A new type, even
> > built-in, is much less drastic than a new literal (which requires
> > lexer and parser support in addition to everything else).
>
> Are you concerned about the implementation effort? If so, I don't
> think that's justified since adding a new string prefix should be
> pretty straightforward (relative to rest of the effort involved).

Not so much the implementation but also the documentation, updating 3rd party Python preprocessors, etc.

> Are you comfortable with the proposed syntax?

Not entirely, since I don't know what b"abcdef" would mean (where is a Unicode Euro character typed in whatever source encoding was used). Instead of b"abc" (only ASCII) you could write bytes("abc"). Instead of b"\xf0\xff\xee" you could write bytes([0xf0, 0xff, 0xee]).

The key disconnect for me is that if bytes are not characters, we shouldn't use a literal notation that resembles the literal notation for characters. And there's growing consensus that a bytes type should be considered as an array of (8-bit unsigned) ints.

Also, bytes objects are (in my mind anyway) mutable. We have no other literal notation for mutable objects. What would the following code print?

    for i in range(2):
        b = b"abc"
        print b
        b[0] = ord("A")

Would the second output line print abc or Abc?

I guess the only answer that makes sense is that it should print abc both times; but that means that b"abc" must be internally implemented by creating a new bytes object each time. Perhaps the implementation effort isn't so minimal after all...
(PS why is there a reply-to in your email that excludes you from the list of recipients but includes me?)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
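The loop in this message can be approximated in a current interpreter. In Python 3 as eventually released, the b"..." literal produces an immutable object and the mutable counterpart is bytearray, so the sketch below builds a fresh bytearray on every pass; that substitution is an assumption made here to model "creating a new bytes object each time", not the semantics under discussion in the thread.

```python
outputs = []
for i in range(2):
    b = bytearray(b"abc")    # a new mutable object is built on every pass
    outputs.append(bytes(b)) # record what would have been printed
    b[0] = ord("A")          # mutates only the object from this pass

print(outputs)
```

Because each pass starts from a new object, the mutation in the first pass is not visible in the second, which matches the "abc both times" answer.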
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Tue, 2006-02-14 at 15:13 -0800, Guido van Rossum wrote:
> So I'm taking that the specific properties you want to model are the
> overflow behavior, right? N-bit unsigned is defined as arithmetic mod
> 2**N; N-bit signed is a bit more tricky to define but similar. These
> never overflow but instead just throw away bits in an exactly
> specified manner (2's complement arithmetic).

That would be my use case, yep.

> While I personally am comfortable with writing (x+y) & 0xFFFF (for
> 16-bit unsigned), I can see that someone who spends a lot of time
> doing arithmetic in this field might want specialized types.

I'd put it in the "annoying, although there exists a workaround that might confound newbies" category. Which means it's definitely not urgent enough to address for 2.5 -- if ever -- especially given your current stance on bytes(bunch_of_ints)[0]. The two are of course separate issues, but thinking about one led to the other.

> But I'm not sure that that's what the Numeric folks want -- I believe
> they're more interested in saving space, not in the mod 2**N
> properties.

Could be. I don't care about space savings. And I definitely have no clue what the Numeric folks want. ;)

> There's certainly a point to treating bytes as ints; I don't know if
> it's more compelling than to treating them as unit bytes. But if we
> decide that the bytes types contains ints, b[0] should return a plain
> int (whose value necessarily is in range(0, 256)), not some new
> unsigned-8-bit type. And creating a bytes object from a list of ints
> should accept any input values as long as their __index__ value is in
> that same range.
>
> I.e. bytes([1, 2L]) should be the same as bytes([1L, 2]); and
> bytes([-1]) should raise a ValueError.

That seems fine to me.

> I agree it's icky, and I'd rather not design APIs like that -- but I
> can't help it that others continue to want to use that idiom. I also
> agree that most likely we'll want to treat bytes the same as strings
> here. But no basestring (bytes are mutable and don't behave like
> sequences of characters).

That's interesting. So bytes really behave a lot more like some weird string/lists hybrid then? It makes some sense. You read 801 bytes from a binary file, twiddle bytes 223 and 741 and then write those bytes back out to a different binary file.

If we don't inherit from basestring, what I'm worried about is that for those who do continue to use the idiom described previously, we'll have to extend our isinstance() to include both basestring and bytes. Which definitely gets ickier. But if bytes are mutable, as makes sense, then it also makes sense that they don't inherit from basestring.

BTW, using that idiom is a bit of a hedge against such API (which you may not control). It allows us to say "okay, at /this/ point I don't know whether I have a scalar or a sequence, but from this point forward, I know I have something I can safely iterate over."

I wonder if it makes sense to add a more fundamental abstract base class that can be used as a marker for "photonic behavior". I don't know what that class would be called, but you'd then have a hierarchy like this:

    photonic
        basestring
            str
            unicode
        bytes

OTOH, it seems like a lot to add for a specialized (and some would say dubious) use case.

-Barry
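The masking workaround mentioned in this exchange generalizes to any width: N-bit unsigned arithmetic keeps the result modulo 2**N, and the signed case reinterprets the top bit. The following is a minimal sketch of that idea; the helper names wrap_unsigned and wrap_signed are hypothetical and not part of any proposal in the thread.

```python
def wrap_unsigned(value, bits):
    """Reduce value modulo 2**bits, as an N-bit unsigned register would."""
    return value & ((1 << bits) - 1)


def wrap_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value = wrap_unsigned(value, bits)
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the value represents a negative number.
    return value - (1 << bits) if value & sign_bit else value


print(wrap_unsigned(0xFFFF + 1, 16))   # 16-bit unsigned overflow wraps around
print(wrap_signed(127 + 1, 8))         # 8-bit signed overflow wraps negative
```

This relies on Python's integers behaving like infinitely sign-extended two's complement values under the & operator, so negative inputs are handled as well.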
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 2/14/06, Neil Schemenauer wrote:
> > People could spell it bytes(s.encode('latin-1')) in order to make it
> > work in 2.X.
>
> Guido wrote:
> > At the cost of an extra copying step.
>
> That sounds like an implementation issue. If it is important
> enough to matter, then why not just add some smarts to the
> bytes constructor?

Short answer: you can't.

> If the argument is a str, and the constructor owns the only
> reference, then go ahead and use the argument's own
> underlying array; the string itself will be deallocated when
> (or before) the constructor returns, so no one else can use
> it expecting an immutable.

Hard to explain, but the VM usually keeps an extra reference on the
stack so the refcount is never 1. But you can't rely on that, so you
can't assume that it's safe to reuse the storage when the refcount
is >1.

Also, since the str's underlying array is allocated inline with the
str header, this requires str and bytes to have the same object
layout. But since bytes are mutable, they can't.

Summary: you don't understand the implementation well enough to
suggest these kinds of things.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:

> The only remaining question is what if anything to do with an
> encoding argument when the first argument is of type str...)

From what you said earlier about str in 2.x being interpretable
as a unicode string which contains only ascii, it seems to me
that if you say

    bytes(s, encoding)

where s is a str, then by the presence of the encoding argument
you're saying that you want s to be treated as unicode and
encoded using the specified encoding. So the result should be
the same as

    bytes(u, encoding)

where u is a unicode string containing the same code points as s.
This implies that it should be an error if s contains non-ascii
characters.

This interpretation would satisfy the requirement for a single
call signature covering both unicode and
str-used-as-ascii-characters, while providing a different call
signature (without encoding) for str-used-as-bytes.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiam!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
[EMAIL PROTECTED]                  +--------------------------------------+
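For comparison, here is how the "encoding argument means treat it as text" reading looks in present-day Python 3, where str is always Unicode. This is a sketch of current behaviour, not of the 2.x design being debated; the variable name text is just for illustration.

```python
text = "abc\xf0"  # four code points, the last one is U+00F0

# With an encoding argument the string is treated as text and encoded,
# which gives the same result as calling str.encode() directly.
latin1_bytes = bytes(text, "latin-1")
print(latin1_bytes == text.encode("latin-1"))  # True

# A stricter codec refuses characters it cannot represent.
try:
    bytes(text, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Without an encoding argument, text is not accepted at all.
try:
    bytes(text)
    bare_ok = True
except TypeError:
    bare_ok = False
print(bare_ok)  # False
```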
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:
> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>
>> At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>>
>>> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>>>
>>> What would bytes("abc\xf0", "latin-1") *mean*?
>>
>> I'm saying that XXX would be the same encoding as you specified. i.e.,
>> including an encoding means you are encoding the *meaning* of the string.

No, this is wrong. As I understand it, the encoding
argument to bytes() is meant to specify how to *encode*
characters into the bytes object. If you want to be able
to specify how to *decode* a str argument as well, you'd
need a third argument.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiam!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
[EMAIL PROTECTED]                  +--------------------------------------+
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Greg Ewing wrote:
> Guido van Rossum wrote:
>> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>>
>>> At 04:29 PM 2/13/2006 -0800, Guido van Rossum wrote:
>>>> On 2/13/06, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
>>>> What would bytes("abc\xf0", "latin-1") *mean*?
>>> I'm saying that XXX would be the same encoding as you specified. i.e.,
>>> including an encoding means you are encoding the *meaning* of the string.
>
> No, this is wrong. As I understand it, the encoding
> argument to bytes() is meant to specify how to *encode*
> characters into the bytes object. If you want to be able
> to specify how to *decode* a str argument as well, you'd
> need a third argument.

I'm not sure I understand why this would be needed? But maybe it's
still too early to pin anything down.

My first impression and thoughts were: (and seems incorrect now)

    bytes(object) -> byte sequence of objects value

Basically a "memory dump" of objects value.

And so...

    object(bytes) -> copy of original object

This would reproduce a copy of the original object as long as the from
and to object are the same type with no encoding needed. If they are
different then you would get garbage, or an error. But that would be a
programming error and not a language issue. It would be up to the
programmer to not do that.

Of course this is one of those easier to say than do concepts I'm sure.

And I was thinking a bytes argument of more than one item would indicate
a byte sequence.

    bytes(1,2,3) -> bytes([1,2,3])

Where any values above 255 would give an error, but it seems an explicit
list is preferred. And that's fine because it creates a way for bytes
to know how to handle everything else. (I think)

    bytes([1,2,3]) -> bytes([1,2,3])

Which is fine... so ???

    b = bytes(0L) -> bytes([0,0,0,0])
    long(b) -> 0L           convert it back to 0L

And ...

    b = bytes([0L]) -> bytes([0])   # a single byte
    int(b) -> 0             convert it back to 0
    long(b) -> 0L

It's up to the programmer to know if it's safe. Working with raw data
is always a "programmer needs to be aware of what's going on" thing.

But would it be any different with strings? You wouldn't ever want to
encode one type's bytes into a different type directly. It would be
better to just encode it back to the original type, then use *its*
encoding method to change it.

so...

    b = bytes(s) -> bytes( raw sequence of bytes )

Whether or not you get a single byte per char or multiple bytes per
character would depend on the string's encoding.

    s = str(bytes, encoding) -> original string

You need to specify it here, because there is more than one string
encoding. To avoid encodings entirely we would need a type for each
encoding. (which isn't really avoiding anything) And it's the "raw
data so programmer needs to be aware" situation again. Don't decode to
something other than what it is.

If someone needs automatic encoding/decoding, then they probably should
write a class to do what they want. Something roughly like...

    class bytekeeper(object):
        b = None
        t = None
        e = None
        def __init__(self, obj, enc='bytes'):   # or whatever encoding
            self.e = enc
            self.t = type(obj)
            self.b = bytes(obj)
        def decode(self):
            ...

Would we be able to subclass bytes?

    class bytekeeper(bytes): ?
        ...

Ok.. enough rambling... I wonder how much of this is way out in left
field. ;)

cheers,
   Ronald Adam

And as fa
In this case the encoding argument would only be needed not to
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 2/14/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > In 3.0 it changes to:
> > "It's...".encode('utf-8')
> > u"It's...".byteencode('utf-8') # Same as above, kept for compatibility
>
> No. 3.0 won't have "backward compatibility" features. That's the whole
> point of 3.0.

Conceded.

> > I realize it would be odd for the interactive interpreter to print them
> > as a list of ints by default:
> > >>> u"It's...".byteencode('utf-8')
> > [73, 116, 39, 115, 46, 46, 46]
>
> No. This prints the repr() which should include the type. bytes([73,
> 116, 39, 115, 46, 46, 46]) is the right thing to print here.

Typo, sorry :)

--
Adam Olsen, aka Rhamphoryncus
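The point about repr() carrying the type can be checked against the bytes type that Python 3 eventually shipped. This is a sketch of current behaviour, not of the byteencode() spelling floated above, which never existed; the variable names are illustrative.

```python
import ast

data = "It's...".encode("utf-8")

# The integer values quoted in the thread.
print(list(data))  # [73, 116, 39, 115, 46, 46, 46]

# repr() of a bytes object is a literal that names its own type (the b
# prefix), so it round-trips instead of looking like a plain list of ints.
round_tripped = ast.literal_eval(repr(data))
print(round_tripped == data)  # True
print(type(round_tripped) is bytes)  # True
```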
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 2/13/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > If I understand correctly there's three main candidates:
> > 1. Direct copying to str in 2.x, pretending it's latin-1 in unicode in 3.x
>
> I'm not sure what you mean, but I'm guessing you're thinking that the
> repr() of a bytes object created from bytes('abc\xf0') would be
>
> bytes('abc\xf0')
>
> under this rule. What's so bad about that?

See below.

> > 2. Direct copying to str/unicode if it's only ascii values, switching
> > to a list of hex literals if there's any non-ascii values
>
> That works for me too. But why hex literals? As MvL stated, a list of
> decimals would be just as useful.

PEBKAC. Yeah, decimals are simpler and shorter even.

> > 3. b"foo" literal with ascii for all ascii characters (other than \
> > and "), \xFF for individual characters that aren't ascii
> >
> > Given the choice I prefer the third option, with the second option as
> > my runner up. The first option just screams "silent errors" to me.
>
> The 3rd is out of the running for many reasons.
>
> I'm not sure I understand your "silent errors" fear; can you elaborate?

I think it's that someone will create a unicode object with real
latin-1 characters and it'll get passed through without errors, the
code assuming it's 8bit-as-latin-1. If they had put other unicode
characters in they would have gotten an exception instead.

However, at this point all the posts on latin-1 encoding/decoding have
become so muddled in my mind that I don't know what they're
suggesting. I think I'll wait for the pep to clear that up.

--
Adam Olsen, aka Rhamphoryncus
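The "silent errors" worry can be made concrete with the codecs as they behave in Python 3. In this sketch the helper name try_encode is hypothetical; it only reports whether a given codec accepts a string.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# A real Latin-1 character passes through latin-1 without any complaint...
print(try_encode("caf\u00e9", "latin-1"))  # b'caf\xe9'

# ...while a character outside the first 256 code points raises.
print(try_encode("\u20ac5", "latin-1"))  # None

# An ASCII default complains loudly in both cases.
print(try_encode("caf\u00e9", "ascii"))  # None
print(try_encode("\u20ac5", "ascii"))  # None
```

Under a Latin-1 default only the second string is caught, which is the asymmetry Adam describes.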
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Not entirely, since I don't know what b"abc€def" would mean
> (where € is a Unicode Euro character typed in whatever source
> encoding was used).

SyntaxError I would hope. Ascii and hex escapes only please. :)

Although I'm not arguing for or against byte literals. They do make
for a much terser form, but they're not strictly necessary.

--
Adam Olsen, aka Rhamphoryncus
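Python 3 ended up with the rule Adam hopes for here: a bytes literal may contain only ASCII characters and escapes. A small sketch that checks this with compile(); the helper name compiles is made up for the example.

```python
def compiles(source):
    """Return True if source is a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# ASCII characters plus a hex escape are accepted in a bytes literal.
print(compiles('b"abc\\x80def"'))  # True

# A raw Euro sign inside a bytes literal is rejected at compile time.
print(compiles('b"abc\u20acdef"'))  # False

# The same character is fine in an ordinary (text) string literal.
print(compiles('"abc\u20acdef"'))  # True
```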
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Ron Adam wrote:

> My first impression and thoughts were: (and seems incorrect now)
>
> bytes(object) -> byte sequence of objects value
>
> Basically a "memory dump" of objects value.

As I understand the current intentions, this is correct.
The bytes constructor would have two different signatures:

(1)  bytes(seq)               --> interprets seq as a sequence of
                                  integers in the range 0..255,
                                  exception otherwise

(2a) bytes(str, encoding)     --> encodes the characters of
(2b) bytes(unicode, encoding)     the string using the specified
                                  encoding

In (2a) the string would be interpreted as containing
ascii characters, with an exception otherwise. In 3.0,
(2a) will disappear leaving only (1) and (2b).

> And I was thinking a bytes argument of more than one item would indicate
> a byte sequence.
>
> bytes(1,2,3) -> bytes([1,2,3])

But then you have to test the argument in the one-argument
case and try to guess whether it should be interpreted as
a sequence or an integer. Best to avoid having to do that.

> Which is fine... so ???
>
> b = bytes(0L) -> bytes([0,0,0,0])

No, bytes(0L) --> TypeError because 0L doesn't implement
the iterator protocol or the buffer interface.

I suppose long integers might be enhanced to support the
buffer interface in 3.0, but that doesn't seem like a good
idea, because the bytes you got that way would depend on
the internal representation of long integers. In particular,
bytes(0x12345678L) via the buffer interface would most likely
*not* give you bytes([0x12, 0x34, 0x56, 0x78]).

Maybe types should grow a __bytes__ method?

Greg
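Signature (1) and the __bytes__ idea both exist in present-day Python 3, so they can be sketched directly. The Point class below is purely hypothetical, and int.to_bytes() is shown only as the explicit, representation-independent way to get the bytes of an integer.

```python
class Point:
    """Hypothetical type that defines its own byte representation."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __bytes__(self):
        # bytes(obj) calls this hook when it is defined.
        return bytes([self.x, self.y])


def accepts(values):
    """Return True if bytes() accepts the given sequence of ints."""
    try:
        bytes(values)
    except ValueError:
        return False
    return True


print(accepts([1, 2, 3]))  # True
print(accepts([256]))  # False: above the range 0..255
print(accepts([-1]))  # False: below the range 0..255

# The byte order is stated explicitly rather than taken from the
# interpreter's internal integer layout.
print((0x12345678).to_bytes(4, "big").hex())  # 12345678

print(list(bytes(Point(1, 2))))  # [1, 2]
```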
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
> "M" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes: M> James Y Knight wrote: >> Nice and simple. M> Albeit, too simple. M> The above approach would basically remove the possibility to M> easily create bytes() from literals in Py3k, since literals in M> Py3k create Unicode objects, e.g. bytes("123") would not work M> in Py3k. No, it just rules out a builtin easy way to create bytes() from literals. But who needs to do that? codec writers and people implementing wire protocols with bytes() that look like character strings but aren't. OK, so this makes life hard on codec writers. But those implementing wire protocols can use existing codecs, presumably 'ascii' will do 99% of the time: def make_wire_token (unicode_string, encoding='ascii'): return bytes(unicode_string.encode(encoding)) Everybody else is just asking for trouble by using bytes() for character strings. It would really be desirable to have "string" be a Unicode literal in Py3k, and u"string" a syntax error. M> To prevent [people from learning to write "bytes('string')" in M> 2.x and expecting that to work in Py3k], you'd have to outrule M> bytes() construction from strings altogether, which doesn't M> look like a viable option either. Why not? Either bytes() are the same as strings, in which case why change the name? or they're not, in which case we ask people to jump through the required hoops to create them. Maybe I'm missing some huge use case, of course, but it looks to me like the use cases are pretty specialized, and are likely to involve explicit coding anyway. -- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Guido van Rossum wrote:

> If bytes support the buffer interface, we get another interesting
> issue -- regular expressions over bytes. Brr.

We already have that:

    >>> import re, array
    >>> re.search('\2', array.array('B', [1, 2, 3, 4])).group()
    array('B', [2])
    >>>

Not sure whether to blame array or re, though...

Just
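Regular expressions over bytes did survive into Python 3, with the restriction that the pattern and the subject must both be bytes-like or both be text. A short sketch of that behaviour using only the standard re module; the variable names are illustrative.

```python
import re

payload = bytes([1, 2, 3, 4])

# A bytes pattern searched against a bytes subject.
match = re.search(b"\x02", payload)
print(match.group() == b"\x02", match.start())  # True 1

# Mixing a text pattern with a bytes subject is rejected.
try:
    re.search("\x02", payload)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```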
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Tue, 14 Feb 2006 12:31:07 -0700, Neil Schemenauer <[EMAIL PROTECTED]> wrote: >On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote: >> On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote: >> > "\x80".encode('latin-1') >> >> But in 2.5 we can't change that to return a bytes object without >> creating HUGE incompatibilities. > >People could spell it bytes(s.encode('latin-1')) in order to make it >work in 2.X. That spelling would provide a way of ensuring the type >of the return value. UIAM spelling it bytes(map(ord, s)) or bytes(s) # (bytes would do above internally) would work for str or unicode and would be forward compatible. or bytes(s, encoding_name) # if standard mapping is not desired BTW, ord(u'x') has the effect of u'x'.encode('latin-1') Note: >>> s256 = ''.join(chr(i) for i in xrange(256)) >>> assert s256.decode('latin-1') == u''.join(unichr(ord(c)) for c in s256) >>> assert map(ord, s256.decode('latin-1')) == map(ord, s256) == range(256) But this does *not* mean bytes has an implicit encoding!! It just means there is a useful 1:1 mapping between the possible bytes values and the first 256 unicode *characters*, remembering that the latter are *characters* quite apart from whatever encoding the code source may have. This is a nice safe 1:1 abstract correspondence ISTM. > >> You missed the part where I said that introducing the bytes type >> *without* a literal seems to be a good first step. A new type, even >> built-in, is much less drastic than a new literal (which requires >> lexer and parser support in addition to everything else). > >Are you concerned about the implementation effort? If so, I don't >think that's justified since adding a new string prefix should be >pretty straightforward (relative to rest of the effort involved). >Are you comfortable with the proposed syntax? > I'm -1 on special literal at this point. 
I think a special text-like literal would be misleading, because it suggests that bytes is somehow in the string family of types, which IMO it really isn't. IMO it's semantically more of a builtin array.array('B'). If we adopt the ord/unichr mappings for strings to/from bytes, and of course init also from a suitable integer sequence, we AGNI, I think. Using non-ascii non-escaped characters in string literals for specifying str ord values (as opposed to characters) is bad practice, but escaped ascii-in-whatever-source-encoding and native_literal_in_source_encoding.decode(source_encoding) seem to work: >>> for enc in 'cp437 latin-1 utf-8'.split(): ... print '\n< %s >'%enc ... print mkretesc(enc, 0xf6)[1].decode(enc) ... print repr(mkretesc(enc, 0xf6)[1]) ... print mkretesc(enc, 0xf6)[0]() ... t = mkretesc(enc, 0xf6)[0]() ... print t[0], t[1], t[2] ... print ... < cp437 > # -*- coding: cp437 -*- def foof6(): return '\xf6', 'ö', 'ö'.decode('cp437') "# -*- coding: cp437 -*-\ndef foof6(): return '\\xf6', '\x94', '\x94'.decode('cp437')\n" ('\xf6', '\x94', u'\xf6') ÷ ö ö < latin-1 > # -*- coding: latin-1 -*- def foof6(): return '\xf6', 'ö', 'ö'.decode('latin-1') "# -*- coding: latin-1 -*-\ndef foof6(): return '\\xf6', '\xf6', '\xf6'.decode('latin-1')\n" ('\xf6', '\xf6', u'\xf6') ÷ ÷ ö < utf-8 > # -*- coding: utf-8 -*- def foof6(): return '\xf6', 'ö', 'ö'.decode('utf-8') "# -*- coding: utf-8 -*-\ndef foof6(): return '\\xf6', '\xc3\xb6', '\xc3\xb6'.decode('utf-8')\n" ('\xf6', '\xc3\xb6', u'\xf6') ÷ +¦ ö The source looks the same viewed as characters, but you can see the differences in the repr values. But the consequence of source-encoding ord values determining str values is that if e.g. you imported this foo function from variously encoded sources, only the escaped and unicode have the proper ord value. The middle one comes from the native literal source encoding. So until str becomes unicode, ascii or ascii escapes are a must for ord-specifying. 
Afer str becomes unicode, escapes will still work, but the unichr/ord symmetry will allow using the full first 256 unicode characters to specify byte type values if desired. (This happens to correspond to latin-1, but don't mention it ;-) It would make possible a round-trippable repr as bytes('...') using ascii+escaped ascii, and full-256 unicode string literals backwards-compatibly after py3k. Have I missed a pitfall? Hope the output got through to your screen. The first and last in the 3-character lines should always be division sign and umlaut o. The problematical middle ones should be cp437 translations of the middle hex values, since that is the screen I copied from (umluat o, division sign, and plus, vertical_bar for the translation of the utf-8 encoding pair. That one illustrates the problem of returning a "character" encoded in utf-8 thinking single-byte ord value.). BTW, should bytes be freezable? Regards, Bengt Richter _
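The 1:1 correspondence between byte values and the first 256 Unicode code points that Bengt relies on can be verified in Python 3, where str is already Unicode. This sketch restates his 2.x assertions in current syntax; it illustrates the mapping only and is not an endorsement of either side of the default-encoding argument.

```python
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# ord() of each character equals the original byte value...
print([ord(c) for c in as_text] == list(range(256)))  # True

# ...so mapping ord over the text rebuilds the same bytes,
# exactly as encoding with latin-1 does.
print(bytes(map(ord, as_text)) == all_bytes)  # True
print(as_text.encode("latin-1") == all_bytes)  # True

# Other encodings do not share this property: U+00F6 becomes two bytes.
print("\xf6".encode("utf-8") == b"\xc3\xb6")  # True
```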
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Greg Ewing wrote:
> Ron Adam wrote:
>
>> My first impression and thoughts were: (and seems incorrect now)
>>
>> bytes(object) -> byte sequence of objects value
>>
>> Basically a "memory dump" of objects value.
>
> As I understand the current intentions, this is correct.
> The bytes constructor would have two different signatures:
>
> (1)  bytes(seq)               --> interprets seq as a sequence of
>                                   integers in the range 0..255,
>                                   exception otherwise
>
> (2a) bytes(str, encoding)     --> encodes the characters of
> (2b) bytes(unicode, encoding)     the string using the specified
>                                   encoding
>
> In (2a) the string would be interpreted as containing
> ascii characters, with an exception otherwise. In 3.0,
> (2a) will disappear leaving only (1) and (2b).

I was presuming it would be done in C code and it will just need a
pointer to the first byte, memchr(), and then read n bytes directly into
a new memory range via memcpy(). But I don't know if that's possible
with Python's object model. (My C skills are a bit rusty as well)

However, if it's done with a Python iterator and then each item is
translated to bytes in a sequence, (much slower), an encoding will need
to be known for it to work correctly. Unfortunately Unicode strings
don't set an attribute to indicate their own encoding. So bytes() can't
just do encoding = s.encoding to find out, it would need to be specified
in this case.

And that should give you a byte object that is equivalent to the bytes
in memory, providing Python doesn't compress data internally to save
space. (?, I don't think it does)

I'd prefer the first version *if possible* because of the performance.

>> And I was thinking a bytes argument of more than one item would indicate
>> a byte sequence.
>>
>> bytes(1,2,3) -> bytes([1,2,3])
>
> But then you have to test the argument in the one-argument
> case and try to guess whether it should be interpreted as
> a sequence or an integer. Best to avoid having to do that.

Yes, I agree.

>> Which is fine... so ???
>>
>> b = bytes(0L) -> bytes([0,0,0,0])
>
> No, bytes(0L) --> TypeError because 0L doesn't implement
> the iterator protocol or the buffer interface.

It wouldn't need it if it was a direct C memory copy.

> I suppose long integers might be enhanced to support the
> buffer interface in 3.0, but that doesn't seem like a good
> idea, because the bytes you got that way would depend on
> the internal representation of long integers. In particular,

Since some longs will be of different length, yes a bytes(0L) could give
differing results on different platforms, but it will always give the
same result on the platform it is run on. I actually think this is a
plus and not a problem. If you are using Python to implement a byte
interface you need to *know* it is different, not have it hidden.

    bytesize = len(bytes(0L))   # find how long a long is

Cheers,
   Ronald Adam
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/15/06, Ron Adam <[EMAIL PROTECTED]> wrote:
> Greg Ewing wrote:
> > Ron Adam wrote:
> >> b = bytes(0L) -> bytes([0,0,0,0])
> >
> > No, bytes(0L) --> TypeError because 0L doesn't implement
> > the iterator protocol or the buffer interface.
>
> It wouldn't need it if it was a direct C memory copy.
>
> > I suppose long integers might be enhanced to support the
> > buffer interface in 3.0, but that doesn't seem like a good
> > idea, because the bytes you got that way would depend on
> > the internal representation of long integers. In particular,
>
> Since some longs will be of different length, yes a bytes(0L) could give
> differing results on different platforms, but it will always give the
> same result on the platform it is run on. I actually think this is a
> plus and not a problem. If you are using Python to implement a byte
> interface you need to *know* it is different, not have it hidden.
>
> bytesize = len(bytes(0L)) # find how long a long is

I believe you're confusing a C long with a Python long. A Python long
is implemented as an array and has variable size.

In any case we already have the struct module:

    >>> import struct
    >>> struct.calcsize('l')
    4

--
Adam Olsen, aka Rhamphoryncus
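The distinction between a C long and a Python long can be sketched with the struct module plus the integer methods available in present-day Python 3. The native size printed first depends on the platform, so no particular value is claimed for it.

```python
import struct

# Native C long: platform dependent (commonly 4 or 8 bytes).
native_size = struct.calcsize("l")
print(native_size)

# With an explicit byte order, struct uses the standard size of 4 bytes.
print(struct.calcsize("<l"))  # 4
print(struct.pack("<l", 0) == b"\x00\x00\x00\x00")  # True

# A Python integer has no fixed width; the caller picks one explicitly.
big = 2 ** 100
needed = (big.bit_length() + 7) // 8
print(big.bit_length(), needed)  # 101 13
print(int.from_bytes(big.to_bytes(needed, "little"), "little") == big)  # True
```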
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Tue, 14 Feb 2006 15:14:07 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote: >On 2/14/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: >> Guido van Rossum wrote: >> > As Phillip guessed, I was indeed thinking about introducing bytes() >> > sooner than that, perhaps even in 2.5 (though I don't want anything >> > rushed). >> >> Hmm, that is probably going to be too early. As the thread shows >> there are lots of things to take into account, esp. since if you >> plan to introduce bytes() in 2.x, the upgrade path to 3.x would >> have to be carefully planned. Otherwise, we end up introducing >> a feature which is meant to prepare for 3.x and then we end up >> causing breakage when the move is finally implemented. > >You make a good point. Someone probably needs to write up a new PEP >summarizing this discussion (or rather, consolidating the agreement >that is slowly emerging, where there is agreement, and summarizing the >key open questions). > >> > Even in Py3k though, the encoding issue stands -- what if the file >> > encoding is Unicode? Then using Latin-1 to encode bytes by default >> > might not by what the user expected. Or what if the file encoding is >> > something totally different? (Cyrillic, Greek, Japanese, Klingon.) >> > Anything default but ASCII isn't going to work as expected. ASCII >> > isn't going to work as expected either, but it will complain loudly >> > (by throwing a UnicodeError) whenever you try it, rather than causing >> > subtle bugs later. >> >> I think there's a misunderstanding here: in Py3k, all "string" >> literals will be converted from the source code encoding to >> Unicode. There are no ambiguities - a Klingon character will still >> map to the same ordinal used to create the byte content regardless >> of whether the source file is encoded in UTF-8, UTF-16 or >> some Klingon charset (are there any ?). > >OK, so a string (literal or otherwise) containing a Klingon character >won't be acceptable to the bytes() constructor in 3.0. 
It shouldn't be >in 2.x either then. > >I still think that someone who types a file in Latin-1 and enters >non-ASCII Latin-1 characters in a string literal and then passes it to >the bytes() constructor might expect to get bytes encoded in Latin-1, >and someone who types a file in UTF-8 and enters non-ASCII Unicode >characters might expect to get UTF-8-encoded bytes. Since they can't >both get what they want, we should disallow both, and only allow >ASCII. ISTM this is a good rule for backwards compatibility for the '...' => u'...' py3k transition. I don't know if you saw my other post, but I was suggesting that bytes(s_or_u) should be mapped to the integer values by the current definition of ord for either str or unicode. UIAM this works when you convert ASCII and will work if you convert the ASCII string to unicode. It will also let you use unicode _currently_ to get past the ASCII restriction, since ord(u) works for all of the first 256 unicode characters. Using those characters in bytes(u'...') works even if your source encoding is utf-8 and contains ascii escapes, e.g. >>> utfsrc = """\ ... # -*- coding: utf-8 -*- ... umlaut_os, values = u'\xf6\\xf6', map(ord, u'\xf6\\xf6') ... """.decode('latin-1').encode('utf-8') Hopefully showing on your screen properly: >>> print utfsrc.decode('utf-8') # -*- coding: utf-8 -*- umlaut_os, values = u'ö\xf6', map(ord, u'ö\xf6') And the repr, where you can see the utf-8 double chars for utf-8 and the \\xf6 ascii escape: >>> print repr(utfsrc) "# -*- coding: utf-8 -*-\numlaut_os, values = u'\xc3\xb6\\xf6', map(ord, u'\xc3\xb6\\xf6')\n" compiling the utf-8 source and executing it: >>> exec compile(utfsrc,'','exec') Good results: >>> umlaut_os, map(hex, values) (u'\xf6\xf6', ['0xf6', '0xf6']) >>> print umlaut_os öö So map(s_or_u) works predictably now, and will not break after py3k unless you use non-ascii in _plain_ str strings now. But in unicode it should be ok even now. 
I think ord is a consistent and handy mapping of characters to bytes, and the fact that it works for unicode for all 256 characters seems to me a boon. (So long as no one gets upset that ord(u) _happens_ to match ord(u.encode('latin-1')) ;-) I didn't see yet where you had ruled against ord mapping of unicode to bytes, so I am hopeful that you will consider it. >> Furthermore, by restricting to ASCII you'd also outrule hex escapes >> which seem to be the natural choice for presenting binary data in >> literals - the Unicode representation would then only be an >> implementation detail of the way Python treats "string" literals >> and a user would certainly expect to find e.g. \x88 in the bytes object >> if she writes bytes('\x88'). > >I guess we'l just have to disappoint her. Too bad for the person who >wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to >write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO >and certainly easier than a *mixture* of hex and ASCII like >'\xabc\xdef'. > >>
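Guido's remark about a *mixture* of hex and ASCII such as '\xabc\xdef' is easy to misread, so here is what that literal actually contains, written with the bytes literals of present-day Python 3 (a sketch of current behaviour, not of anything available in 2.x).

```python
# \x consumes exactly two hex digits, so the trailing "c" and "f" are
# ordinary ASCII characters, not part of the escapes.
mixed = b"\xabc\xdef"
print(len(mixed))  # 4
print(list(mixed) == [0xAB, ord("c"), 0xDE, ord("f")])  # True

# The explicit list form and the escaped literal describe the same bytes.
explicit = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
print(explicit == b"\x12\x34\x56\x78\x9a\xbc\xde\xf0")  # True
print(bytes.fromhex("123456789abcdef0") == explicit)  # True
```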
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On 2/14/06, Neil Schemenauer wrote:
> People could spell it bytes(s.encode('latin-1'))

Guido wrote:
> At the cost of an extra copying step.

I asked:
> ... why not just add some smarts to the bytes constructor?

Guido wrote:
> ... the VM usually keeps an extra reference
> on the stack so the refcount is never 1. But
> you can't rely on that

I did miss this, but _PyString_Resize seems to
work around it, and I'm not sure that the bytes
object can't be just as intimate.

Even if that is insurmountable, bytes objects
could recognize two states -- one normal, and
one for "I'm delegating to a string, and have to
copy to my own buffer before I actually mutate
anything."

Then a new bytes object would still need its
own header, but the data copying could often
be avoided.

But back to the possibility of not creating
even a new object header...

> the str's underlying array is allocated inline
> with the str header, this requires str and
> bytes to have the same object layout. But
> since bytes are mutable, they can't.

Looking at the arraymodule, the only extra fields
in an array are weakrefs, description (which will
no longer be needed) and tracking for the
indirection.

There are even a few extra bytes leftover that
could be used to indicate that ob_item was
redirected later, the way tables do with
small_table.

-jJ
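The "two states" idea (delegate to an immutable string until the first mutation, then copy) can be sketched at the Python level. The LazyBytes class below is entirely hypothetical: it is not an API from this thread or from CPython, and it says nothing about whether the trick is workable inside the C implementation, which is the part Guido disputes.

```python
class LazyBytes:
    """Hypothetical copy-on-write wrapper around an immutable bytes object."""

    def __init__(self, source):
        self._data = source  # shared with the caller, never mutated in place
        self._owned = False  # True once a private copy has been made

    def _ensure_owned(self):
        # The copy is deferred until the first mutation.
        if not self._owned:
            self._data = bytearray(self._data)
            self._owned = True

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        self._ensure_owned()
        self._data[index] = value

    def __len__(self):
        return len(self._data)

    def __bytes__(self):
        return bytes(self._data)


original = b"abc"
lazy = LazyBytes(original)
shared_before = lazy._data is original  # no copy yet
lazy[0] = 0x41  # first write triggers the copy
print(shared_before, bytes(lazy), original)  # True b'Abc' b'abc'
```

Reads never pay for a copy, and the original string is never modified, at the cost of a state check on every write.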
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Ron Adam <[EMAIL PROTECTED]> wrote:
> Greg Ewing wrote:
> > Ron Adam wrote:
> >> b = bytes(0L) -> bytes([0,0,0,0])
> >
> > No, bytes(0L) --> TypeError because 0L doesn't implement
> > the iterator protocol or the buffer interface.
>
> It wouldn't need it if it was a direct C memory copy.

Yes it would. Python long integers are stored as arrays of signed 16-bit short ints. See longintrepr.h from the source.

- Josiah
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Wed, Feb 15, 2006 at 01:38:41PM -0500, Jim Jewett wrote:
> On 2/14/06, Neil Schemenauer wrote:
> > People could spell it bytes(s.encode('latin-1'))
>
> Guido wrote:
> > At the cost of an extra copying step.
>
> I asked:
> > ... why not just add some smarts to the bytes constructor?
>
> Guido wrote:
>
> > ... the VM usually keeps an extra reference
> > on the stack so the refcount is never 1. But
> > you can't rely on that
>
> I did miss this, but _PyString_Resize seems to
> work around it, and I'm not sure that the bytes
> object can't be just as intimate.

No, _PyString_Resize doesn't work around it. _PyString_Resize only works if the refcount is exactly one: only the caller has a reference. And by 'caller', I mean 'the calling C function'. Besides that, the caller takes care to only use _PyString_Resize on strings it created itself. Theoretically it could 'steal' a reference from someplace else, but I haven't seen _PyString_Resize-using code do that, and it would be a recipe for disaster.

--
Thomas Wouters <[EMAIL PROTECTED]>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Ron Adam wrote:
> I was presuming it would be done in C code and it will just need a
> pointer to the first byte, memchr(), and then read n bytes directly into
> a new memory range via memcpy().

If the object supports the buffer interface, it can be done that way. But if not, it would seem to make sense to fall back on the iterator protocol.

> However, if it's done with a Python iterator and then each item is
> translated to bytes in a sequence, (much slower), an encoding will need
> to be known for it to work correctly.

No, it won't. When using the bytes(x) form, encoding has nothing to do with it. It's purely a conversion from one representation of an array of 0..255 to another.

When you *do* want to perform encoding, you use bytes(u, encoding) and say what encoding you want to use.

> Unfortunately Unicode strings
> don't set an attribute to indicate its own encoding.

I think you don't understand what an encoding is. Unicode strings don't *have* an encoding, because they're not encoded! Encoding is what happens when you go from a unicode string to something else.

> Since some longs will be of different length, yes a bytes(0L) could give
> differing results on different platforms,

It's not just a matter of length. I'm not sure of the details, but I believe longs are currently stored as an array of 16-bit chunks, of which only 15 bits are used. I'm having trouble imagining a use for low-level access to that format, other than just treating it as an opaque lump of data for turning back into a long later -- in which case why not just leave it as a long in the first place.

Greg
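The distinction drawn above between bytes(x) and bytes(u, encoding) can be illustrated with the `bytes` type as it later appeared in Python 3. That type postdates this thread, so treat this as a sketch of behaviour that happens to match the description, not as the design under discussion:

```python
# bytes(x): no encoding involved, just ints in 0..255 from an iterable...
from_iterable = bytes(n for n in (72, 105, 255))

# ...or a straight copy out of an object that exposes a buffer.
from_buffer = bytes(memoryview(b"abc"))

# bytes(u, encoding): text has no encoding of its own, so one must be named.
text = "caf\u00e9"
as_latin1 = bytes(text, "latin-1")
as_utf8 = bytes(text, "utf-8")

# Passing text without an encoding is rejected rather than guessed.
try:
    bytes(text)
    needs_encoding = False
except TypeError:
    needs_encoding = True
```

The same text yields different byte sequences under different encodings, which is the sense in which a unicode string is "not encoded" until an encoding is chosen.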
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
Greg Ewing wrote:
> I think you don't understand what an encoding is. Unicode
> strings don't *have* an encoding, because they're not encoded!
> Encoding is what happens when you go from a unicode string
> to something else.

Ah.. ok, my mental picture was a bit off. I had this reversed somewhat.

> It's not just a matter of length. I'm not sure of the
> details, but I believe longs are currently stored as an
> array of 16-bit chunks, of which only 15 bits are used.
> I'm having trouble imagining a use for low-level access
> to that format, other than just treating it as an opaque
> lump of data for turning back into a long later -- in
> which case why not just leave it as a long in the first
> place.

I had a lapse, thinking Python's longs are the same as C longs. I know Python's longs can get much, much bigger. The idea was to be able to show the byte data as is, in whatever form it takes, and not try to change it, whether it's longs, floats, strings, etc.

Cheers,
Ron
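To make the "16-bit chunks, of which only 15 bits are used" description concrete, here is a minimal sketch that splits an integer into base-2**15 digits. The helper names `to_digits` and `from_digits` are made up for illustration, and the sketch only mimics the layout described in this thread; it does not read CPython's actual memory. Later CPython builds may use a different digit width, which `sys.int_info.bits_per_digit` reports.

```python
SHIFT = 15          # usable bits per chunk, per the description above
BASE = 1 << SHIFT   # 32768


def to_digits(n):
    """Split a non-negative int into 15-bit chunks, least significant first."""
    if n < 0:
        raise ValueError("sketch handles non-negative values only")
    digits = []
    while n:
        n, digit = divmod(n, BASE)
        digits.append(digit)
    return digits


def from_digits(digits):
    """Rebuild the integer from its 15-bit chunks."""
    total = 0
    for digit in reversed(digits):
        total = (total << SHIFT) | digit
    return total


zero_digits = to_digits(0)        # no chunks at all in this scheme
small = to_digits(32767)          # fits in a single chunk
carry = to_digits(32768)          # spills into a second chunk
big = 2**64 + 12345
round_trip = from_digits(to_digits(big))
```

Under this layout the raw storage of a long is neither a fixed four bytes nor a C long, so a direct memory copy would not produce something like bytes([0,0,0,0]).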
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Tue, Feb 14, 2006, Guido van Rossum wrote:
>
> Anyway, I'm now convinced that bytes should act as an array of ints,
> where the ints are restricted to range(0, 256) but have type int.

range(0, 255)?

--
Aahz ([EMAIL PROTECTED]) <*> http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Feb 15, 2006, at 6:35 PM, Aahz wrote:
> On Tue, Feb 14, 2006, Guido van Rossum wrote:
>>
>> Anyway, I'm now convinced that bytes should act as an array of ints,
>> where the ints are restricted to range(0, 256) but have type int.
>
> range(0, 255)?

No, Guido was correct. range(0, 256) is [0, 1, 2, ..., 255].

-bob
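The half-open behaviour of range, and the resulting limits on byte values, can be checked directly. The sketch uses the Python 3 `bytes` type, which postdates this thread; `accepts` is a hypothetical helper added only for the demonstration:

```python
values = list(range(0, 256))
first, last, count = values[0], values[-1], len(values)


def accepts(value):
    """Return True if bytes() takes this int as a single byte value."""
    try:
        bytes([value])
        return True
    except ValueError:
        return False


# Indexing a bytes object gives back a plain int.
element_type = type(bytes([65])[0])
```

The stop value 256 is excluded, so the largest permitted element is 255, and elements come back as ordinary ints.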
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]
On Wed, Feb 15, 2006, Bob Ippolito wrote:
> On Feb 15, 2006, at 6:35 PM, Aahz wrote:
>> On Tue, Feb 14, 2006, Guido van Rossum wrote:
>>>
>>> Anyway, I'm now convinced that bytes should act as an array of ints,
>>> where the ints are restricted to range(0, 256) but have type int.
>>
>> range(0, 255)?
>
> No, Guido was correct. range(0, 256) is [0, 1, 2, ..., 255].

My mistake -- I wasn't thinking of the literal Python function.

--
Aahz ([EMAIL PROTECTED]) <*> http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
Re: [Python-Dev] PEP 332 revival in coordination with pep 349? [Was:Re: release plan for 2.5 ?]
Just van Rossum wrote:
> > If bytes support the buffer interface, we get another interesting
> > issue -- regular expressions over bytes. Brr.
>
> We already have that:
>
> >>> import re, array
> >>> re.search('\2', array.array('B', [1, 2, 3, 4])).group()
> array('B', [2])
> >>>
>
> Not sure whether to blame array or re, though...

SRE. iirc, the design rationale was to support RE over mmap'ed regions.
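Matching over buffer-supporting objects, including the mmap case mentioned above, can be sketched as follows. This uses Python 3, where the pattern must itself be a bytes object when searching a buffer; the quoted session above is Python 2 and uses a str pattern. The anonymous mapping and the sample data are assumptions made for the demonstration:

```python
import array
import mmap
import re

pattern = re.compile(b"\x02\x03")

# An array of unsigned bytes exposes a buffer, so it can be searched directly.
arr = array.array("B", [1, 2, 3, 4])
arr_span = pattern.search(arr).span()

# The same pattern works over a memory-mapped region (anonymous here).
with mmap.mmap(-1, 8) as region:
    region.write(b"\x01\x02\x03\x04")
    mm_span = pattern.search(region).span()
```

Both searches report the match by offset into the underlying buffer, without first converting the data to a string.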