Re: [Python-Dev] Dropping bytes support in json
[Antoine Pitrou]
> Besides, Bob doesn't really seem to care about porting to py3k (he
> hasn't said anything about it until now, other than that he didn't
> feel competent to do it).

His actual words were: "I will need some help with 3.0 since I am not well versed in the changes to the C API or Python code for that, but merging for 2.6.1 should be no big deal."

[MvL]
> That is quite unfortunate, and suggests that perhaps the module
> shouldn't have been added to Python in the first place.

Bob participated actively in http://bugs.python.org/issue4136 and was responsive to detailed patch review. He gave a popular talk at PyCon less than two weeks ago. He's not derelict.

> I can understand that you don't want to spend much time on it. How
> about removing it from 3.1? We could re-add it when long-term support
> becomes more likely.

I'm speechless.

Raymond

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 9, 2009 at 07:15, Antoine Pitrou <solip...@pitrou.net> wrote:
> The RFC also specifies a discrimination algorithm for non-supersets of
> ASCII ("Since the first two characters of a JSON text will always be
> ASCII characters [RFC0020], it is possible to determine whether an
> octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by
> looking at the pattern of nulls in the first four octets."), but it is
> not implemented in the json module:
>
> >>> json.loads('"hi"')
> 'hi'
> >>> json.loads(u'"hi"'.encode('utf16'))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
>     return _default_decoder.decode(s)
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
>     raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded

Well, your example is bad in the context of the RFC. The RFC states that JSON-text = object / array, meaning loads for '"hi"' isn't strictly valid. The discrimination algorithm obviously only works in the context of that grammar, where the first character of a document must be { or [ and the next character can only be {, [, ", f, n, t, -, a number, or insignificant whitespace (space, \t, \r, \n).

Cheers,

Dirkjan
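The null-pattern table from RFC 4627 section 3 is mechanical enough to sketch in a few lines. This is an illustrative helper only: nothing like `detect_json_encoding` exists in the json module, which is exactly Antoine's point.

```python
import codecs

def detect_json_encoding(data):
    """Guess the Unicode encoding of a JSON byte stream per RFC 4627
    section 3: explicit BOMs first, then the pattern of null bytes in
    the first four octets (which assumes the document starts with two
    ASCII characters, i.e. follows the JSON-text = object / array rule)."""
    # Check the UTF-32 BOMs before UTF-16, since the UTF-32-LE BOM
    # begins with the UTF-16-LE one.
    for bom, enc in ((codecs.BOM_UTF32_BE, 'utf-32-be'),
                     (codecs.BOM_UTF32_LE, 'utf-32-le'),
                     (codecs.BOM_UTF16_BE, 'utf-16-be'),
                     (codecs.BOM_UTF16_LE, 'utf-16-le'),
                     (codecs.BOM_UTF8, 'utf-8')):
        if data.startswith(bom):
            return enc
    head = data[:4]
    if len(head) == 4:
        if head[:3] == b'\x00\x00\x00':   # 00 00 00 xx
            return 'utf-32-be'
        if head[1:] == b'\x00\x00\x00':   # xx 00 00 00
            return 'utf-32-le'
    if len(head) >= 2:
        if head[0:1] == b'\x00':          # 00 xx 00 xx
            return 'utf-16-be'
        if head[1:2] == b'\x00':          # xx 00 xx 00
            return 'utf-16-le'
    return 'utf-8'                        # xx xx xx xx
```

A caller could then decode the bytes with the detected codec before handing a str to json.loads.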
Re: [Python-Dev] Deprecating PyOS_ascii_formatd
Eric Smith wrote:
> And as a reminder, the py3k-short-float-repr changes are on Rietveld
> at http://codereview.appspot.com/33084/show. So far, no comments.

I skipped over the actual number crunching parts (the test suite will do a better job than I will of telling you whether or not you have those parts correct), but I had a look at the various other changes to make use of the new API. Looks like you were able to delete some fairly respectable chunks of redundant code!

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:
> Guido van Rossum <guido at python.org> writes:
>> I'm kind of surprised that a serialization protocol like JSON
>> wouldn't support reading/writing bytes (as the serialized format --
>> I don't care about having bytes as values, since JavaScript doesn't
>> have something equivalent AFAIK, and hence JSON doesn't allow it
>> IIRC). Marshal and Pickle, for example, *always* treat the serialized
>> format as bytes. And since in most cases it will be sent over a
>> socket, at some point the serialized representation *will* be bytes,
>> I presume. What makes supporting this hard?
>
> It's not hard, it just means a lot of duplicated code if the library
> wants to support both str and bytes in an optimized way as Martin
> alluded to. This duplicated code already exists in the C parts to
> support the 2.x semantics of accepting unicode objects as well as str,
> but not in the Python parts, which explains why the bytes support is
> broken in py3k - in 2.x, the same Python code can be used for str and
> unicode.

This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments).
Barry
Re: [Python-Dev] Dropping bytes support in json
Dirkjan Ochtman <dirkjan at ochtman.nl> writes:
> The RFC states that JSON-text = object / array, meaning loads for
> '"hi"' isn't strictly valid.

Sure, but then:

>>> json.loads('[]')
[]
>>> json.loads(u'[]'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Cheers,

Antoine.
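Until the module grows bytes support, the caller has to do the decoding step itself. A minimal workaround sketch (the wrapper name `loads_any` is hypothetical, not part of the json API):

```python
import json

def loads_any(data, encoding='utf-8'):
    # Hypothetical wrapper: accept bytes by decoding explicitly before
    # handing a str to json.loads. The caller must know or guess the
    # codec; the json module does no sniffing here.
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return json.loads(data)

assert loads_any('[]') == []
# .decode('utf16') consumes the BOM that .encode('utf16') prepends
assert loads_any('[]'.encode('utf16'), 'utf16') == []
```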
Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)
Martin v. Löwis wrote:
> Such a policy would then translate to a dead end for Python 2.x based
> applications.

2.x based applications *are* in a dead end, with the only exit being portage to 3.x. The actual end of the dead end just happens to be in 2013 or so :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 9, 2009 at 13:10, Antoine Pitrou <solip...@pitrou.net> wrote:
> Sure, but then:
>
> >>> json.loads('[]')
> []
> >>> json.loads(u'[]'.encode('utf16'))
> Traceback (most recent call last):
> [traceback snipped]
> ValueError: No JSON object could be decoded

Right. :) Just wanted to point out that your test might not be testing what you want to test.

Cheers,

Dirkjan
Re: [Python-Dev] Dropping bytes support in json
Barry Warsaw wrote:
> On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:
> [...]
> This is an interesting question, and something I'm struggling with for
> the email package for 3.x. It turns out to be pretty convenient to
> have both a bytes and a string API, both for input and output, but I
> think email really wants to be represented internally as bytes. Maybe.
> Or maybe just for content bodies and not headers, or maybe both.
> Anyway, aside from that decision, I haven't come up with an elegant
> way to allow /output/ in both bytes and strings (input is I think
> theoretically easier by sniffing the arguments).

The real problem I came across in storing email in a relational database was the inability to store messages as Unicode.
Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required, or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding, most relational databases won't be able to help unless you store the components as bytes). All in all, as you might expect from a system that's been growing up since 1970 or so, it can be quite intractable.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now! http://pycon.blip.tv/
Re: [Python-Dev] decorator module in stdlib?
Michele Simionato wrote:
> On Wed, Apr 8, 2009 at 7:51 PM, Guido van Rossum <gu...@python.org> wrote:
>> There was a remark (though perhaps meant humorously) in Michele's
>> page about decorators that worried me too: "For instance, typical
>> implementations of decorators involve nested functions, and we all
>> know that flat is better than nested." I find the nested-function
>> pattern very clear and easy to grasp, whereas I find using another
>> decorator (a meta-decorator?) to hide this pattern unnecessarily
>> obscuring what's going on.
>
> I understand your point and I will freely admit that I have always had
> mixed feelings about the advantages of a meta decorator with respect
> to plain simple nested functions. I see pros and contras. If
> functools.update_wrapper could preserve the signature I would probably
> use it over the decorator module.

Yep, update_wrapper was a compromise along the lines of "well, at least we can make sure the relevant metadata refers to the original function rather than the relatively uninteresting wrapper, even if the signature itself is lost". The idea being that you can often figure out the signature from the doc string, even when introspection has been broken by an intervening wrapper.

One of my hopes for PEP 362 was that I would be able to just add __signature__ to the list of copied attributes, but that PEP is currently short a champion to work through the process of resolving the open issues and creating an up to date patch (Brett ended up with too many things on his plate so he wasn't able to do it, and nobody else has offered to take it over).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
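A small sketch of the compromise Nick describes: functools.wraps (which calls update_wrapper) copies the metadata onto the wrapper, but the wrapper's own code object still takes the generic (*args, **kwargs). The `greet` function is purely illustrative. (Later Pythons added a `__wrapped__` attribute that lets inspect.signature see through the wrapper, but that postdates this thread.)

```python
import functools

def noisy(func):
    @functools.wraps(func)           # copies __name__, __doc__, __module__, ...
    def wrapper(*args, **kwargs):    # ...but this generic parameter list remains
        return func(*args, **kwargs)
    return wrapper

@noisy
def greet(name, punctuation='!'):
    """Return a greeting."""
    return 'Hello, ' + name + punctuation

assert greet('world') == 'Hello, world!'
assert greet.__name__ == 'greet'       # metadata refers to the original
# ...yet the wrapper's own code object shows the signature was lost:
assert greet.__code__.co_varnames[:2] == ('args', 'kwargs')
```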
Re: [Python-Dev] Mercurial?
Martin v. Löwis wrote:
> Nick Coghlan wrote:
>> Dirkjan Ochtman wrote:
>>> I have a stab at an author map at
>>> http://dirkjan.ochtman.nl/author-map. Could use some review, but it
>>> seems like a good start.
>> Martin may be able to provide a better list of names based on the
>> checkin name-SSH public key mapping in the SVN setup.
>
> I think the identification in the SSH keys is useless. It contains
> strings like loe...@mira or ncogh...@uberwald, or even multiple of
> them (ba...@wooz, ba...@resist, ...).

Ah, I forgot our SVN accounts weren't linked up to our email addresses. I guess that means the existing list won't be as useful as I thought it might be.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] decorator module in stdlib?
On Thu, Apr 9, 2009 at 2:11 PM, Nick Coghlan <ncogh...@gmail.com> wrote:
> One of my hopes for PEP 362 was that I would be able to just add
> __signature__ to the list of copied attributes, but that PEP is
> currently short a champion to work through the process of resolving
> the open issues and creating an up to date patch (Brett ended up with
> too many things on his plate so he wasn't able to do it, and nobody
> else has offered to take it over).

I am totally ignorant about the internals of Python and I certainly cannot take that role. But I would like to hear from Guido whether he wants to support a __signature__ object or whether he does not care. In the first case I think somebody will take the job; in the second case it is better to reject the PEP and be done with it.
Re: [Python-Dev] Deprecating PyOS_ascii_formatd
Nick Coghlan wrote:
> Eric Smith wrote:
>> And as a reminder, the py3k-short-float-repr changes are on Rietveld
>> at http://codereview.appspot.com/33084/show. So far, no comments.
>
> Looks like you were able to delete some fairly respectable chunks of
> redundant code!

Wait until you see how much nasty code gets deleted when I can actually remove PyOS_ascii_formatd! And thanks for your comments on Rietveld, especially catching the memory leak.

Eric.
Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)
On Thu, Apr 09, 2009, Nick Coghlan wrote:
> Martin v. Löwis wrote:
>> Such a policy would then translate to a dead end for Python 2.x based
>> applications.
>
> 2.x based applications *are* in a dead end, with the only exit being
> portage to 3.x. The actual end of the dead end just happens to be in
> 2013 or so :)

More like 2016 or 2020 -- as of January, my former employer was still using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out in the wilds. The transition to 3.x is more extreme, and lots of people will continue making do for years after any formal support is dropped.

Whether this warrants including PEP 382 in 2.x, I don't know; I still don't really understand this proposal.

--
Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/
"Why is this newsgroup different from all other newsgroups?"
Re: [Python-Dev] decorator module in stdlib?
Michele Simionato wrote:
> On Thu, Apr 9, 2009 at 2:11 PM, Nick Coghlan <ncogh...@gmail.com> wrote:
>> One of my hopes for PEP 362 was that I would be able to just add
>> __signature__ to the list of copied attributes [...]
>
> I am totally ignorant about the internals of Python and I certainly
> cannot take that role. But I would like to hear from Guido whether he
> wants to support a __signature__ object or whether he does not care.
> In the first case I think somebody will take the job; in the second
> case it is better to reject the PEP and be done with it.

I don't recall Guido being opposed when PEP 362 was first being discussed (keeping in mind that was more than 2 years ago, so he's quite entitled to have changed his mind in the meantime!). That said, it's a sensible, largely straightforward idea, and by creating the object lazily it doesn't even have to incur a runtime cost in programs that don't do much introspection.

I think the main problem leading to the current lack of movement on the PEP is that the existing inspect module is good enough for most practical purposes (which are fairly rare in the first place), so this isn't perceived as a huge gain even for the folks that are interested in introspection.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)
Aahz wrote:
> On Thu, Apr 09, 2009, Nick Coghlan wrote:
>> 2.x based applications *are* in a dead end, with the only exit being
>> portage to 3.x. The actual end of the dead end just happens to be in
>> 2013 or so :)
>
> More like 2016 or 2020 -- as of January, my former employer was still
> using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out
> in the wilds.

Indeed - I know of a system that will finally be migrating from Python 2.2 to Python *2.4* later this year :)

> The transition to 3.x is more extreme, and lots of people will
> continue making do for years after any formal support is dropped.

Yeah, I was only referring to the likely minimum time frame that python-dev would continue providing security releases. As you say, the actual 2.x version of the language will live on long after the day we close all remaining 2.x only bug reports and patches as out of date.

> Whether this warrants including PEP 382 in 2.x, I don't know; I still
> don't really understand this proposal.

I'd personally still prefer to keep the guideline that new features that are easy to backport *should* be backported, but that's really a decision for the authors of each new feature.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
[Python-Dev] py3k build erroring out on fileio?
Just to make sure I am not doing something silly: with a configure line as such:

  ./configure --prefix=/home/asmodai/local --with-wide-unicode --with-pymalloc --with-threads --with-computed-gotos

would there be any reason why I am getting the following error with both BSD make and gmake?

  make: don't know how to make ./Modules/_fileio.c. Stop

[Will log an issue if it turns out to, indeed, be a problem with the tree and not me.]

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Forgive us our trespasses, as we forgive those that trespass against us...
Re: [Python-Dev] py3k build erroring out on fileio?
2009/4/9 Jeroen Ruigrok van der Werven <asmo...@in-nomine.org>:
> Just to make sure I am not doing something silly [...] would there be
> any reason why I am getting the following error with both BSD make
> and gmake?
>
>   make: don't know how to make ./Modules/_fileio.c. Stop

It seems your Makefile is outdated. We moved the _fileio.c module around a few days ago, so maybe you just need a make distclean.

--
Regards,
Benjamin
Re: [Python-Dev] py3k build erroring out on fileio?
-On [20090409 15:41], Benjamin Peterson (benja...@python.org) wrote:
> It seems your Makefile is outdated. We moved the _fileio.c module
> around a few days ago, so maybe you just need a make distclean.

Yes, that was the cause. Thanks, Benjamin.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
You yourself, as much as anybody in the entire universe, deserve your love and affection...
Re: [Python-Dev] Dropping bytes support in json
Barry Warsaw <ba...@python.org> wrote:
> Anyway, aside from that decision, I haven't come up with an elegant
> way to allow /output/ in both bytes and strings (input is I think
> theoretically easier by sniffing the arguments).

Probably a good thing. It just promotes more confusion to do things that way, IMO.

Bill
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 09, 2009, John Arbash Meinel wrote:
> PS I'm not yet subscribed to python-dev, so if you could make sure to
> CC me in replies, I would appreciate it.

Please do subscribe to python-dev ASAP; I also suggest that you subscribe to python-ideas, because I suspect that this is sufficiently blue-sky to start there. As always, this is the kind of thing where code trumps gedanken, so you shouldn't expect much activity unless either you are willing to make at least initial attempts at trying out your ideas or someone else just happens to find it interesting. In general, the core Python implementation strives for simplicity, so there's already some built-in pushback.

--
Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/
"Why is this newsgroup different from all other newsgroups?"
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 9, 2009 at 17:31, Aahz <a...@pythoncraft.com> wrote:
> Please do subscribe to python-dev ASAP; I also suggest that you
> subscribe to python-ideas, because I suspect that this is sufficiently
> blue-sky to start there.

It might also be interesting to the unladen-swallow guys.

Cheers,

Dirkjan
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw <ba...@python.org> wrote:
> Anyway, aside from that decision, I haven't come up with an elegant
> way to allow /output/ in both bytes and strings (input is I think
> theoretically easier by sniffing the arguments).

Won't this work? (assuming dumps() always returns a string)

    def dumpb(obj, encoding='utf-8', *args, **kw):
        s = dumps(obj, *args, **kw)
        return s.encode(encoding)

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC
http://stutzbachenterprises.com
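Spelled out as a runnable sketch (`dumpb` is Daniel's proposed name, not a function the json module actually provides):

```python
import json

def dumpb(obj, encoding='utf-8', *args, **kw):
    # Serialize to str first, then encode to the requested codec;
    # the bytes-output API reduces to a one-line wrapper over dumps().
    s = json.dumps(obj, *args, **kw)
    return s.encode(encoding)

out = dumpb({'spam': 1})
assert isinstance(out, bytes)
assert json.loads(out.decode('utf-8')) == {'spam': 1}
```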
[Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)
(email-sig added)

At 08:07 -0400 04/09/2009, Steve Holden wrote:
> Barry Warsaw wrote:
> [...]
>> This is an interesting question, and something I'm struggling with
>> for the email package for 3.x. It turns out to be pretty convenient
>> to have both a bytes and a string API, both for input and output, but
>> I think email really wants to be represented internally as bytes.
>> Maybe. Or maybe just for content bodies and not headers, or maybe
>> both. Anyway, aside from that decision, I haven't come up with an
>> elegant way to allow /output/ in both bytes and strings (input is I
>> think theoretically easier by sniffing the arguments).
>
> The real problem I came across in storing email in a relational
> database was the inability to store messages as Unicode. Some messages
> have a body in one encoding and an attachment in another, so the only
> ways to store the messages are either as a monolithic bytes string
> that gets parsed when the individual components are required or as a
> sequence of components in the database's preferred encoding (if you
> want to keep the original encoding most relational databases won't be
> able to help unless you store the components as bytes).
> [...]

I found it confusing myself, and did it wrong for a while. Now, I understand that messages come over the wire as bytes, either 7-bit US-ASCII or 8-bit whatever, and are parsed at the receiver. I think of the database as a "wire to the future", and store the data as bytes (a BLOB), letting the future receiver parse them as it did the first time, when I cleaned the message. Data I care to query is extracted into fields (in UTF-8, which is what I usually use for char fields).

I have no need to store messages as Unicode, and they aren't Unicode anyway. I have no need ever to flatten a message to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw 8-bit data. If you need the data from the message, by all means extract it and store it in whatever form is useful to the purpose of the database.
If you need the entire message, store it intact in the database, as the bytes it is. Email isn't Unicode any more than a JPEG or other image types (often payloads in a message) are Unicode.

--
TonyN.:' mailto:tonynel...@georgeanelson.com
      ' http://www.georgeanelson.com/
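The "wire to the future" approach can be sketched with sqlite3 (the table and column names here are illustrative; the thread is about relational storage in general, and Steve's PostgreSQL experience below shows BLOB ergonomics vary by database):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE messages (id INTEGER PRIMARY KEY, raw BLOB)')

# The message stays bytes end to end: a body in one encoding and an
# attachment in another never have to be reconciled into one text codec.
raw = (b'Subject: test\r\n'
       b'Content-Type: text/plain; charset=latin-1\r\n'
       b'\r\n'
       b'na\xefve body\r\n')          # latin-1 bytes, not valid UTF-8
conn.execute('INSERT INTO messages (raw) VALUES (?)', (raw,))

(stored,) = conn.execute('SELECT raw FROM messages WHERE id = 1').fetchone()
assert stored == raw   # byte-for-byte round trip; parse again on the way out
```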
Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)
Tony Nelson wrote:
> (email-sig added)
>
> At 08:07 -0400 04/09/2009, Steve Holden wrote:
> [...]
> I found it confusing myself, and did it wrong for a while. Now, I
> understand that messages come over the wire as bytes, either 7-bit
> US-ASCII or 8-bit whatever, and are parsed at the receiver. I think of
> the database as a "wire to the future", and store the data as bytes (a
> BLOB), letting the future receiver parse them as it did the first
> time, when I cleaned the message.
> [...]
> If you need the entire message, store it intact in the database, as
> the bytes it is. Email isn't Unicode any more than a JPEG or other
> image types (often payloads in a message) are Unicode.

This is all great, and I did quite quickly realize that the best approach was to store the mails in their network byte-stream format as bytes. The approach was negated in my own case because of PostgreSQL's execrable BLOB-handling capabilities. I took a look at the escaping they required, snorted with derision and gave it up as a bad job.

PostgreSQL strongly encourages you to store text as encoded columns. Because emails lack a single encoding, it turns out this is a most inconvenient storage type for them. Sadly BLOBs are such a pain in PostgreSQL that it's easier to store the messages in external files and just use the relational database to index those files to retrieve content, so that's what I ended up doing.

regards
Steve

--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now! http://pycon.blip.tv/
Re: [Python-Dev] Rethinking intern() and its data structure
Hi John,

On Thu, Apr 9, 2009 at 8:02 AM, John Arbash Meinel <j...@arbash-meinel.com> wrote:
> I've been doing some memory profiling of my application, and I've
> found some interesting results with how intern() works. I was pretty
> surprised to see that the interned dict was actually consuming a
> significant amount of total memory. To give the specific values, after
> doing: bzr branch A B of a small project, the total memory consumption
> is ~21MB
> [snip]
> Anyway, I think the internals of intern() could be done a bit better.
> Here are some concrete things:
> [snip]

Memory usage is definitely something we're interested in improving. Since you've already looked at this in some detail, could you try implementing one or two of your ideas and see if it makes a difference in memory consumption? Changing from a dict to a set looks promising, and should be a fairly self-contained way of starting on this. If it works, please post the patch on http://bugs.python.org with your results and assign it to me for review.

Thanks,
Collin Winter
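The current C-level interned table can be modelled in Python as a dict whose keys and values are the same object; the dict-to-set idea is to stop storing that second, redundant reference per entry. A pure-Python sketch of the behavior (not the C patch itself; the class name is ours):

```python
class InternTable:
    """Pure-Python model of intern(): map each string to its
    first-seen canonical copy so equal strings share one object."""

    def __init__(self):
        # Today's C layout modelled directly: key is value. A C-level
        # set would keep one string pointer per slot instead of two,
        # which is where the proposed memory saving comes from.
        self._canonical = {}

    def intern(self, s):
        # setdefault returns the stored copy if an equal string was
        # already interned, otherwise stores and returns this one
        return self._canonical.setdefault(s, s)

table = InternTable()
# build two distinct-but-equal strings at runtime (join defeats
# compile-time constant folding)
a = table.intern(''.join(['py', 'thon']))
b = table.intern(''.join(['pyth', 'on']))
assert a == b and a is b   # equal strings collapse to one object
```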
Re: [Python-Dev] Rethinking intern() and its data structure
... Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things: [snip] Memory usage is definitely something we're interested in improving. Since you've already looked at this in some detail, could you try implementing one or two of your ideas and see if it makes a difference in memory consumption? Changing from a dict to a set looks promising, and should be a fairly self-contained way of starting on this. If it works, please post the patch on http://bugs.python.org with your results and assign it to me for review. Thanks, Collin Winter (I did end up subscribing, just with a different email address :) What is the best branch to start working from? trunk? John =:-
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 9, 2009 at 9:34 AM, John Arbash Meinel john.arbash.mei...@gmail.com wrote: ... Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things: [snip] Memory usage is definitely something we're interested in improving. Since you've already looked at this in some detail, could you try implementing one or two of your ideas and see if it makes a difference in memory consumption? Changing from a dict to a set looks promising, and should be a fairly self-contained way of starting on this. If it works, please post the patch on http://bugs.python.org with your results and assign it to me for review. Thanks, Collin Winter (I did end up subscribing, just with a different email address :) What is the best branch to start working from? trunk? That's a good place to start, yes. If the idea works well, we'll want to port it to the py3k branch, too, but that can wait. Collin
Re: [Python-Dev] Rethinking intern() and its data structure
John Arbash Meinel wrote: When I looked at the actual references from interned, I saw mostly variable names. Considering that every variable goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. (I saw a whole lot of strings as the error codes in win32all / winerror.py, and windows error codes tend to be longer-than-average variable names.) I've read your posting twice but I'm still not sure if you are aware of the most important feature of interned strings. In the first place, interning is not about saving some bytes of memory but a speed optimization. Interned strings can be compared with a simple and fast pointer comparison. With interned strings you can simply write:

    char *a, *b;
    if (a == b) { ... }

Instead of:

    char *a, *b;
    if (strcmp(a, b) == 0) { ... }

A compiler can optimize the pointer comparison much better than a function call. Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things: a) Don't keep a double reference to both key and value to the same object (1 pointer per entry), this could be as simple as using a Set() instead of a dict() b) Don't cache the hash key in the set, as strings already cache them. (1 long per entry). This is a big win for space, but would need to be balanced against lookup and collision resolving speed. My guess is that reducing the size of the set will actually improve speed more, because more items can fit in cache. It depends on how many times you need to resolve a collision. If the string hash is sufficiently spread out, and the load factor is reasonable, then likely when you actually find an item in the set, it will be the item you want, and you'll need to bring the string object into cache anyway, so that you can do a string comparison (rather than just a hash comparison.)
c) Use the existing lookup function one time. (PySet->lookup()) Sets already have a lookup which is optimized for strings, and returns a pointer to where the object would go if it exists. Which means the intern() function can do a single lookup resolving any collisions, and return the object or insert without doing a second lookup. d) Having a special structure might also allow for separate optimizing of things like 'default size', 'grow rate', 'load factor', etc. A lot of this could be tuned specifically knowing that we really only have 1 of these objects, and it is going to be pointing at a lot of strings that are 50 bytes long. If hashes of variable name strings are well distributed, we could probably get away with a load factor of 2. If we know we are likely to have lots and lots that never go away (you rarely *unload* modules, and all variable names are in the intern dict), that would suggest having a large initial size, and probably a wide growth factor to avoid spending a lot of time resizing the set. I agree that a dict is not the most memory efficient data structure for interned strings. However dicts are extremely well tested and highly optimized. Any specialized data structure needs to be designed and tested very carefully. If you happen to break the interning system it's going to lead to rather nasty and hard to debug problems. e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').) Python 3.0 uses unicode for all names. You have to design something that can be adapted to unicode, too. By the way, do you know that dicts have an optimized lookup function for strings? It's called lookdict_unicode / lookdict_string. Also note that the performance of the interned dict gets even worse on 64-bit platforms.
Where the size of a 'dictentry' doubles, but the average length of a variable name wouldn't change. Anyway, I would be happy to implement something along the lines of a StringSet, or maybe the InternSet, etc. I just wanted to check if people would be interested or not. Since interning is mostly used in the core and extension modules you might want to experiment with a different growth rate. The interning data structure could start with a larger value and have a slower, non-progressive growth rate. Christian
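A back-of-envelope sketch of the saving under discussion (my own illustration, not code from the thread; it assumes a 64-bit build and the pre-compact-dict layout of that era, where a dict slot holds hash, key, and value, while a set slot holds only hash and key):

```python
# Assumed layout: 2009-era CPython on a 64-bit build.
# dict slot = (hash, key, value); set slot = (hash, key).
slots = 65536                       # table size once past ~21800 entries
word = 8                            # bytes per slot field on 64-bit
dict_table = slots * 3 * word       # 1572864 bytes, ~1.5 MB
set_table = slots * 2 * word        # 1048576 bytes, 1.0 MB
print(dict_table, set_table)
```

These figures line up with the 1.5 MB-to-1.0 MB drop cited later in the thread, which is where the "save 1/3rd of the memory" claim comes from.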
Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)
(email-sig dropped, as I didn't see Steve Holden's message there) At 12:20 -0400 04/09/2009, Steve Holden wrote: Tony Nelson wrote: ... If you need the data from the message, by all means extract it and store it in whatever form is useful to the purpose of the database. If you need the entire message, store it intact in the database, as the bytes it is. Email isn't Unicode any more than a JPEG or other image types (often payloads in a message) are Unicode. This is all great, and I did quite quickly realize that the best approach was to store the mails in their network byte-stream format as bytes. The approach was negated in my own case because of PostgreSQL's execrable BLOB-handling capabilities. I took a look at the escaping they required, snorted with derision and gave it up as a bad job. ... I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I agree that having to import them from a file is awful. Also, there appears to be a severe limit on the size of character data fields, so storing in Base64 is out. About the only thing to do then is to use external storage for the BLOBs. Still, email seems to demand such binary storage, whether all databases provide it or not. -- TonyN.:' mailto:tonynel...@georgeanelson.com ' http://www.georgeanelson.com/
Re: [Python-Dev] BLOBs in Pg (was: email package Bytes vs Unicode)
On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote: I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I think it has - the BYTEA data type. Oleg. -- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Rethinking intern() and its data structure
Christian Heimes wrote: John Arbash Meinel wrote: When I looked at the actual references from interned, I saw mostly variable names. Considering that every variable goes through the python intern dict. And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. (I saw a whole lot of strings as the error codes in win32all / winerror.py, and windows error codes tend to be longer-than-average variable names.) I've read your posting twice but I'm still not sure if you are aware of the most important feature of interned strings. In the first place, interning is not about saving some bytes of memory but a speed optimization. Interned strings can be compared with a simple and fast pointer comparison. With interned strings you can simply write:

    char *a, *b;
    if (a == b) { ... }

Instead of:

    char *a, *b;
    if (strcmp(a, b) == 0) { ... }

A compiler can optimize the pointer comparison much better than a function call. Certainly. But there is a cost associated with calling intern() in the first place. You created a string, and you are now trying to de-dup it. That cost is both in the memory to track all strings interned so far, and the cost to do a dict lookup. And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. I'll also note that increasing memory does have a semi-direct effect on performance, because more memory requires more time to bring memory back and forth from main memory to CPU caches. ... I agree that a dict is not the most memory efficient data structure for interned strings. However dicts are extremely well tested and highly optimized. Any specialized data structure needs to be designed and tested very carefully. If you happen to break the interning system it's going to lead to rather nasty and hard to debug problems. Sure.
My plan was to basically take the existing Set/Dict design, and just tweak it slightly for the expected operations of interned. e) How tuned is String.hash() for the fact that most of these strings are going to be ascii text? (I know that python wants to support non-ascii variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').) Python 3.0 uses unicode for all names. You have to design something that can be adapted to unicode, too. By the way, do you know that dicts have an optimized lookup function for strings? It's called lookdict_unicode / lookdict_string. Sure, but so does PySet. I'm not sure about lookset_unicode, but I would guess that exists or should exist for py3k. Also note that the performance of the interned dict gets even worse on 64-bit platforms. Where the size of a 'dictentry' doubles, but the average length of a variable name wouldn't change. Anyway, I would be happy to implement something along the lines of a StringSet, or maybe the InternSet, etc. I just wanted to check if people would be interested or not. Since interning is mostly used in the core and extension modules you might want to experiment with a different growth rate. The interning data structure could start with a larger value and have a slower, non-progressive growth rate. Christian I'll also mention that there are other uses for intern() where it is uniquely suitable. Namely, if you are parsing lots of text with redundant strings, it is a way to decrease total memory consumption. (And potentially speed up future comparisons, etc.) The main reason why intern() is useful for this is that it doesn't make strings immortal, as would happen if you used some other structure, because strings know about the interned object. The options for a 3rd-party structure fall into something like: 1) A cache that makes the strings immortal. (IIRC this is what older versions of Python did.)
2) A cache that is periodically walked to see if any of the objects are no longer externally referenced. The main problem here is that walking is O(all-objects), whereas doing the checking at refcount=0 time means you only check objects when you think the last reference has gone away. 3) Hijacking PyStringType->dealloc, so that when the refcount goes to 0 and Python wants to destroy the string, you then trigger your own cache to look and see if it should remove the object. Even further, you either have to check on every string dealloc, or re-use PyStringObject->ob_sstate to track that you have placed this string into your custom structure. Which would preclude ever calling intern() on this string, because intern() doesn't just check a couple bits, it looks at the entire ob_sstate value. I think you could make it work, such that if your custom cache had set some values, then intern() would just
Re: [Python-Dev] BLOBs in Pg
Oleg Broytmann wrote: On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote: I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I think it has - BYTEA data type. But the Python DB adapters appear to require some fairly hairy escaping of the data to make it usable with the cursor execute() method. IMHO you shouldn't have to escape data that is passed for insertion via a parameterized query. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/
Re: [Python-Dev] Rethinking intern() and its data structure
Alexander Belopolsky wrote: On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel j...@arbash-meinel.com wrote: ... a) Don't keep a double reference to both key and value to the same object (1 pointer per entry), this could be as simple as using a Set() instead of a dict() There is a rejected patch implementing just that: http://bugs.python.org/issue1507011 . Thanks for the heads up. So reading that thread, the final reason it was rejected was two-part: Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue. That specific argument was invalid, because the patch just changed the refcount trickery to use +- 1. And I'm pretty sure Alexander's argument was just that +- 2 was weird, not that the weakref behavior was bad. The other argument against the patch was based on the idea that: The operation 'give me the member equal but not identical to E' is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is *not* a dict with key==value. I don't know if there was any consensus reached on this, since only Martin responded this way. I can say that for my 'do some work' test with a medium-size code base, the overhead of interned as a dictionary was 1.5MB out of 20MB total memory. Simply changing it to a Set would drop this to 1.0MB. I have no proof about the impact on performance, since I haven't benchmarked it yet. Changing it to a StringSet could further drop it to 0.5MB. I would guess that any performance impact would depend on whether the total size of 'interned' would fit inside L2 cache or not. There is a small bug in the original patch when adding the string to the set failed.
Namely, it would return t == NULL, which would be t != s, and the intern-in-place would end up setting your pointer to NULL rather than doing nothing and clearing the error code. So I guess some of it comes down to whether loweis would also reject this change on the basis that mathematically a set is not a dict. Though his claim was that nobody else was speaking in favor of the patch, while at this point at least Collin Winter has expressed some interest. John =:-
Re: [Python-Dev] Dropping bytes support in json
This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments). If you allow for content-transfer-encoding: 8bit, I think there is just no way to represent email as text. You have to accept conversion to, say, base64 (or quoted-unreadable) when converting an email message to text. Regards, Martin
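A small sketch of Martin's point (my own illustration, not code from the thread): a body transferred with content-transfer-encoding: 8bit is raw bytes with no faithful ASCII text form, but converting it to base64 yields pure-ASCII text that round-trips losslessly.

```python
import base64

raw_body = "héllo wörld\n".encode("latin-1")   # a hypothetical 8bit payload

# The raw bytes are not valid ASCII, so there is no text representation
# without first picking some transfer encoding.
try:
    raw_body.decode("ascii")
    ascii_clean = True
except UnicodeDecodeError:
    ascii_clean = False

encoded = base64.encodebytes(raw_body)          # ASCII-only text form
decoded = base64.decodebytes(encoded)           # lossless round-trip
print(ascii_clean, decoded == raw_body)         # False True
```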
Re: [Python-Dev] BLOBs in Pg (was: email package Bytes vs Unicode)
At 21:24 +0400 04/09/2009, Oleg Broytmann wrote: On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote: I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I think it has - BYTEA data type. So it does; I see that now that I've opened up the PostgreSQL docs. I don't find escaping data to be a problem -- I do it for all untrusted data. So, after all, there isn't an example of a database that makes onerous the storing of email and other such byte-oriented data, and Python's email package has no need for workarounds in that area. -- TonyN.:' mailto:tonynel...@georgeanelson.com ' http://www.georgeanelson.com/
Re: [Python-Dev] Rethinking intern() and its data structure
So I guess some of it comes down to whether loweis would also reject this change on the basis that mathematically a set is not a dict. I'd like to point out that this was not the reason to reject it. Instead, this (or, the opposite of it) was given as a reason why this patch should be accepted (in msg50482). I found that a weak rationale for making that change, in particular because I think the rationale is incorrect. I like your rationale (save memory) much more, and was asking in the tracker for specific numbers, which weren't forthcoming. Though his claim was that nobody else was speaking in favor of the patch, while at this point at least Collin Winter has expressed some interest. Again, at that point in the tracker, none of the other committers had spoken in favor of the patch. Since I wasn't convinced of its correctness, and nobody else (whom I trust) had reviewed it as correct, I rejected it. Now that you brought up specific numbers, I tried to verify them, and found them correct (although a bit unfortunate), please see my test script below. Up to 21800 interned strings, the dict takes (only) 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k interned strings is typical, I still don't know. Wrt. your proposed change, I would be worried about maintainability, in particular if it would copy parts of the set implementation. Regards, Martin

    import gc, sys

    def find_interned_dict():
        cand = None
        for o in gc.get_objects():
            if not isinstance(o, dict):
                continue
            if 'find_interned_dict' not in o:
                continue
            for k, v in o.iteritems():
                if k is not v:
                    break
            else:
                assert not cand
                cand = o
        return cand

    d = find_interned_dict()
    print len(d), sys.getsizeof(d)
    l = []
    for i in range(2):
        if i % 100 == 0:
            print len(d), sys.getsizeof(d)
        l.append(intern(repr(i)))
Re: [Python-Dev] calling dictresize outside dictobject.c
Hi Dan, Thanks for your interest. 2009/4/6 Dan Schult dsch...@colgate.edu: Hi, I'm trying to write a C extension which is a subclass of dict. I want to do something like a setdefault() but with a single lookup. Looking through the dictobject code, the three workhorse routines lookdict, insertdict and dictresize are not available directly for functions outside dictobject.c, but I can get at lookdict through dict->ma_lookup(). So I use lookdict to get the PyDictEntry (call it ep) I'm looking for. The comments for lookdict say ep is ready to be set... so I do that. Then I check whether the dict needs to be resized--following the nice example of PyDict_SetItem. But I can't call dictresize to finish off the process. Should I be using PyDict_SetItem directly? No... it does its own lookup. I don't want a second lookup! I already know which entry will be filled. So then I look at the code for setdefault and it also does a double lookup for checking and setting an entry. What subtle issue am I missing? Why does setdefault do a double lookup? More globally, why isn't dictresize available through the C-API? Because it's not useful outside the intimate implementation details of dictobject.c. If there isn't a reason to do a double lookup I have a patch for setdefault, but I thought I should ask here first. Raymond tells me the cost of the second lookup is negligible because of caching, but PyObject_Hash needn't be called twice. He's working on a patch later today. -- Regards, Benjamin
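At the Python level, the single-call pattern Dan is after is what dict.setdefault expresses. A sketch of the two interning styles (my own illustration using a plain dict, not the C internals under discussion):

```python
def intern_get_set(table, s):
    # The two-lookup pattern intern() used at the time:
    # a get(), then on a miss a separate insertion.
    t = table.get(s)
    if t is not None:
        return t
    table[s] = s
    return s

def intern_setdefault(table, s):
    # setdefault expresses the same thing as one call; after the
    # patch Raymond mentions, it also hashes the key only once.
    return table.setdefault(s, s)

table = {}
a = "alpha"
b = "".join(["al", "pha"])        # equal to a, but a distinct object
assert intern_setdefault(table, a) is a
assert intern_setdefault(table, b) is a   # the first object wins
```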
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou solip...@pitrou.net wrote: As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*) (*) http://www.ietf.org/rfc/rfc4627.txt That is one short and sweet RFC. :-) The RFC also specifies a discrimination algorithm for non-supersets of ASCII (“Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.”), but it is not implemented in the json module: Given that the RFC specifies that the encoding used should be one of the encodings defined by Unicode, wouldn't it be a better idea to remove the unicode support, instead? To me, it would make sense to use the detection algorithms for Unicode to sniff the encoding of the JSON stream and then use the detected encoding to decode the strings embedded in the JSON stream. Cheers, -- Alexandre
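A sketch of that discrimination algorithm (my own illustration; as noted above, the json module does not implement it): with the grammar restricted to object/array, the first two characters are ASCII, so the null-byte pattern of the first four octets pins down the encoding.

```python
def detect_json_encoding(data):
    # RFC 4627, section 3: 00 marks a null octet, xx a non-null one.
    #   00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
    #   00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
    # Anything else is UTF-8 (the default).
    if len(data) >= 4:
        pattern = [byte == 0 for byte in data[:4]]
        if pattern == [True, True, True, False]:
            return "utf-32-be"
        if pattern == [True, False, True, False]:
            return "utf-16-be"
        if pattern == [False, True, True, True]:
            return "utf-32-le"
        if pattern == [False, True, False, True]:
            return "utf-16-le"
    return "utf-8"

for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    doc = '{"hi": 1}'.encode(enc)          # no BOM for the -be/-le codecs
    print(enc, detect_json_encoding(doc))
```

Note this only works for documents obeying the strict JSON-text = object / array grammar, which is exactly Dirkjan's point about loads('"hi"') elsewhere in the thread.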
Re: [Python-Dev] Rethinking intern() and its data structure
... I like your rationale (save memory) much more, and was asking in the tracker for specific numbers, which weren't forthcoming. ... Now that you brought up specific numbers, I tried to verify them, and found them correct (although a bit unfortunate), please see my test script below. Up to 21800 interned strings, the dict takes (only) 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k interned strings is typical, I still don't know. Given that every variable name in any file is interned, it can grow pretty rapidly. As an extreme case, consider the file win32/lib/winerror.py which tracks all possible win32 errors:

    import winerror
    print len(winerror.__dict__)
    1872

So a single error file has 1.9k strings. My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise your code looks correct. If all I do is find the interned dict, I see:

    print len(d)
    5037

So stock python, without importing much extra (just os, sys, gc, etc.) has almost 5k strings already. I don't have a great regex yet for just extracting how many unique strings there are in a given bit of source code. However, if I do:

    import gc, sys

    def find_interned_dict():
        cand = None
        for o in gc.get_objects():
            if not isinstance(o, dict):
                continue
            if 'find_interned_dict' not in o:
                continue
            for k, v in o.iteritems():
                if k is not v:
                    break
            else:
                assert not cand
                cand = o
        return cand

    d = find_interned_dict()
    print len(d)

    # Just import a few of the core structures
    from bzrlib import branch, repository, workingtree, builtins
    print len(d)

I start at 5k strings, and after just importing the important bits of bzrlib, I'm at 19,316. Now, the bzrlib source code isn't particularly huge. It is about 3.7MB / 91k lines of .py files (that is, without importing the test suite). Memory consumption with just importing bzrlib shows up at 15MB, with 300kB taken up by the intern dict.
If I then import some extra bits of bzrlib, like http support, ftp support, and sftp support (which brings in python's httplib, and paramiko, and its ssh/sftp implementation), I'm up to:

    print len(d)
    25186

Memory has jumped to 23MB (interned is now 1.57MB), and I haven't actually done anything but import python code yet. If I sum the size of the PyString objects held in intern() it amounts to 940KB, though they refer to only 335KB of char data (or an average of 13 bytes per string). Wrt. your proposed change, I would be worried about maintainability, in particular if it would copy parts of the set implementation. Right, so in the first part, I would just use Set(), as it could then save 1/3rd of the memory it uses today. (Dropping down to 1MB from 1.5MB.) I don't have numbers on how much that would improve CPU times; I would imagine improving 'intern()' would impact import times more than run times, simply because import time is interning a *lot* of strings. Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app. John =:-
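The 940KB-versus-335KB gap John measures is per-object overhead. A hypothetical helper (my own, written for Python 3 where sys.getsizeof exists) makes the same object-size versus character-data comparison for any batch of strings:

```python
import sys

def string_storage(strings):
    # Total object footprint (header, cached hash, character buffer)
    # versus the raw character data the strings actually carry.
    obj_bytes = sum(sys.getsizeof(s) for s in strings)
    char_bytes = sum(len(s) for s in strings)
    return obj_bytes, char_bytes

obj_bytes, char_bytes = string_storage("alpha beta gamma delta".split())
print(obj_bytes, char_bytes)   # overhead dominates for short names
```

For identifier-length strings the object overhead dwarfs the character data, which is why shrinking the table around them is worthwhile at all.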
Re: [Python-Dev] Dropping bytes support in json
I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely. I'm speechless. It seems that my statement has surprised you, so let me explain: I think we should refrain from making design decisions (such as API decisions) without Bob's explicit consent, unless we assign a new maintainer for the simplejson module (perhaps just for the 3k branch, which perhaps would be a fork from Bob's code). Antoine suggests that Bob did not comment on the issues at hand; therefore, we should not proceed with the proposed design. Since the 3.1 release is only a few weeks ahead, we have the choice of either shipping with the broken version that is currently in the 3k branch, or dropping the module from the 3k branch. I believe our users are better served by not having to waste time with a module that doesn't quite work, or may change. Regards, Martin
Re: [Python-Dev] Rethinking intern() and its data structure
I don't have numbers on how much that would improve CPU times, I would imagine improving 'intern()' would impact import times more than run times, simply because import time is interning a *lot* of strings. Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app. Maybe I misunderstand your proposed change: how could the representation of the interning dict possibly change the runtime of interning? (let alone significantly) Regards, Martin
Re: [Python-Dev] Dropping bytes support in json
Alexandre Vassalotti wrote: On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou solip...@pitrou.net wrote: As for reading/writing bytes over the wire, JSON is often used in the same context as HTML: you are supposed to know the charset and decode/encode the payload using that charset. However, the RFC specifies a default encoding of utf-8. (*) (*) http://www.ietf.org/rfc/rfc4627.txt That is one short and sweet RFC. :-) It is indeed well-specified. Unfortunately, it only talks about the application/json type; the pre-existing other versions of json in MIME types vary widely, such as text/plain (possibly with a charset= parameter), text/json, or text/javascript. For these, the RFC doesn't apply. Given the RFC specifies that the encoding used should be one of the encodings defined by Unicode, wouldn't be a better idea to remove the unicode support, instead? To me, it would make sense to use the detection algorithms for Unicode to sniff the encoding of the JSON stream and then use the detected encoding to decode the strings embed in the JSON stream. That might be reasonable. (but then, I also stand by my view that we shouldn't proceed without Bob's approval). Regards, Martin
Re: [Python-Dev] Rethinking intern() and its data structure
Martin v. Löwis wrote: I don't have numbers on how much that would improve CPU times, I would imagine improving 'intern()' would impact import times more than run times, simply because import time is interning a *lot* of strings. Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app. Maybe I misunderstand your proposed change: how could the representation of the interning dict possibly change the runtime of interning? (let alone significantly) Regards, Martin Decreasing memory consumption lets more things fit in cache. Once the size of 'interned' is greater than what fits into L2 cache, you start paying the cost of a full memory fetch, which is usually measured in 100s of cpu cycles. Avoiding double lookups in the dictionary would be less overhead, though the second lookup is probably pretty fast if there are no collisions, since everything would already be in the local CPU cache. If we were dealing in objects that were KB in size, it wouldn't matter. But as the intern dict quickly gets into MB, it starts to make a bigger difference. How big of a difference would be very CPU and dataset size specific. But certainly caches make certain things much faster, and once you overflow a cache, performance can take a surprising turn. So my primary goal is certainly a decrease of memory consumption. I think it will have a small knock-on effect of improving performance, though I don't have concrete numbers to give. Also, consider that resizing has to evaluate every object, thus paging in all X bytes, and assigning to another 2X bytes. Cutting X by (potentially) 3 would probably have a small but measurable effect. John =:-
Re: [Python-Dev] BLOBs in Pg
Tony Nelson wrote: At 21:24 +0400 04/09/2009, Oleg Broytmann wrote: On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote: I use MySQL, but sort of intend to learn PostgreSQL. I didn't know that PostgreSQL has no real support for BLOBs. I think it has - the BYTEA data type. So it does; I see that now that I've opened up the PostgreSQL docs. I don't find escaping data to be a problem -- I do it for all untrusted data. You shouldn't have to when you are using parameterized queries. So, after all, there isn't an example of a database that makes storing email and other such byte-oriented data onerous, and Python's email package has no need for workarounds in that area. Create a table: CREATE TABLE tst ( id serial, byt bytea, PRIMARY KEY (id) ) WITH (OIDS=FALSE); ALTER TABLE tst OWNER TO steve; The following program prints 0: import psycopg2 as db conn = db.connect(database="maildb", user="@@@", password="@@@", host="localhost", port=5432) curs = conn.cursor() curs.execute("DELETE FROM tst") curs.execute("INSERT INTO tst (byt) VALUES (%s)", (''.join(chr(i) for i in range(256)), )) conn.commit() curs.execute("SELECT byt FROM tst") for st, in curs.fetchall(): print len(st) If I change the data to use range(1, 256) I get a ProgrammingError from PostgreSQL: invalid input syntax for type bytea. If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. My current belief is that this something is fairly deeply embedded in the PostgreSQL engine. No syntax should be necessary. I suppose if we have to go round again on this we should take it to email as we have gotten pretty far off-topic for python-dev. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! 
http://pycon.blip.tv/
Re: [Python-Dev] BLOBs in Pg
On Thu, Apr 09, 2009, Steve Holden wrote: import psycopg2 as db conn = db.connect(database="maildb", user="@@@", password="@@@", host="localhost", port=5432) curs = conn.cursor() curs.execute("DELETE FROM tst") curs.execute("INSERT INTO tst (byt) VALUES (%s)", (''.join(chr(i) for i in range(256)), )) conn.commit() curs.execute("SELECT byt FROM tst") for st, in curs.fetchall(): print len(st) If I change the data to use range(1, 256) I get a ProgrammingError from PostgreSQL: invalid input syntax for type bytea. If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. My current belief is that this something is fairly deeply embedded in the PostgreSQL engine. No syntax should be necessary. You're not using a parameterized query. I suggest you post to c.l.py for more information. ;-) -- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/ Why is this newsgroup different from all other newsgroups?
Re: [Python-Dev] BLOBs in Pg
On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote: If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. import psycopg2 con = psycopg2.connect(database="test") cur = con.cursor() cur.execute("CREATE TABLE test (id serial, data BYTEA)") cur.execute('INSERT INTO test (data) VALUES (%s)', (psycopg2.Binary(''.join([chr(i) for i in range(256)])),)) cur.execute('SELECT * FROM test ORDER BY id') for rec in cur.fetchall(): print rec[0], type(rec[1]), repr(str(rec[1])) Result: 1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff' What am I doing wrong? Oleg. -- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 9, 2009 at 1:05 PM, Martin v. Löwis mar...@v.loewis.de wrote: I can understand that you don't want to spend much time on it. How about removing it from 3.1? We could re-add it when long-term support becomes more likely. I'm speechless. It seems that my statement has surprised you, so let me explain: I think we should refrain from making design decisions (such as API decisions) without Bob's explicit consent, unless we assign a new maintainer for the simplejson module (perhaps just for the 3k branch, which perhaps would be a fork from Bob's code). Antoine suggests that Bob did not comment on the issues at hand; therefore, we should not proceed with the proposed design. Since the 3.1 release is only a few weeks away, we have the choice of either shipping with the broken version that is currently in the 3k branch, or dropping the module from the 3k branch. I believe our users are better served by not having to waste time with a module that doesn't quite work, or may change. Most of the time I have to spend on json/simplejson and these mailing list discussions is on weekends; I try not to bother with it when I'm busy doing Actual Work unless there is a bug or some other issue that needs more immediate attention. I also wasn't aware that I was expected to comment on those issues. I'm CC'ed on the discussion for issue4136 but I don't see any unanswered questions directed at me. I have the issues (issue5723, issue4136) starred in my gmail and I planned to look at them more closely later, hopefully on Friday or Saturday. As far as Python 3 goes, I honestly have not yet familiarized myself with the changes to the IO infrastructure and what the new idioms are. At this time, I can't make any educated decisions with regard to how it should be done because I don't know exactly how bytes are supposed to work and what the common idioms are for other libraries in the stdlib that do similar things. 
Until I figure that out, someone else is better off making decisions about the Python 3 version. My guess is that it should work the same way as it does in Python 2.x: take bytes or unicode input in loads (which means encoding is still relevant). I also think the output of dumps should be bytes, since it is a serialization, but I am not sure how other libraries do this in Python 3 because one could argue that it is also text. If other libraries that do text/text encodings (e.g. binascii, mimelib, ...) use str for input and output instead of bytes, then maybe Antoine's changes are the right solution and I just don't know better because I'm not up to speed with how people write Python 3 code. I'll do my best to find some time to look into Python 3 more closely soon, but thus far I have not been very motivated to do so because Python 3 isn't useful for us at work and twiddling syntax isn't a very interesting problem for me to solve. -bob
Re: [Python-Dev] Dropping bytes support in json
As far as Python 3 goes, I honestly have not yet familiarized myself with the changes to the IO infrastructure and what the new idioms are. At this time, I can't make any educated decisions with regard to how it should be done because I don't know exactly how bytes are supposed to work and what the common idioms are for other libraries in the stdlib that do similar things. It's really very similar to 2.x: the bytes type is to be used in all interfaces that operate on byte sequences that may or may not represent characters; in particular, for interfaces where the operating system deliberately uses bytes - i.e. low-level file IO and socket IO; also for cases where the encoding is embedded in the stream that still needs to be processed (e.g. XML parsing). (Unicode) strings should be used where the data is truly text by nature, i.e. where no encoding information is necessary to find out what characters are intended. They are used on interfaces where the encoding is known (e.g. text IO, where the encoding is specified on opening; XML parser results, with the declared encoding; and GUI libraries, which naturally expect text). Until I figure that out, someone else is better off making decisions about the Python 3 version. Some of us can certainly explain to you how this is supposed to work. However, we need you to check any assumption against the known use cases - would the users of the module be happy if it worked one way or the other? My guess is that it should work the same way as it does in Python 2.x: take bytes or unicode input in loads (which means encoding is still relevant). I also think the output of dumps should also be bytes, since it is a serialization, but I am not sure how other libraries do this in Python 3 because one could argue that it is also text. This, indeed, had been an endless debate, and, in the end, the decision was somewhat arbitrary. 
Here are some examples:
- base64.encodestring expects bytes (naturally, since it is supposed to encode arbitrary binary data), and produces bytes (debatably)
- binascii.b2a_hex likewise (expects and produces bytes)
- pickle.dumps produces bytes (uniformly, both for binary and text pickles)
- marshal.dumps likewise
- email.message.Message().as_string produces a (unicode) string (see Barry's recent thread on whether that's a good thing; the email package hasn't been fully ported to 3k, either)
- the XML libraries (continue to) parse bytes, and produce Unicode strings
- for the IO libraries, see above
If other libraries that do text/text encodings (e.g. binascii, mimelib, ...) use str for input and output See above - most of them don't; mimetools no longer exists (it was replaced by the email package) instead of bytes then maybe Antoine's changes are the right solution and I just don't know better because I'm not up to speed with how people write Python 3 code. There isn't too much fresh end-user code out there, so we can't really tell, either. As for standard library users - users will do whatever the library forces them to do. This is why I'm so concerned about this issue: we should get it right, or not do it at all. I still think you would be the best person to determine what is right. I'll do my best to find some time to look into Python 3 more closely soon, but thus far I have not been very motivated to do so because Python 3 isn't useful for us at work and twiddling syntax isn't a very interesting problem for me to solve. And I didn't expect you to - it seems people are quite willing to do the actual work, as long as there is some guidance. Regards, Martin
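Martin's catalogue is easy to spot-check in a Python 3 interpreter. The snippet below uses the spellings that survive in modern Python 3 (e.g. base64.b64encode rather than the 2.x-era encodestring he names), so treat the exact function names as illustrative:

```python
import base64
import binascii
import marshal
import pickle
import xml.etree.ElementTree as ET

# base64/binascii: bytes in, bytes out
assert base64.b64encode(b"spam") == b"c3BhbQ=="
assert binascii.b2a_hex(b"\x00\xff") == b"00ff"

# serializations produce bytes, even for the "textual" pickle protocol 0
assert isinstance(pickle.dumps({"a": 1}, protocol=0), bytes)
assert isinstance(marshal.dumps(42), bytes)

# XML: parse bytes (encoding carried in or defaulted by the stream),
# get Unicode strings back out
elem = ET.fromstring(b"<root>caf\xc3\xa9</root>")
assert elem.text == "caf\u00e9"
```

This is exactly the split Martin describes: byte-oriented codecs and serializers stay bytes end-to-end, while parsers consume bytes and hand back text.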
Re: [Python-Dev] Rethinking intern() and its data structure
Also, consider that resizing has to evaluate every object, thus paging in all X bytes, and assigning to another 2X bytes. Cutting X by (potentially 3), would probably have a small but measurable effect. I'm *very* skeptical about claims on performance in the absence of actual measurements. Too many effects come together, so the actual performance is difficult to predict (and, for that prediction, you would need *at least* a work load that you want to measure - starting bzr would be such a workload, of course). Regards, Martin
Re: [Python-Dev] BLOBs in Pg
Oleg Broytmann wrote: On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote: If I can't pass a 256-byte string into a BLOB and get it back without anything like this happening then there's *something* in the chain that makes the database useless. import psycopg2 con = psycopg2.connect(database="test") cur = con.cursor() cur.execute("CREATE TABLE test (id serial, data BYTEA)") cur.execute('INSERT INTO test (data) VALUES (%s)', (psycopg2.Binary(''.join([chr(i) for i in range(256)])),)) cur.execute('SELECT * FROM test ORDER BY id') for rec in cur.fetchall(): print rec[0], type(rec[1]), repr(str(rec[1])) Result: 1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff' What am I doing wrong? Oleg. Corresponding with me, probably. Thank you Oleg. I feel suddenly saner again. regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ Watch PyCon on video now! http://pycon.blip.tv/
Re: [Python-Dev] Rethinking intern() and its data structure
On Apr 9, 2009, at 12:06 PM, Martin v. Löwis wrote: Now that you brought up specific numbers, I tried to verify them, and found them correct (although a bit unfortunate); please see my test script below. Up to 21800 interned strings, the dict takes (only) 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k interned strings is typical, I still don't know. Wrt. your proposed change, I would be worried about maintainability, in particular if it would copy parts of the set implementation. I connected to a random one of our processes, which has been running for a typical amount of time and is currently at ~300MB RSS. (gdb) p *(PyDictObject*)interned $2 = {ob_refcnt = 1, ob_type = 0x8121240, ma_fill = 97239, ma_used = 95959, ma_mask = 262143, ma_table = 0xa493c008, } Going from 3MB to 2.25MB isn't much, but it's not nothing, either. I'd be skeptical of cache performance arguments given that the strings used in any particular bit of code should be spread pretty much evenly throughout the hash table, and 3MB seems solidly bigger than any L2 cache I know of. You should be able to get meaningful numbers out of a C profiler, but I'd be surprised to see the act of interning taking a noticeable amount of time. -jake
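Jake's 3MB figure can be reproduced with back-of-the-envelope arithmetic. Assumptions: a 32-bit build, a Python 2.x dict slot of three machine words (hash, key, value), and a hypothetical set-like slot that drops the value pointer; his 2.25MB figure presumably assumes a somewhat different slot layout.

```python
# ma_mask from the gdb dump above; the table has ma_mask + 1 slots.
ma_mask = 262143
slots = ma_mask + 1                # 262144 hash-table slots

dict_table = slots * 12            # 12 bytes/slot: (hash, key, value) on 32-bit
set_like_table = slots * 8         # 8 bytes/slot: (hash, key) only

print(dict_table / 2.0 ** 20)      # 3.0 MiB, matching the figure quoted above
print(set_like_table / 2.0 ** 20)  # 2.0 MiB
```

The saving is exactly the "cut X by a third" that John estimates, since one of three pointer-sized fields per slot disappears.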
Re: [Python-Dev] Rethinking intern() and its data structure
John Arbash Meinel wrote: And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. Keep in mind that intern() is called fairly rarely, mostly only at module load time. It may not be worth attempting to speed it up. -- Greg
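The single-lookup behaviour John is asking for is what dict.setdefault gives you at the Python level. This is a toy stand-in for the C-level implementation; the function name and module-level dict are hypothetical:

```python
# Hypothetical intern table: one dict.setdefault call serves as both the
# membership test and the insertion, instead of a get() followed by a set().
_interned = {}

def intern_string(s):
    """Return the canonical copy of s, storing s if it is the first seen."""
    return _interned.setdefault(s, s)

# Two equal but distinct string objects collapse to one canonical object.
a = intern_string("".join(["he", "llo"]))
b = intern_string("".join(["hel", "lo"]))
assert a is b
```

The CPython function in question is the C-level `PyString_InternInPlace`; the sketch only illustrates why a setdefault-style operation halves the lookups on the hit path.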
Re: [Python-Dev] Rethinking intern() and its data structure
2009/4/9 Greg Ewing greg.ew...@canterbury.ac.nz: John Arbash Meinel wrote: And when you look at the intern function, it doesn't use setdefault logic, it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on likelihood, etc. Keep in mind that intern() is called fairly rarely, mostly only at module load time. It may not be worth attempting to speed it up. That's very important, though, for a command line tool like Bazaar. Even a few fractions of a second can make a difference in user perception of speed. -- Regards, Benjamin
Re: [Python-Dev] Rethinking intern() and its data structure
Greg Ewing wrote: John Arbash Meinel wrote: And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module. Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms. John =:-
Re: [Python-Dev] Evaluated cmake as an autoconf replacement
cmake does not produce relative paths in its generated make and project files. There is an option CMAKE_USE_RELATIVE_PATHS which appears to do this, but the documentation says: This option does not work for more complicated projects, and relative paths are used when possible. In general, it is not possible to move CMake generated makefiles to a different location regardless of the value of this variable. This means that generated Visual Studio project files will not work for other people unless a particular absolute build location is specified for everyone, which will not suit most. Each person that wants to build Python will have to run cmake before starting Visual Studio, thus increasing the prerequisites. Neil
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 8:07 AM, Steve Holden wrote: The real problem I came across in storing email in a relational database was the inability to store messages as Unicode. Some messages have a body in one encoding and an attachment in another, so the only ways to store the messages are either as a monolithic bytes string that gets parsed when the individual components are required or as a sequence of components in the database's preferred encoding (if you want to keep the original encoding most relational databases won't be able to help unless you store the components as bytes). All in all, as you might expect from a system that's been growing up since 1970 or so, it can be quite intractable. There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into strings for text/* types and bytes for anything else (not counting multiparts). The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course. It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application. Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. 
OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that makes a true Python 3.x API for email more complicated. Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x -Barry
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 11:08 AM, Bill Janssen wrote: Barry Warsaw ba...@python.org wrote: Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments). Probably a good thing. It just promotes more confusion to do things that way, IMO. Very possibly so. But applications will definitely want stuff like the text/plain payload as a unicode, or the image/gif payload as a bytes (or even as a PIL image or whatever). Not that I think the email package needs to know about every content type under the sun, but I do think that it should be pluggable so as to allow applications to more conveniently access the data that way. Possibly the defaults should be unicodes for any text/* type and bytes for everything else. -Barry
Re: [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)
On Apr 9, 2009, at 11:11 PM, gl...@divmod.com wrote: I think this is a problematic way to model bytes vs. text; it gives text a special relationship to bytes which should be avoided. IMHO the right way to think about domains like this is a multi-level representation. The low level representation is always bytes, whether your MIME type is text/whatever or application/x-i-dont-know. This is a really good point, and I really should be clearer when describing my current thinking (sleep would help :). The thing that's special about text is that it's a high level representation that the standard library can know about. But the 'email' package ought to support being extended to support other types just as well. For example, I want to ask for image/png content as PIL.Image objects, not bags of bytes. Of course this presupposes some way for PIL itself to get at some bytes, but then you need the email module itself to get at the bytes to convert to text in much the same way. There also needs to be layering at the level of bytes -> base64 -> some different bytes -> PIL Image. There are mail clients that will base64-encode unusual encodings so you have to do that same layering for text sometimes. I'm also being somewhat handwavy with talk of low and high level representations; of course there are actually multiple levels beyond that. I might want text/x-python content to show up as an AST, but the intermediate DOM-parsing representation really wants to operate on characters. Similarly for a DOM and text/html content. (Modulo the usual encoding-detection weirdness present in parsers.) When I was talking about supporting text/* content types as strings, I was definitely thinking about using basically the same plug-in or higher level or whatever API to do that as you might use to get PIL images from an image/gif. 
So, as long as there's a crisp definition of what layer of the MIME stack one is operating on, I don't think that there's really any ambiguity at all about what type you should be getting. In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center API first, and build things on top of that. -Barry
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 10:52 PM, Aahz wrote: On Thu, Apr 09, 2009, Barry Warsaw wrote: So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return: message['Subject'] The raw bytes or the decoded unicode? Let's make that the raw bytes by default -- we can add a parameter to Message() to specify that the default where possible is unicode for returned values, if that isn't too painful. I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first. -Barry
Re: [Python-Dev] Dropping bytes support in json
Barry Warsaw wrote: I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first. Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one. So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ---
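The io-stack model Nick points to can be seen directly with io.BytesIO and io.TextIOWrapper. A minimal sketch of the layering he describes:

```python
import io

# Bytes buffer at the bottom of the stack...
raw = io.BytesIO()
# ...with a text API layered on top, mediated by a codec.
text = io.TextIOWrapper(raw, encoding="utf-8")

text.write("price: \u20ac5")  # characters go in through the text layer
text.flush()
print(raw.getvalue())         # b'price: \xe2\x82\xac5' at the bytes layer

# Reading runs the same stack in reverse: bytes in, str out.
raw_in = io.BytesIO("caf\u00e9".encode("utf-8"))
assert io.TextIOWrapper(raw_in, encoding="utf-8").read() == "caf\u00e9"
```

This is exactly the shape Nick suggests for json: a bytes API underneath, with the text view as an optional codec-mediated wrapper rather than the foundation.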
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 2:25 PM, Martin v. Löwis wrote: This is an interesting question, and something I'm struggling with for the email package for 3.x. It turns out to be pretty convenient to have both a bytes and a string API, both for input and output, but I think email really wants to be represented internally as bytes. Maybe. Or maybe just for content bodies and not headers, or maybe both. Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments). If you allow for content-transfer-encoding: 8bit, I think there is just no way to represent email as text. You have to accept conversion to, say, base64 (or quoted-unreadable) when converting an email message to text. Agreed. But applications will want to deal with some parts of the message as text on the boundaries. Internally, it should be all bytes (although even that is a pain to write ;). -Barry
[Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)
On 02:26 am, ba...@python.org wrote: There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into strings for text/* types and bytes for anything else (not counting multiparts). I think this is a problematic way to model bytes vs. text; it gives text a special relationship to bytes which should be avoided. IMHO the right way to think about domains like this is a multi-level representation. The low level representation is always bytes, whether your MIME type is text/whatever or application/x-i-dont-know. The thing that's special about text is that it's a high level representation that the standard library can know about. But the 'email' package ought to support being extended to support other types just as well. For example, I want to ask for image/png content as PIL.Image objects, not bags of bytes. Of course this presupposes some way for PIL itself to get at some bytes, but then you need the email module itself to get at the bytes to convert to text in much the same way. There also needs to be layering at the level of bytes -> base64 -> some different bytes -> PIL Image. There are mail clients that will base64-encode unusual encodings so you have to do that same layering for text sometimes. I'm also being somewhat handwavy with talk of low and high level representations; of course there are actually multiple levels beyond that. I might want text/x-python content to show up as an AST, but the intermediate DOM-parsing representation really wants to operate on characters. Similarly for a DOM and text/html content. (Modulo the usual encoding-detection weirdness present in parsers.) So, as long as there's a crisp definition of what layer of the MIME stack one is operating on, I don't think that there's really any ambiguity at all about what type you should be getting. 
Re: [Python-Dev] Dropping bytes support in json
On Thu, Apr 09, 2009, Barry Warsaw wrote: So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return: message['Subject'] The raw bytes or the decoded unicode? Let's make that the raw bytes by default -- we can add a parameter to Message() to specify that the default where possible is unicode for returned values, if that isn't too painful. Here's my reasoning: ultimately, everyone NEEDS to understand that the underlying transport for e-mail is bytes (similar to sockets). We do people no favors by pasting over this too much. We can overlay convenience at various points, but except for text payloads, everything should be bytes by default. Even for text payloads, I'm not entirely certain the default shouldn't be bytes: consider an HTML attachment that you want to compare against the output from a webserver. Still, as long as it's easy to get bytes for text payloads, I think overall I'm still leaning toward unicode for them. -- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/ Why is this newsgroup different from all other newsgroups?
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote: On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw ba...@python.org wrote: Anyway, aside from that decision, I haven't come up with an elegant way to allow /output/ in both bytes and strings (input is I think theoretically easier by sniffing the arguments). Won't this work? (assuming dumps() always returns a string) def dumpb(obj, encoding='utf-8', *args, **kw): s = dumps(obj, *args, **kw) return s.encode(encoding) So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return: message['Subject'] The raw bytes or the decoded unicode? Okay, so you've picked one. Now how do you spell the other way? The Message class probably has these explicit methods: Message.get_header_bytes('Subject') Message.get_header_string('Subject') (or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice? Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably. Message.set_header('Subject', 'Some text', encoding='utf-8') Message.set_header('Subject', b'Some bytes') One of those maps to message['Subject'] = ??? I'm open to any suggestions here! -Barry PGP.sig Description: This is a digitally signed message part ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
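Daniel's dumpb() wrapper has a natural bytes-accepting counterpart for input. Here is a sketch (the `loadb` and `detect_json_encoding` names are hypothetical, not from the thread) that wires in the RFC 4627 null-pattern heuristic discussed earlier, assuming no BOM and at least four octets of input; it is written in modern Python 3.

```python
import json

def detect_json_encoding(data):
    """RFC 4627's heuristic: the first two characters of a JSON text are
    ASCII, so the pattern of NUL octets reveals the encoding."""
    if len(data) >= 4:
        if data[0] == 0 and data[1] == 0:
            return "utf-32-be"                      # 00 00 00 xx
        if data[0] == 0:
            return "utf-16-be"                      # 00 xx 00 xx
        if data[1] == 0 and data[2] == 0 and data[3] == 0:
            return "utf-32-le"                      # xx 00 00 00
        if data[1] == 0:
            return "utf-16-le"                      # xx 00 xx 00
    return "utf-8"

def loadb(data, *args, **kw):
    """Bytes-accepting counterpart to Daniel's dumpb()."""
    return json.loads(data.decode(detect_json_encoding(data)), *args, **kw)
```

Note that, as Dirkjan points out in the earlier message, the heuristic only works for documents that start with { or [, per the RFC grammar.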
Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)
On Apr 9, 2009, at 12:20 PM, Steve Holden wrote: PostgreSQL strongly encourages you to store text as encoded columns. Because emails lack an encoding it turns out this is a most inconvenient storage type for it. Sadly BLOBs are such a pain in PostgreSQL that it's easier to store the messages in external files and just use the relational database to index those files to retrieve content, so that's what I ended up doing. That's not insane for other reasons. Do you really want to store 10MB of mp3 data in your database? Which of course reminds me that I want to add an interface, probably to the parser and message class, to allow an application to store message payloads in other than memory. Parsing and holding onto messages with huge payloads can kill some applications, when you might not care too much about the actual payload content. Barry PGP.sig Description: This is a digitally signed message part ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Dropping bytes support in json
On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote: Barry Warsaw wrote: I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first. Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one. Agreed! So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs). Yes, that's a very interesting (and proven?) model. I don't quite see how we could apply that to email and json, but it seems like there's a good idea there. ;) -Barry ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
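The IO-library model referred to above is concrete in Python 3's io module: a raw bytes buffer at the bottom of the stack, with a codec-mediated text view layered on top.

```python
import io

raw = io.BytesIO()                        # the bytes layer
text = io.TextIOWrapper(raw, encoding="utf-8")  # the text layer

text.write("héllo")   # callers at the text layer see only str
text.flush()          # the codec pushes encoded bytes down the stack

encoded = raw.getvalue()   # the bytes layer holds b'h\xc3\xa9llo'
```

The same object graph can be read in either direction: peel the text wrapper off and you are back at bytes, which is exactly the property being asked for in the email and json APIs.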
Re: [Python-Dev] [Email-SIG] Dropping bytes support in json
At 22:38 -0400 04/09/2009, Barry Warsaw wrote: ... So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return: message['Subject'] The raw bytes or the decoded unicode? That's an easy one: Subject: is an unstructured header, so it must be text, thus Unicode. We're looking at a high-level representation of an email message, with parsed header fields and a MIME message tree. Okay, so you've picked one. Now how do you spell the other way? message.get_header_bytes('Subject') Oh, I see that's what you picked. The Message class probably has these explicit methods: Message.get_header_bytes('Subject') Message.get_header_string('Subject') (or better names... it's late and I'm tired ;). One of those maps to message['Subject'] but which is the more obvious choice? Structured header fields are more of a problem. Any header with addresses should return a list of addresses. I think the default return type should depend on the data type. To get an explicit bytes or string or list of addresses, be explicit; otherwise, for convenience, return the appropriate type for the particular header field name. Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably. Never for header fields. The default is always RFC 2047, unless it isn't, say for params. The Message class should create an object of the appropriate subclass of Header based on the name (or use the existing object, see other discussion), and that should inspect its argument and DTRT or complain. 
Message.set_header('Subject', 'Some text', encoding='utf-8') Message.set_header('Subject', b'Some bytes') One of those maps to message['Subject'] = ??? The expected data type should depend on the header field. For Subject:, it should be bytes to be parsed or verbatim text. For To:, it should be a list of addresses or bytes or text to be parsed. The email package should be pythonic, and not require deep understanding of dozens of RFCs to use properly. Users don't need to know about the raw bytes; that's the whole point of MIME and any email package. It should be easy to set header fields with their natural data types, and doing it with bad data should produce an error. This may require a bit more care in the message parser, to always produce a parsed message with defects. -- TonyN.:' mailto:tonynel...@georgeanelson.com ' http://www.georgeanelson.com/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
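The per-field dispatch Tony describes can be sketched as follows. This is an editor's illustration of the idea only: the class and function names are hypothetical, the address parsing is deliberately naive, and real header parsing would follow RFC 5322.

```python
# The Message would pick a Header subclass from the field name and let it
# validate its argument (all names here are hypothetical).
ADDRESS_FIELDS = {"to", "cc", "bcc", "from", "reply-to"}

class UnstructuredHeader:
    """Fields like Subject: hold text; raw bytes must be ASCII or fail."""
    def __init__(self, value):
        if isinstance(value, bytes):
            value = value.decode("ascii")   # strict: complain on non-ASCII
        self.text = value

class AddressHeader:
    """Fields like To: hold a list of addresses; strings get parsed."""
    def __init__(self, value):
        if isinstance(value, str):
            value = [a.strip() for a in value.split(",")]  # naive parse
        self.addresses = list(value)

def make_header(name, value):
    cls = AddressHeader if name.lower() in ADDRESS_FIELDS else UnstructuredHeader
    return cls(value)
```

Setting a header with bad data (say, non-ASCII raw bytes on an unstructured field) raises immediately, which matches the "doing it with bad data should produce an error" requirement above.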
Re: [Python-Dev] Rethinking intern() and its data structure
On 9-Apr-09, at 6:24 PM, John Arbash Meinel wrote: Greg Ewing wrote: John Arbash Meinel wrote: And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module. Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms. I don't want to quash your idealism too severely, but it is extremely unlikely that you are going to get anywhere near that kind of speed up by tweaking string interning. 25k times doing anything (computation) just isn't all that much. $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)' 100 loops, best of 3: 8.28 msec per loop Perhaps this isn't representative (int hashing is ridiculously cheap, for instance), but the dict itself is far bigger than the dict you are dealing with and as such would have similar cache-busting properties. And yet, 25k accesses (plus Python-to-C dispatching costs which you are paying with interning) consume only ~10ms. You could do more good by eliminating a handful of disk seeks by reducing the number of imported modules... -Mike ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] decorator module in stdlib?
On Wed, Apr 8, 2009 at 9:31 PM, Michele Simionato michele.simion...@gmail.com wrote: Then perhaps you misunderstand the goal of the decorator module. The raison d'etre of the module is to PRESERVE the signature: update_wrapper unfortunately *changes* it. When confronted with a library which I do not know, I often run pydoc, or sphinx, or a custom-made documentation tool over it, to extract the signature of functions. Ah, I see. Personally I rarely trust automatically extracted documentation -- too often in my experience it is out of date or simply absent. Extracting the signatures in theory wouldn't lie, but in practice I still wouldn't trust it -- not only because of what decorators might or might not do, but because it might still be misleading. Call me old-fashioned, but I prefer to read the source code. For instance, if I see a method get_user(self, username) I have a good hint about what it is supposed to do. But if the library (say a web framework) uses non-signature-preserving decorators, my documentation tool says to me that there is a function get_user(*args, **kwargs) which frankly is not enough [this is the optimistic case, when the author of the decorator has taken care to preserve the name of the original function]. But seeing the decorator is often essential for understanding what goes on! Even if the decorator preserves the signature (in truth or according to inspect), many decorators *do* something, and it's important to know how a function is decorated. For example, I work a lot with a small internal framework at Google whose decorators can raise exceptions and set instance variables; they also help me understand under which conditions a method can be called. I *hate* losing information about the true signature of functions, since I also use IPython, Python help, etc. a lot. I guess we just have different styles. That's fine. I must admit that while I still like decorators, I do like them as much as in the past.
Of course there was a missing NOT in this sentence, but you all understood the intended meaning. (All this BTW is not to say that I don't trust you with commit privileges if you were to be interested in contributing. I just don't think that adding that particular decorator module to the stdlib would be wise. It can be debated though.) Fine. As I have repeated many times, that particular module was never meant for inclusion in the standard library. Then perhaps it shouldn't -- I haven't looked but if you don't plan stdlib inclusion it is often the case that the API style and/or implementation details make stdlib inclusion unrealistic. (Though admittedly some older modules wouldn't be accepted by today's standards either -- and I'm not just talking PEP-8 compliance! :-) But I feel strongly about the possibility of being able to preserve (not change!) the function signature. That could be added to functools if enough people want it. I do not think everybody disagrees with your point here. My point still stands, though: objects should not lie about their signature, especially during debugging and when generating documentation from code. Source code never lies. Debuggers should make access to the source code a key point. And good documentation should be written by a human, not automatically cobbled together from source code and a few doc strings. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] [Email-SIG] Dropping bytes support in json
At 22:26 -0400 04/09/2009, Barry Warsaw wrote: There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into strings for text/ * types and bytes for anything else (not counting multiparts). The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course. Storing an email message in a database does mean storing some of the header fields as database fields, but the set of email header fields is open, so any unused fields in a message must be stored elsewhere. It isn't useful to just have a bag of name/value pairs in a table. General message MIME payload trees don't map well to a database either, unless one wants to get very relational. Sometimes the database needs to represent the entire email message, header fields and MIME tree, but only if it is an email program and usually not even then. Usually, the database has a specific purpose, and can be designed for the data it cares about; it may choose to keep the original message as bytes. It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application. 
I think of header fields as having text-like names (the set of allowed characters is more than just text, though defined headers don't make use of that), but the data is either bytes or it should be parsed into something appropriate: text for unstructured fields like Subject:, a list of addresses for address fields like To:. Many of the structured header fields have a reasonable mapping to text; certainly this is true for address header fields. Content-Type header fields are barely text, they can be so convolutedly structured, but I suppose one could flatten one of them to text instead of bytes if the user wanted. It's not very useful, though, except for debugging (either by the programmer or the recipient who wants to know what was cleaned from the message). Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated. If a database stores the Subject: header field, it would be as text. The various recipient address fields are a one message to many names and addresses mapping, and need a related table of name/address fields, with each field being text. The original message (or whatever part of it one preserves) should be bytes. I don't think this complicates the email package API; rather, it just shows where generality is needed. Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x You now have the opportunity to finally unsnarl that mess. It is not an insurmountable opportunity.
-- TonyN.:' mailto:tonynel...@georgeanelson.com ' http://www.georgeanelson.com/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel john.arbash.mei...@gmail.com wrote: Greg Ewing wrote: John Arbash Meinel wrote: And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module. Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms. I think you have plenty of a case to try it out. If you code it up and it doesn't speed anything up, well then we've learned something, and maybe it'll be useful anyway for the memory savings. If it does speed things up, well then Python's faster. I wouldn't waste time arguing about it before you have the change written. Good luck! Jeffrey ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel john.arbash.mei...@gmail.com wrote: Greg Ewing wrote: John Arbash Meinel wrote: And the way intern is currently written, there is a third cost when the item doesn't exist yet, which is another lookup to insert the object. That's even rarer still, since it only happens the first time you load a piece of code that uses a given variable name anywhere in any module. Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms. Quite so. We have a number of internal tools, and they find that frequently just starting up Python takes several times the duration of the actual work unit itself. I'd be very interested to review any patches you come up with to improve start-up time; so far on this thread, there's been a lot of theory and not much practice. I'd approach this iteratively: first replace the dict with a set, then if that bears fruit, consider a customized data structure; if that bears fruit, etc. Good luck, and be sure to let us know what you find, Collin Winter ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
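For reference, intern()'s behavior can be modelled in a few lines of pure Python. This editor's sketch (the `my_intern` name is hypothetical) shows why the interned table is a natural fit for a set: the key and the value are always the same object, so a dict stores each pointer twice, which is the redundancy the proposed set-based structure would remove at the C level.

```python
# Pure-Python model of intern(): a table of canonical strings.
_interned = {}

def my_intern(s):
    # setdefault does one lookup for the common hit case; on a miss it
    # stores s as both key and value -- the duplicated pointer a set
    # representation would avoid.
    return _interned.setdefault(s, s)

# Build two equal-but-distinct strings (join defeats constant folding).
p = "".join(["sp", "am"])
q = "".join(["spa", "m"])
assert p is not q
assert my_intern(p) is my_intern(q)   # both collapse to one canonical object
```

The model also makes the thread's third cost visible: on a miss, setdefault must both probe and insert, which is the extra work John measured during bzr startup.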
Re: [Python-Dev] Rethinking intern() and its data structure
On Thu, Apr 9, 2009 at 9:07 PM, Collin Winter coll...@gmail.com wrote: On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel john.arbash.mei...@gmail.com wrote: And I would be a *lot* happier if startup time was 100ms instead of 400ms. Quite so. We have a number of internal tools, and they find that frequently just starting up Python takes several times the duration of the actual work unit itself. I'd be very interested to review any patches you come up with to improve start-up time; so far on this thread, there's been a lot of theory and not much practice. I'd approach this iteratively: first replace the dict with a set, then if that bears fruit, consider a customized data structure; if that bears fruit, etc. Good luck, and be sure to let us know what you find, Just to add some skepticism, has anyone done any kind of instrumentation of bzr start-up behavior? IIRC every time I was asked to reduce the start-up cost of some Python app, the cause was too many imports, and the solution was either to speed up import itself (.pyc files were the first thing ever that came out of that -- importing from a single .zip file is one of the more recent tricks) or to reduce the number of modules imported at start-up (or both :-). Heavy-weight frameworks are usually the root cause, but usually there's nothing that can be done about that by the time you've reached this point. So, amen on the good luck, but please start with a bit of analysis. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)
On Thu, Apr 9, 2009 at 5:53 AM, Aahz a...@pythoncraft.com wrote: On Thu, Apr 09, 2009, Nick Coghlan wrote: Martin v. L?wis wrote: Such a policy would then translate to a dead end for Python 2.x based applications. 2.x based applications *are* in a dead end, with the only exit being portage to 3.x. The actual end of the dead end just happens to be in 2013 or so :) More like 2016 or 2020 -- as of January, my former employer was still using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out in the wilds. The transition to 3.x is more extreme, and lots of people will continue making do for years after any formal support is dropped. There's nothing wrong with that. People using 1.5.2 today certainly aren't asking for support, and people using 2.3 probably aren't expecting much either. That's fine, those Python versions are as stable as the rest of their environment. (I betcha they're still using GCC 2.96 too, though they probably don't have any reason to build a new Python binary from source. :-) People *will* be using 2.6 well past 2013. But will they care about the Python community actively supporting it? Of course not! Anything we did would probably break something for them. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Rethinking intern() and its data structure
... Somewhat true, though I know it happens 25k times during startup of bzr... And I would be a *lot* happier if startup time was 100ms instead of 400ms. I don't want to quash your idealism too severely, but it is extremely unlikely that you are going to get anywhere near that kind of speed up by tweaking string interning. 25k times doing anything (computation) just isn't all that much. $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)' 100 loops, best of 3: 8.28 msec per loop Perhaps this isn't representative (int hashing is ridiculously cheap, for instance), but the dict itself is far bigger than the dict you are dealing with and as such would have similar cache-busting properties. And yet, 25k accesses (plus Python-to-C dispatching costs which you are paying with interning) consume only ~10ms. You could do more good by eliminating a handful of disk seeks by reducing the number of imported modules... -Mike You're also using timeit over the same set of 25k keys, which means it only has to load that subset. And as you are using identical runs each time, those keys are already loaded into your cache lines... And given how hash(int) works, they are all sequential in memory, and all 10M in your original set have 0 collisions. Actually, at 10M, you'll have a dict of size 20M entries, and the first 10M entries will be full, and the trailing 10M entries will all be empty. That said, you're right, the benefits of a smaller structure are going to be small. I'll just point out that if I just do a small tweak to your timing and do: $ python -mtimeit -s 'd=dict.fromkeys(xrange(10000000))' 'for x in xrange(25000): d.get(x)' 100 loops, best of 3: 6.27 msec per loop So slightly faster than yours, *but*, let's try a much smaller dict: $ python -mtimeit -s 'd=dict.fromkeys(xrange(25000))' 'for x in xrange(25000): d.get(x)' 100 loops, best of 3: 6.35 msec per loop Pretty much the same time. Well within the noise margin.
But if I go back to the big dict and actually select 25k keys across the whole set: $ TIMEIT -s 'd=dict.fromkeys(xrange(10000000))' \ -s 'keys=range(0, 10000000, 10000000/25000)' \ 'for x in keys: d.get(x)' 100 loops, best of 3: 13.1 msec per loop Now I'm still accessing 25k keys, but I'm doing it across the whole range, and suddenly the time *doubled*. What about slightly more random access: $ TIMEIT -s 'import random; d=dict.fromkeys(xrange(10000000));' -s 'bits = range(0, 10000000, 400); random.shuffle(bits)' 'for x in bits: d.get(x)' 100 loops, best of 3: 15.5 msec per loop Not as big of a difference as I thought it would be... But I bet if there was a way to put the random shuffle in the inner loop, so you weren't accessing the same identical 25k keys internally, you might get more interesting results. As for other bits about exercising caches: $ shuffle(range(0, 10000000, 400)) 100 loops, best of 3: 15.5 msec per loop $ shuffle(range(0, 10000000, 40)) 10 loops, best of 3: 175 msec per loop 10x more keys, costs 11.3x, pretty close to linear. $ shuffle(range(0, 10000000, 10)) 10 loops, best of 3: 739 msec per loop 4x the keys, 4.5x the time, starting to get more into nonlinear effects. Anyway, you're absolutely right. intern() overhead is a tiny fraction of 'import bzrlib.*' time, so I don't expect to see amazing results. That said, accessing 25k keys in a smaller structure is 2x faster than accessing 25k keys spread across a larger structure. John =:- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Dropping bytes support in json
On 02:38 am, ba...@python.org wrote: So, what I'm really asking is this. Let's say you agree that there are use cases for accessing a header value as either the raw encoded bytes or the decoded unicode. What should this return: message['Subject'] The raw bytes or the decoded unicode? My personal preference would be to just get deprecate this API, and get rid of it, replacing it with a slightly more explicit one. message.headers['Subject'] message.bytes_headers['Subject'] Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to utf-8 probably. message.headers['Subject'] = 'Some text' should be equivalent to message.headers['Subject'] = Header('Some text') My preference would be that message.headers['Subject'] = b'Some Bytes' would simply raise an exception. If you've got some bytes, you should instead do message.bytes_headers['Subject'] = b'Some Bytes' or message.headers['Subject'] = Header(bytes=b'Some Bytes', encoding='utf-8') Explicit is better than implicit, right? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
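Glyph's two-view proposal can be sketched over a single byte-level store. This is an editor's illustration only: the `Headers`/`BytesHeaders` names follow the message above, the implementation uses the stdlib's RFC 2047 machinery, keeps just the first decoded chunk, and is written in modern Python 3.

```python
from email.header import Header, decode_header

class Headers:
    """Text view over a shared name -> encoded-bytes store; assigning
    bytes raises, as proposed above."""
    def __init__(self, store):
        self._store = store

    def __setitem__(self, name, value):
        if isinstance(value, bytes):
            raise TypeError("raw octets go through the bytes view")
        # RFC-2047-encode on the way in; the store only ever holds ASCII bytes
        self._store[name] = Header(value, "utf-8").encode().encode("ascii")

    def __getitem__(self, name):
        # sketch: only the first decoded chunk is returned
        text, charset = decode_header(self._store[name].decode("ascii"))[0]
        return text.decode(charset) if isinstance(text, bytes) else text

class BytesHeaders:
    """Bytes view over the same store."""
    def __init__(self, store):
        self._store = store

    def __setitem__(self, name, value):
        self._store[name] = bytes(value)

    def __getitem__(self, name):
        return self._store[name]
```

With one store and two views, message.headers and message.bytes_headers can never disagree, which is the appeal of the explicit API.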
Re: [Python-Dev] Dropping bytes support in json
On 03:21 am, ncogh...@gmail.com wrote: Barry Warsaw wrote: I don't know whether the parameter thing will work or not, but you're probably right that we need to get the bytes-everywhere API first. Given that json is a wire protocol, that sounds like the right approach for json as well. Once bytes-everywhere works, then a text API can be built on top of it, but it is difficult to build a bytes API on top of a text one. I wish I could agree, but JSON isn't really a wire protocol. According to http://www.ietf.org/rfc/rfc4627.txt JSON is a text format for the serialization of structured data. There are some notes about encoding, but it is very clearly described in terms of unicode code points. So I guess the IO library *is* the right model: bytes at the bottom of the stack, with text as a wrapper around it (mediated by codecs). In email's case this is true, but in JSON's case it's not. JSON is a format defined as a sequence of code points; MIME is defined as a sequence of octets. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
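The code-points-first view argued above is visible in Python 3's json module itself: the serializer produces text, and octets appear only when a codec is explicitly applied.

```python
import json

doc = {"name": "héllo"}

# Serialization stays at the code-point level...
text = json.dumps(doc, ensure_ascii=False)   # a str: '{"name": "héllo"}'

# ...and bytes only exist once a codec is chosen, as a separate step.
wire = text.encode("utf-8")

roundtrip = json.loads(wire.decode("utf-8"))
```

Contrast this with email, where the wire octets are the definition and text is the derived view; the two libraries sit on opposite sides of the codec.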
Re: [Python-Dev] [Email-SIG] Dropping bytes support in json
Barry Warsaw writes: There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Indeed! Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into strings for text/* types and bytes for anything else (not counting multiparts). *sigh* Why are you back-tracking? The payload should be of an appropriate *object* type. Atomic object types will have their content stored as string or bytes [nb I use Python 3 terminology throughout]. Composite types (multipart/*) won't need string or bytes attributes AFAICS. Start by implementing the application/octet-stream and text/plain;charset=utf-8 object types, of course. It does seem to make sense to think about headers as text header names and text header values. I disagree. IMHO, structured header types should have object values, and something like message['to'] = "Barry 'da FLUFL' Warsaw <ba...@python.org>" should be smart enough to detect that it's a string and attempt to (flexibly) parse it into a fullname and a mailbox adding escapes, etc. Whether these should be structured objects or they can be strings or bytes, I'm not sure (probably bytes, not strings, though -- see next example). OTOH message['to'] = b'''Barry 'da.FLUFL' Warsaw <ba...@python.org>''' should assume that the client knows what they are doing, and should parse it strictly (and I mean be a real bastard, eg, raise an exception on any non-ASCII octet), merely dividing it into fullname and mailbox, and caching the bytes for later insertion in a wire-format message. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes.
It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated. I don't see why you can't have the email API be specific, with message['to'] always returning a structured_header object (or maybe even more specifically an address_header object), and methods like message['to'].build_header_as_text() which returns "To: Barry 'da.FLUFL' Warsaw <ba...@python.org>" and message['to'].build_header_in_wire_format() which returns b"To: Barry 'da.FLUFL' Warsaw <ba...@python.org>" Then have email.textview.Message and email.wireview.Message which provide a simple interface where message['to'] would invoke .build_header_as_text() and .build_header_in_wire_format() respectively. Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x Er, yeah. Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs, ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
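The structured-header object described above might look like this. An editor's sketch: the class and method names follow the message's examples, the formatting is simplistic (no RFC 2047 encoding, no escaping), and the address is a stand-in.

```python
class AddressHeader:
    """One structured header value with both a text and a wire rendering."""
    def __init__(self, fullname, mailbox):
        self.fullname = fullname
        self.mailbox = mailbox

    def build_header_as_text(self):
        return 'To: "{}" <{}>'.format(self.fullname, self.mailbox)

    def build_header_in_wire_format(self):
        # the strict path: any non-ASCII content raises rather than guessing
        return self.build_header_as_text().encode("ascii")

h = AddressHeader("Barry 'da.FLUFL' Warsaw", "user@example.org")
```

A textview/wireview Message wrapper would then just delegate message['to'] to one rendering method or the other.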
Re: [Python-Dev] decorator module in stdlib?
Then perhaps you misunderstand the goal of the decorator module. The raison d'etre of the module is to PRESERVE the signature: update_wrapper unfortunately *changes* it. When confronted with a library which I do not know, I often run pydoc, or sphinx, or a custom-made documentation tool over it, to extract the signature of functions. Ah, I see. Personally I rarely trust automatically extracted documentation -- too often in my experience it is out of date or simply absent. Extracting the signatures in theory wouldn't lie, but in practice I still wouldn't trust it -- not only because of what decorators might or might not do, but because it might still be misleading. Call me old-fashioned, but I prefer to read the source code. For instance, if I see a method get_user(self, username) I have a good hint about what it is supposed to do. But if the library (say a web framework) uses non-signature-preserving decorators, my documentation tool says to me that there is a function get_user(*args, **kwargs) which frankly is not enough [this is the optimistic case, when the author of the decorator has taken care to preserve the name of the original function]. But seeing the decorator is often essential for understanding what goes on! Even if the decorator preserves the signature (in truth or according to inspect), many decorators *do* something, and it's important to know how a function is decorated. For example, I work a lot with a small internal framework at Google whose decorators can raise exceptions and set instance variables; they also help me understand under which conditions a method can be called. I *hate* losing information about the true signature of functions, since I also use IPython, Python help, etc. a lot. I guess we just have different styles. That's fine. I must admit that while I still like decorators, I do like them as much as in the past. Of course there was a missing NOT in this sentence, but you all understood the intended meaning.
(All this BTW is not to say that I don't trust you with commit privileges if you were to be interested in contributing. I just don't think that adding that particular decorator module to the stdlib would be wise. It can be debated though.) Fine. As I have repeated many times, that particular module was never meant for inclusion in the standard library. Then perhaps it shouldn't -- I haven't looked but if you don't plan stdlib inclusion it is often the case that the API style and/or implementation details make stdlib inclusion unrealistic. (Though admittedly some older modules wouldn't be accepted by today's standards either -- and I'm not just talking PEP-8 compliance! :-) But I feel strongly about the possibility of being able to preserve (not change!) the function signature. That could be added to functools if enough people want it. My original suggestion for inclusion in stdlib was motivated by this reason alone: I'd like to see one official way of preserving function signatures by decorators. If there are better ways of doing it than the decorator module, that's totally fine, but there should be one. Cheers, Daniel I do not think everybody disagrees with your point here. My point still stands, though: objects should not lie about their signature, especially during debugging and when generating documentation from code. Source code never lies. Debuggers should make access to the source code a key point. And good documentation should be written by a human, not automatically cobbled together from source code and a few doc strings. -- Psss, psss, put it down! - http://www.cafepress.com/putitdown ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
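For the record, the signature problem debated in this thread was later addressed in the stdlib itself: in modern Python 3, functools.wraps records the original function as __wrapped__, and inspect.signature() follows it, so introspection reports the real parameters (machinery that postdates this 2009 discussion; the decorator module solved it differently, by generating a wrapper with the true signature).

```python
import functools
import inspect

calls = []

def logged(func):
    @functools.wraps(func)           # copies metadata and sets __wrapped__
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)  # the decorator *does* something
        return func(*args, **kwargs)
    return wrapper

@logged
def get_user(self, username):
    return username

result = get_user(None, "alice")

# inspect.signature follows __wrapped__, so the tool sees the real
# parameters rather than (*args, **kwargs).
sig = str(inspect.signature(get_user))
```

This gives documentation tools Michele's desired view while keeping Guido's point intact: the wrapper still exists and still does its work, and the source remains the authority on what the decorator actually does.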