Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
At 06:52 PM 8/2/2009 +0200, Tarek Ziadé wrote:
> On Wed, Jul 29, 2009 at 6:44 AM, P.J. Eby wrote:
> > At 10:35 PM 7/28/2009 -0500, Ian Bicking wrote:
> >>
> >> On Tue, Jul 28, 2009 at 9:40 PM, P.J. Eby wrote:
> >> > At 09:22 PM 7/28/2009 -0500, Ian Bicking wrote:
> >> >>
> >> >> I can see how this could go quite wrong, but maybe if installers touch
> >> >> some file in the library directory anytime a package is
> >> >> installed/reinstalled/removed/etc,
> >> >
> >> > You mean, like, the mtime of the directory itself? ;-)
> >>
> >> Do directory mtimes get recursively updated? I don't think they do.
> >
> > That's not necessary; if imports use a cached listdir, then the children
> > will get handled recursively.
> >
> >> So if you have a layout:
> >>
> >> site-packages/
> >>   zope/
> >>     interface/
> >>       __init__.py
> >>
> >> And you update the package and update __init__.py, the mtime of
> >> site-packages doesn't change, does it?
> >
> > Nope, but at the top level, the fact that 'zope' is present is unchanged, as
> > is the presence of an 'interface' subdirectory.
> >
> >> I'm saying if there was a file in site-packages/last_updated that gets
> >> touched every time an installer does anything in site-packages, then
> >> you could cache (between processes) the lookups.
> >
> > Since each invocation of the interpreter can have a different PYTHONPATH,
> > the cache has to be per-directory, not global. If it's per-directory, then
> > there's no real benefit over runtime caching, since you now have to open and
> > read a file (instead of just reading the directory). And as I said, it's
> > not realistic to think that opening and reading a file is going to beat
> > opening and reading a directory for speed.
>
> But opening and reading one file should beat opening hundreds of directories:
>
> In the PEP 376 prototype, after thinking about a per-directory cache like you
> are describing, I was thinking about having a global index file to replace the
> global dictionary that keeps track of the distributions per directory
> (currently the directory path is the key in the dictionary and the value the
> distribution objects).
>
> That can even be a simple shelve of the dictionary, that becomes a global index
> of directories that [are/were once] in the path. This works as long as the
> index file is per-user.
>
> Or even better: per-application. I don't know how this could be managed/done,
> but a simple cache file created alongside the script the application is
> launched with, could speed up the lookups at the second launch.

You'd still have to stat the directories to know if they changed - in which case the logic I've already laid out still applies.

I think, however, we are discussing different nominal scenarios. I'm assuming a post-PEP 376 world where the only use for .egg files or directories is for *non-default* versions of packages, that only get added to sys.path for apps or libraries that need them, rather than being in a default .pth file.

However, if you're discussing speeding up an environment where we use .egg directories and they're on sys.path, then a per-user global cache might speed things up. For security reasons, however, that cache would need to be ignored by Python when running secure scripts (e.g. -s and -E options, and definitely anything setuid).

In contrast, directory stat caching with a modest number of (non-egg) PYTHONPATH entries would speed things nicely in the hopefully-future-default case.

___
Distutils-SIG maillist - Distutils-SIG@python.org
http://mail.python.org/mailman/listinfo/distutils-sig
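The exchange above combines two ideas: a persistent index (Tarek suggests a shelve) and a stat of each directory to decide whether its entry is still valid (P.J. Eby's point). A minimal sketch of that combination follows. It is only an illustration under stated assumptions: the function name `load_listing`, the index layout, and the choice to store plain directory listings (rather than the distribution objects of the PEP 376 prototype) are all hypothetical.

```python
import os
import shelve


def load_listing(index_path, directory):
    """Return (names, from_cache) for `directory`.

    Hypothetical helper: the shelve at `index_path` maps an absolute
    directory path to its listing plus the mtime seen when it was recorded.
    """
    directory = os.path.abspath(directory)
    # One stat per directory is still needed to detect changes.
    mtime = os.stat(directory).st_mtime_ns
    with shelve.open(index_path) as index:
        entry = index.get(directory)
        if entry is not None and entry["mtime"] == mtime:
            return entry["names"], True
        # Missing or stale entry: re-read the directory and record it.
        names = sorted(os.listdir(directory))
        index[directory] = {"mtime": mtime, "names": names}
        return names, False
```

Note that shelve is built on pickle, so loading an index that another user could have written is unsafe, which fits the caveat that such a cache would have to be ignored for secure or setuid scripts.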
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On Wed, Jul 29, 2009 at 6:44 AM, P.J. Eby wrote:
> At 10:35 PM 7/28/2009 -0500, Ian Bicking wrote:
>>
>> On Tue, Jul 28, 2009 at 9:40 PM, P.J. Eby wrote:
>> > At 09:22 PM 7/28/2009 -0500, Ian Bicking wrote:
>> >>
>> >> I can see how this could go quite wrong, but maybe if installers touch
>> >> some file in the library directory anytime a package is
>> >> installed/reinstalled/removed/etc,
>> >
>> > You mean, like, the mtime of the directory itself? ;-)
>>
>> Do directory mtimes get recursively updated? I don't think they do.
>
> That's not necessary; if imports use a cached listdir, then the children
> will get handled recursively.
>
>> So if you have a layout:
>>
>> site-packages/
>>   zope/
>>     interface/
>>       __init__.py
>>
>> And you update the package and update __init__.py, the mtime of
>> site-packages doesn't change, does it?
>
> Nope, but at the top level, the fact that 'zope' is present is unchanged, as
> is the presence of an 'interface' subdirectory.
>
>> I'm saying if there was a file in site-packages/last_updated that gets
>> touched every time an installer does anything in site-packages, then
>> you could cache (between processes) the lookups.
>
> Since each invocation of the interpreter can have a different PYTHONPATH,
> the cache has to be per-directory, not global. If it's per-directory, then
> there's no real benefit over runtime caching, since you now have to open and
> read a file (instead of just reading the directory). And as I said, it's
> not realistic to think that opening and reading a file is going to beat
> opening and reading a directory for speed.

But opening and reading one file should beat opening hundreds of directories:

For instance, a Plone 3 application will have 100+ sys.path entries because zc.buildout (the Plone standard) adds one entry per egg in sys.path. So being able to cache 'em should speed things up.

In the PEP 376 prototype, after thinking about a per-directory cache like you are describing, I was thinking about having a global index file to replace the global dictionary that keeps track of the distributions per directory (currently the directory path is the key in the dictionary and the value the distribution objects).

That can even be a simple shelve of the dictionary, that becomes a global index of directories that [are/were once] in the path. This works as long as the index file is per-user.

Or even better: per-application. I don't know how this could be managed/done, but a simple cache file created alongside the script the application is launched with, could speed up the lookups at the second launch.

Cheers
Tarek

--
Tarek Ziadé | http://ziade.org
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On Wed, 29 Jul 2009 09:37:17 +0200, Lennart Regebro wrote:
> 2009/7/29 Jeff Rush:
> > Hi David. Not just your post but others here are making assumptions on
> > your own working environment. Yes there are systems you need to save
> > disk space on, yes there are systems where you care about I/O
> > performance. These are embedded systems.
>
> Exactly. But the fact still is that these systems are the specialized
> case today, so let's stop optimizing the *default* settings for them.
>
> > And the benefit of defaulting to zipped eggs is that it enforces on the
> > developer the discipline of writing his packages to use pkg_resources
> > instead of file I/O
>
> No, it just forces the developer to set zip_safe to False.

+1. Python offers too many convenient ways to do it "wrong". Zipped eggs break deployments. They don't make developers write code that works in that environment. Such code only gets written when developers choose to care about such cases. If you want Python to excel in these areas, you need to convince developers to care.

Jean-Paul
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On 2009-07-29 16:47, David Lyon wrote:
> Anyway, you're kind of biting the hairs on the tail here.. because
> 3rd party packages don't impact the size of the whole python
> installation that much.

My site-packages directory would like a word with you:

[~]$ cd /Library/Frameworks/Python.framework/Versions/Current
[Current]$ du -hsc .
1.5G    .
1.5G    total
[Current]$ du -hsc lib/python2.5/site-packages
1.4G    lib/python2.5/site-packages
1.4G    total

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On Wed, 29 Jul 2009 01:34:11 -0500, Jeff Rush wrote:
> Hi David. Not just your post but others here are making assumptions on
> your own working environment. Yes there are systems you need to save
> disk space on, yes there are systems where you care about I/O
> performance. These are embedded systems.

Maybe you too are making the assumption that I've never worked on such devices.. :-)

I have..

> This attitude of allowing Python to always grow larger is prevalent on
> the core developers list as well, where they are removing the ability to
> compile Python selectively to drop out those portions not needed on a
> platform.

ok. But people want to add their own code.. rarely do they want to take away.. people resist if their code is taken away..

> Pardon the rant. I just get frustrated when people believe that the
> path forward is faster and bigger systems on our desktops when actually
> desktops are dying and will be rare in ten years.

Only because motherboards will be embedded into the monitors more and more often..

Anyway, you're kind of biting the hairs on the tail here.. because 3rd party packages don't impact the size of the whole python installation that much.

Still, it's an interesting point...

David
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
2009/7/29 Jeff Rush:
> Hi David. Not just your post but others here are making assumptions on
> your own working environment. Yes there are systems you need to save
> disk space on, yes there are systems where you care about I/O
> performance. These are embedded systems.

Exactly. But the fact still is that these systems are the specialized case today, so let's stop optimizing the *default* settings for them.

> And the benefit of defaulting to zipped eggs is that it enforces on the
> developer the discipline of writing his packages to use pkg_resources
> instead of file I/O

No, it just forces the developer to set zip_safe to False.

--
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
David Lyon wrote:
>
> Third party libraries are rarely so big that they need to
> be compressed to save disk space.. on any of the systems
> that i know about anyway..

Hi David. Not just your post but others here are making assumptions on your own working environment. Yes there are systems you need to save disk space on, yes there are systems where you care about I/O performance. These are embedded systems.

Python has a strong and growing following on small devices such as phones (OpenMoko), music players, set-top boxes, netbooks, OLPC/XO and such. If you haven't been following it, the Python-on-a-Chip initiative, formed from several projects, took place at PyCon 2009. The language is in a position to become the standard control language for devices, if we don't hobble it by assuming Python is always run on a full-blown desktop.

This attitude of allowing Python to always grow larger is prevalent on the core developers list as well, where they are removing the ability to compile Python selectively to drop out those portions not needed on a platform. The attitude there was if the embedded folks want a stripped down version they can create and maintain it themselves, redoing work already done years ago. But they won't -- they'll take the path of least resistance and choose a more lightweight language.

Pardon the rant. I just get frustrated when people believe that the path forward is faster and bigger systems on our desktops when actually desktops are dying and will be rare in ten years. Let's keep Python lean and flexible so it takes up residence in the infrastructure instead.

And the benefit of defaulting to zipped eggs is that it enforces on the developer the discipline of writing his packages to use pkg_resources instead of file I/O, to retain the future option of alternate packaging formats.

Developers know, especially those using test-driven-development, that if you don't regularly test against an environment, your code will gradually rot and no longer run in that environment.

-Jeff
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
>>P.J. Eby wrote:
>>>So the optimum performance tradeoff depends on how many imports you
>>>have *and* how many eggs you have on sys.path.

Spoken like a true master... and it's imho a real design blunder..

sys.path is meant to contain directories that the interpreter can check for packages.

Adding eggs to sys.path just prioritizes eggs (higher) and means that anytime a package is imported, virtually every egg must be opened to check if it has the appropriate package.

imho it's an abuse of the sys.path to do things this way.

Eggs should sit in site-packages directories like any other package and wait their turn.

.zip/.egg should just be a transport format. The site-package directory should just hold packages of a like format.

Third party libraries are rarely so big that they need to be compressed to save disk space.. on any of the systems that i know about anyway..

David
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
At 10:35 PM 7/28/2009 -0500, Ian Bicking wrote:
> On Tue, Jul 28, 2009 at 9:40 PM, P.J. Eby wrote:
> > At 09:22 PM 7/28/2009 -0500, Ian Bicking wrote:
> >>
> >> I can see how this could go quite wrong, but maybe if installers touch
> >> some file in the library directory anytime a package is
> >> installed/reinstalled/removed/etc,
> >
> > You mean, like, the mtime of the directory itself? ;-)
>
> Do directory mtimes get recursively updated? I don't think they do.

That's not necessary; if imports use a cached listdir, then the children will get handled recursively.

> So if you have a layout:
>
> site-packages/
>   zope/
>     interface/
>       __init__.py
>
> And you update the package and update __init__.py, the mtime of
> site-packages doesn't change, does it?

Nope, but at the top level, the fact that 'zope' is present is unchanged, as is the presence of an 'interface' subdirectory.

> I'm saying if there was a file in site-packages/last_updated that gets
> touched every time an installer does anything in site-packages, then
> you could cache (between processes) the lookups.

Since each invocation of the interpreter can have a different PYTHONPATH, the cache has to be per-directory, not global. If it's per-directory, then there's no real benefit over runtime caching, since you now have to open and read a file (instead of just reading the directory). And as I said, it's not realistic to think that opening and reading a file is going to beat opening and reading a directory for speed.
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On Tue, Jul 28, 2009 at 9:40 PM, P.J. Eby wrote:
> At 09:22 PM 7/28/2009 -0500, Ian Bicking wrote:
>>
>> I can see how this could go quite wrong, but maybe if installers touch
>> some file in the library directory anytime a package is
>> installed/reinstalled/removed/etc,
>
> You mean, like, the mtime of the directory itself? ;-)

Do directory mtimes get recursively updated? I don't think they do. So if you have a layout:

site-packages/
  zope/
    interface/
      __init__.py

And you update the package and update __init__.py, the mtime of site-packages doesn't change, does it?

I'm saying if there was a file in site-packages/last_updated that gets touched every time an installer does anything in site-packages, then you could cache (between processes) the lookups.

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
At 09:22 PM 7/28/2009 -0500, Ian Bicking wrote:
> I can see how this could go quite wrong, but maybe if installers touch
> some file in the library directory anytime a package is
> installed/reinstalled/removed/etc,

You mean, like, the mtime of the directory itself? ;-)

Really, there's no need for a file. It seems really, really unlikely that there's any common filesystem where reading a file containing the (maybe out-of-date) contents of a directory is faster than just reading the directory itself.

And, courtesy of the time machine, there's even a 'dircache' module already in the stdlib. i.e. if you use dircache.listdir() in place of regular listdir, you'll only have to read the directory once. (Another way to do this, of course, would be to have importlib importer objects use the same logic to keep a cache of their target directory.)
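The dircache module mentioned above was removed in Python 3, so the idea of a listing that is re-read only when the directory's own mtime changes has to be spelled out by hand today. The following is a minimal sketch of that idea, not the stdlib implementation; the name `cached_listdir` and the module-level cache dict are assumptions made for illustration.

```python
import os

# Hypothetical in-process cache: path -> (mtime_ns, sorted list of names)
_listing_cache = {}


def cached_listdir(path):
    """Return the listing of `path`, re-reading it only if its mtime changed."""
    mtime = os.stat(path).st_mtime_ns
    entry = _listing_cache.get(path)
    if entry is not None and entry[0] == mtime:
        # Directory unchanged since last read: reuse the stored listing.
        return entry[1]
    names = sorted(os.listdir(path))
    _listing_cache[path] = (mtime, names)
    return names
```

Each call still costs one stat() of the directory, but the listdir() is paid only when an entry has been added, removed, or renamed. Changes inside a subdirectory do not touch the parent's mtime, which is why each directory gets its own cache entry.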
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
On Tue, Jul 28, 2009 at 8:02 PM, Greg Ewing wrote:
> P.J. Eby wrote:
>
>> So the optimum performance tradeoff depends on how many imports you have
>> *and* how many eggs you have on sys.path. If you have lots of eggs and few
>> imports, unzipped ones will probably be faster. If you have lots of eggs
>> and *lots* of imports, zipped ones will probably be faster.
>
> I'm wondering whether something could be gained by
> caching the results of sys.path lookups somehow
> between interpreter invocations.
>
> Most of the time the contents of the directories
> on one's PYTHONPATH don't change, so doing all this
> statting and directory reading every time an
> interpreter starts up seems rather suboptimal.

I can see how this could go quite wrong, but maybe if installers touch some file in the library directory anytime a package is installed/reinstalled/removed/etc, then it would be fast to check if the cache was correct. Though the optimization seems like it's working around something that maybe shouldn't be a problem.

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
Re: [Distutils] Cache PYTHONPATH? (Re: make unzipped eggs be the default)
At 01:02 PM 7/29/2009 +1200, Greg Ewing wrote:
> P.J. Eby wrote:
>> So the optimum performance tradeoff depends on how many imports you have
>> *and* how many eggs you have on sys.path. If you have lots of eggs and few
>> imports, unzipped ones will probably be faster. If you have lots of eggs
>> and *lots* of imports, zipped ones will probably be faster.
>
> I'm wondering whether something could be gained by
> caching the results of sys.path lookups somehow
> between interpreter invocations.
>
> Most of the time the contents of the directories
> on one's PYTHONPATH don't change, so doing all this
> statting and directory reading every time an
> interpreter starts up seems rather suboptimal.

The catch is that then you need some way to know whether your cache information is wrong/out-of-date. I suppose, though, that you could do something like make a file that contains stat times, such that modifying the contained directory would automatically invalidate the cache info.

However, you'd probably gain more by making the core import logic simply use the dircache module (or a C equivalent thereof) in place of stat() calls. This would drop the per-import stat() count for each directory to 1 (in place of several for .py, .pyc, .pyd/.so, /__init__.py, etc.), at the cost of an initial listdir() call the first time a directory is used. This would give normal imports most of the speedup benefit that e.g. putting the stdlib in a zipfile does.
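To make the stat-count argument concrete, here is a rough sketch of a finder that answers "is module X in this directory?" from a single cached listdir() instead of probing each candidate filename. The class name `DirectoryFinder`, the `find` method, and the suffix list are hypothetical and chosen for illustration; this is not the actual core import logic, and cache invalidation is deliberately left out.

```python
import os

# Illustrative subset of the filename suffixes an import would probe.
SUFFIXES = (".py", ".pyc", ".so", ".pyd")


class DirectoryFinder:
    """Hypothetical finder for one sys.path entry, backed by one listdir()."""

    def __init__(self, path):
        self.path = path
        self._names = None  # filled lazily on first lookup

    def _listing(self):
        if self._names is None:
            # The one-time cost: read the directory on first use.
            self._names = frozenset(os.listdir(self.path))
        return self._names

    def find(self, name):
        """Return the file that would satisfy `import name`, or None."""
        names = self._listing()
        if name in names:
            # Possible package: one extra check for its __init__.py.
            init = os.path.join(self.path, name, "__init__.py")
            if os.path.isfile(init):
                return init
        for suffix in SUFFIXES:
            if name + suffix in names:
                return os.path.join(self.path, name + suffix)
        return None
```

With this shape, a module that is absent from the directory costs no system calls at all after the first lookup, which is where most of the saving comes from when sys.path has many entries.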