At 12:03 PM 5/14/2009 +0200, Tarek Ziadé wrote:
> Hello,
> For PEP 376, I have one last fuzzy point.
> http://svn.python.org/view/peps/trunk/pep-0376.txt?view=markup
> The "get_egg_info" API is currently based on scanning the whole
> sys.path. Since sys.path can be modified by people, the algorithm
> is linear and can slow down when there are a lot of paths.
> I have a proposal: let's restrict the search for this API to
> site-packages directories only (directories added with
> site.addsitedir).
> People will be able to add any directory (like the per-user
> site-packages directory - http://www.python.org/dev/peps/pep-0370).
> This requires adding a registry to site.py to keep track of all
> directories added through site.addsitedir.
> Any thoughts?
What tradeoffs are you optimizing for? Note that a single scan of
every directory on sys.path is exactly what happens when an import
doesn't find its target until the *last* directory on sys.path. So
this is not really a big deal if you're only doing it *once*.
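To make the cost concrete, here is a minimal sketch of that single scan. The name find_egg_info is hypothetical (it is not the PEP 376 API), and the sketch assumes metadata shows up as entries ending in ".egg-info" directly inside each directory; zipped eggs and other importers are ignored.

```python
import os
import sys


def find_egg_info(paths=None):
    """Return the .egg-info entries found directly inside each directory.

    Hypothetical helper for illustration only; not part of PEP 376.
    """
    found = []
    for entry in (sys.path if paths is None else paths):
        if not os.path.isdir(entry):
            continue  # skip zip files, missing directories, empty strings
        # One listdir() per path entry: the same order of work as a
        # failed import that has to walk every directory on sys.path.
        for name in sorted(os.listdir(entry)):
            if name.endswith(".egg-info"):
                found.append(os.path.join(entry, name))
    return found
```

That is one os.listdir() call per path entry, and nothing is opened or parsed at this stage.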
If you want to optimize for repeated searches, the best way to do
this is with a structure like pkg_resources' WorkingSet object - it
simply reads the directories once and makes an object for each
installed package. These objects don't do any further I/O, so really
we're just talking about caching a list of .egg-info filenames.
Each object in the set can be queried for its metadata -- in which
case it reads it exactly once, and caches it.
With this setup, the full directory scan is only ever done once --
and it's basically equivalent to adding an extra import at the time
you first import the metadata management module.
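Roughly, the shape I have in mind looks like the sketch below. It is not the pkg_resources code: Dist and DistCache are made-up names standing in for Distribution and WorkingSet, and it assumes "name-version.egg-info" entries that are either a single metadata file or a directory containing PKG-INFO.

```python
import os
import sys


class Dist:
    """One installed project; reads its metadata lazily, at most once."""

    def __init__(self, path):
        self.path = path
        self._metadata = None

    @property
    def metadata(self):
        if self._metadata is None:
            # Assumption: .egg-info is a file, or a directory with PKG-INFO.
            target = self.path
            if os.path.isdir(target):
                target = os.path.join(target, "PKG-INFO")
            with open(target, encoding="utf-8") as f:
                self._metadata = f.read()
        return self._metadata


class DistCache:
    """Scans the given path entries once, on first lookup."""

    def __init__(self, paths=None):
        self.paths = list(sys.path if paths is None else paths)
        self._dists = None

    def _scan(self):
        dists = {}
        for entry in self.paths:
            if not os.path.isdir(entry):
                continue
            for name in os.listdir(entry):
                if name.endswith(".egg-info"):
                    # "foo-1.0.egg-info" -> "foo"; earlier path entries win.
                    project = name[:-len(".egg-info")].split("-")[0].lower()
                    dists.setdefault(project, Dist(os.path.join(entry, name)))
        return dists

    def get(self, project):
        if self._dists is None:
            self._dists = self._scan()  # the only directory scan
        return self._dists.get(project.lower())
```

A module-level DistCache() would be the global default, while an application with its own path management can build a DistCache(paths) of its own and keep it wherever it likes.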
Yes, it does mean a global, unless you want to hand off cache
management to the application. But the way pkg_resources does it,
with WorkingSet and Distribution objects, allows an app with special
needs to do its own path management and search operations.
IOW, this approach keeps simple things simple, and leaves complex
things possible. It also does less I/O than what you're proposing,
since in the normal case the directories are only ever searched once,
and the actual metadata reads are both lazy and cached.
Note, too, that site-packages dirs are likely to have more packages
in them than other directories, which means you're not necessarily
saving much I/O to start with, and even that small saving evaporates
as soon as you do more than one lookup for plugins.
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
http://mail.python.org/mailman/listinfo/distutils-sig