On 01.03.2013 11:19, holger krekel wrote:
> Hi Richard, all,
>
> somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
> script which takes a project name as an argument and then goes to
> pypi.python.org and removes all homepage/download metadata entries for
> this project. This sanitizes/speeds up installation because
> pip/easy_install don't need to crawl them anymore. I just did this for
> three of my projects, (pytest, tox and py) and it seems to work fine.

Does it also clean up the links that PyPI adds to the /simple/ index
pages by parsing the project description for links?

I think those are far nastier than the homepage and download links,
which can be put to some good use to limit the external lookups
(see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal).

See e.g. https://pypi.python.org/simple/zc.buildout/ for a good
example of the mess this generates: even mailto links get listed,
and "file:///" links open up the installers to all kinds of nasty
things (unless they explicitly protect against following these).

> Now before i release this as a tool, i wonder: Is it a good idea to remove
> download/homepage entries? Is there any current machine use (other than
> the dreaded crawling) for the homepage/download_url per-release metadata
> fields?
>
> For humans the homepage link is nicely discoverable if the long-description
> doesn't mention it prominently. But i think there also is a "project url"
> or "bugtrack url" for a project so maybe those could be used to reference
> these important pages? (i am a bit confused on the exact meaning of those
> urls, btw).
>
> Should we maybe stop advertising "homepage" and "download_url"
> and instead see to extend project-url/bugtrackurl to be used
> and shown nicely? The latter are independent of releases which i think
> makes sense - what use are old probably unreachable/borked homepages
> anyway. And it's also not too bad having to go once to pypi.python.org
> to set it, usually it seldomly changes.

I think it would be better to differentiate between showing the
fields on the project pages, where they provide useful resources
for people, and their use on the /simple/ index pages, which are
meant for programs to parse.

IMO, the homepage and download links on the project pages are
indeed very useful for people. On the /simple/ index, a homepage
link is probably not all that useful (provided a download link
is set).
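
To illustrate the kind of protection against mailto and "file:///"
links mentioned above, here is a minimal sketch of what an installer
could do before following anything it finds on a /simple/ page. This
is not what pip or easy_install actually do; the names LinkCollector,
followable_links and ALLOWED_SCHEMES are made up for illustration,
and restricting the crawl to http/https is my assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: only these schemes are worth following from an index page.
ALLOWED_SCHEMES = {"http", "https"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def followable_links(html, page_url):
    """Return absolute links from `html` whose scheme is allowed.

    Relative hrefs are resolved against `page_url` first, so they
    inherit the scheme of the index page itself. Links such as
    mailto: or file:/// keep their own scheme and are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    resolved = (urljoin(page_url, href) for href in collector.hrefs)
    return [url for url in resolved
            if urlparse(url).scheme.lower() in ALLOWED_SCHEMES]
```

Calling followable_links(page_html, "https://pypi.python.org/simple/zc.buildout/")
would then hand back only the http/https links for further inspection.
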
The download links serve the purpose of directing tools to the
right location, so those do belong on the /simple/ index listings.

I'd completely remove the links parsed from the descriptions,
since those don't really provide a good basis for crawling (the
description is meant for humans to parse, not programs).

-- 
Marc-Andre Lemburg
eGenix.com