I think we all agree that scanning arbitrary HTML pages for download
links is not a good idea, and that we need to transition away from it
towards a more reliable system.
Here's an approach that would work to start the transition while not
breaking old tools (sketching here to describe the basic idea):

Limiting scans to download_url
------------------------------

Installers and similar tools preferably no longer scan all the links
on the /simple/ index, but instead only look at the download links
(which can be defined in the package meta data) for packages that
don't host files on PyPI.

Going only one level deep
-------------------------

If the download links point to a meta-file named
"<packagename>-<version>-downloads.html#<sha256-hashvalue>", the
installers download that file, check whether the hash value matches,
and if it does, scan the file in the same way they would parse the
/simple/ index page of the package - think of the downloads.html file
as a symlink that extends the search to an external location, but in
a predefined and safe way.

Comments
--------

* The creation of the downloads.html file is left to the package
  owner (we could provide a tool to easily create it).

* Since the file would use the same format as the PyPI /simple/ index
  directory listing, installers would be able to verify the embedded
  hash values (and later GPG signatures) just as they do for files
  hosted directly on PyPI.

* The URL of the downloads.html file, together with the hash fragment,
  would be placed into the setup.py download_url variable. This is
  supported by all recent and not so recent Python versions.

* No changes to the distutils shipped with older Python versions are
  necessary to make this work, since the download_url field is a free
  form field.

* No changes to existing distutils meta data formats are necessary,
  since the download_url field has always been meant for download URLs.

* Installers would not need to learn a new meta data format, because
  they already know how to parse PyPI style index listings.

* Installers would prefer the above approach for downloads, and warn
  users if they have to fall back to the old method of scanning all
  links.
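To illustrate the fetch-and-verify step, here's a rough sketch of what
an installer could do with such a link. The helper name and the exact
fragment syntax (bare hex digest, optionally with a "sha256=" prefix)
are my assumptions; hashlib and urllib are just the stdlib:

```python
import hashlib
import urllib.parse
import urllib.request

def fetch_downloads_page(download_url):
    """Fetch a <packagename>-<version>-downloads.html file and verify
    its sha256 hash against the URL fragment before trusting any of
    the links it contains.
    """
    parts = urllib.parse.urlsplit(download_url)
    expected = parts.fragment
    # Accept both "#<hexdigest>" and "#sha256=<hexdigest>" (assumption)
    if expected.startswith("sha256="):
        expected = expected[len("sha256="):]
    if not expected:
        raise ValueError("download_url carries no hash fragment; refusing")
    # Fetch the file itself, without the fragment
    url = urllib.parse.urlunsplit(parts._replace(fragment=""))
    data = urllib.request.urlopen(url).read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError("hash mismatch: %s != %s" % (actual, expected))
    # Now safe to parse like a /simple/ index page
    return data
```

If the hash doesn't match, the installer refuses the file and can warn
the user, instead of silently following unverified links.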
* Installers could impose extra security requirements, such as only
  following HTTPS links and verifying all certificates.

* In a later phase of the transition, we could have PyPI cache the
  referenced distribution files locally to improve reliability. This
  would turn the push strategy for uploading files to PyPI into a
  pull strategy for those packages and make things a lot easier to
  handle for package maintainers.

What do you think ?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 28 2013)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::::: Try our mxODBC.Connect Python Database Interface for free ! ::::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

_______________________________________________
Catalog-SIG mailing list
[email protected]
http://mail.python.org/mailman/listinfo/catalog-sig
