On 5 October 2012 15:37, Vinay Sajip <vinay_sa...@yahoo.co.uk> wrote:
>> PS If you want to start over-engineering the flexibility, users should
>> have a way of choosing whether to use the webscraper or XMLRPC
>> interfaces to PyPI. The former finds more packages (as I understand
>> it) whereas the latter is much faster. As someone who's never needed a
>> package that can't be found using both interfaces (or neither) I
>
> Is that really the case? I'd assumed that the simple pages were
> generated from the package database created from uploads to PyPI, so I
> would have expected querying the XML-RPC interface to produce the same
> results as scraping the HTML (allowing for the possibility that, if
> the HTML pages are generated periodically as static files from the
> database, they might be stale at times).

Well, yes. But the static files don't make it easy to distinguish the
different categories of link they contain (see below).

> I thought that pip needed to scrape pages because people host
> distribution archives on servers other than PyPI (e.g. Google Code,
> GitHub, Bitbucket or their own servers), with the links to those
> archives navigable through e.g. the "dependency_links" argument to
> setup(), or the URLs mentioned in the PyPI metadata.

I don't know how true that is these days - I don't think I've ever
personally encountered a package that wasn't either available directly
from the PyPI download URL (release_urls() in the XMLRPC interface) or
simply not installable via pip at all. But the range of packages I've
tried is fairly limited...

The static pages merge all of the following information:

1. The download URLs you can get from the XMLRPC release_urls call, but
with all releases covered in a single place.
2. The download_url value from release_data, also available via the
XMLRPC interface.
3. Other URLs from release_data (home_page, project_url).

The first category is fine, as those URLs point directly to files. The
second is often a file too, and seems to frequently duplicate the
first; I'm not sure how useful it is. The third usually points to a
further web page - I presume that's what you plan to scrape. That's
where the problem lies, though: at least some of those links time out
(lxml's does, IIRC) and, as I say, I don't know of a case where
following them is actually worth doing.
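
To be concrete, the three categories above correspond to XML-RPC calls
along these lines - a rough, untested sketch using Python 3's
xmlrpc.client (xmlrpclib on 2.x), where 'lxml' and '3.0' are just
arbitrary example values:

    import xmlrpc.client

    # Connect to PyPI's XML-RPC interface.
    client = xmlrpc.client.ServerProxy('http://pypi.python.org/pypi')

    pkg, ver = 'lxml', '3.0'  # any project/version pair

    # Category 1: the files actually hosted on PyPI for this release.
    for info in client.release_urls(pkg, ver):
        print(info['url'])

    # Categories 2 and 3: metadata fields for the same release.
    data = client.release_data(pkg, ver)
    print(data.get('download_url'))  # often duplicates a release_urls entry
    print(data.get('home_page'))     # usually a web page, not a file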

But this is based on very superficial and limited experience. I'll
happily bow to better information.

On the other hand, is manually parsing the static page actually any
faster in practice than using XMLRPC?
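
For what it's worth, "manually parsing" the static page is also only a
few lines of stdlib code - roughly something like this (again untested;
note that all three categories of URL come back as one undifferentiated
list of hrefs, which is exactly the problem):

    import urllib.request
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect the href of every <a> tag on a /simple/ page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links.append(value)

    page = urllib.request.urlopen('http://pypi.python.org/simple/lxml/')
    html = page.read().decode('utf-8')
    page.close()

    parser = LinkExtractor()
    parser.feed(html)
    print(parser.links)  # file links, home pages etc., all merged together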

Paul.