On Aug 7, 2007, at 2:06 PM, Martin v. Löwis wrote:
I hope I have now solved the overload problem that massive crawling has caused for the wiki and, in consequence, caused the PyPI outage. Following Laura's advice, I added Crawl-delay to robots.txt. Several robots have picked that up: not just msnbot and slurp, but also e.g. MJ12bot. For the others, I had to fine-tune my throttling code, after observing that the expensive URLs are those with a query string. Those now count as 3 regular queries (I might have to bump this to 5), so you can only issue one of them every 6s.
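For anyone following along, the Crawl-delay directive goes in robots.txt like so (the 10-second value here is only illustrative; I don't know what value was actually deployed):

    User-agent: *
    Crawl-delay: 10

And the throttling Martin describes sounds like a token bucket in which query-string URLs cost 3 units. A rough Python sketch consistent with the numbers he quotes (one expensive hit per 6s implies a refill rate of one regular query per 2s) -- this is not his actual code, just an illustration:

    import time

    # Illustrative values; the real deployment may differ.
    COST_PLAIN = 1            # a regular URL
    COST_QUERY = 3            # URLs with a query string count as 3 regular queries
    RATE = 0.5                # units refilled per second => one expensive hit per 6s
    BUCKET_MAX = COST_QUERY   # headroom for exactly one expensive request

    class Throttle:
        """Per-client token bucket; allow() returns False when a request
        should be rejected (or delayed)."""

        def __init__(self):
            self.level = BUCKET_MAX
            self.last = time.time()

        def allow(self, url):
            now = time.time()
            # Refill the bucket based on elapsed time, capped at BUCKET_MAX.
            self.level = min(BUCKET_MAX, self.level + (now - self.last) * RATE)
            self.last = now
            cost = COST_QUERY if '?' in url else COST_PLAIN
            if self.level >= cost:
                self.level -= cost
                return True
            return False

With these numbers, a fresh client can make one query-string request immediately, then has to wait 6s for the bucket to refill, while plain URLs get through every 2s.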
I don't suppose there are enough resources to put PyPI on a separate box entirely, so that whatever else is running (the wiki, etc.) can't drag down the package repository?
On a side note, has anyone looked into a CDN for packages, to speed up delivery and take more of the traffic load off the PyPI host? That would also lower the bar for other sites that wanted to mirror PyPI, since they wouldn't have to host all the actual eggs as well.
Cheers, Ben
