In a message of Sat, 04 Aug 2007 09:42:45 +0200, "Martin v. Löwis" writes:
>> If they do not respect them, then you can use this program:
>> http://danielwebb.us/software/bot-trap/ to catch them.
>> If you are doing this, Martin, use the German version instead:
>> http://www.spider-trap.de/
>> because it has a few useful additions. I forget what now.
>>
>> Most scrapers, these days, respect robots.txt which will make this
>> program useless for catching them. But some days you can get lucky.
>
>That would also be an idea. I'll see how the throttling works out;
>if it fails (either because it still gets overloaded - which shouldn't
>happen - or because legitimate users complain), I'll try that one.
>
>Regards,
>Martin
Pardon this completely useless quoting of irrelevant text, but I tried
just telling catalog-sig to go read this URL

http://search.msn.com.my/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm&FORM=WFDD#D

and check "MSNBot is crawling my site too frequently", and I got
"suspicious header", which is what all the python.org groups say when
they think you are sending them spam (and not in the header). So if
your text is basically a URL, and you want to send it to a python.org
group, you are screwed. So I found an article and replied to it.

Go read that. I think it says that we could set our crawl delay to some
number -- why 120 I have no clue -- and the spider will be made to
behave. Or possibly we can hack the bot trap for those that do not
respect crawl-delay. At any rate, it seems relevant to our problem.

Laura
_______________________________________________
Catalog-SIG mailing list
[email protected]
http://mail.python.org/mailman/listinfo/catalog-sig
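
For anyone wanting to try the crawl-delay idea from the message above, here is a minimal sketch of what such a robots.txt might look like, and how to check that it parses as intended. The robots.txt contents are an assumption for illustration, not copied from the MSN FAQ; the 120 is only the figure mentioned in the message, and the helper name crawl_delay_for is hypothetical. Crawl-delay is a non-standard extension, so bots that ignore it will not be slowed by this.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: ask msnbot to wait 120 seconds between requests,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 120

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


msnbot_delay = crawl_delay_for(ROBOTS_TXT, "msnbot")
other_delay = crawl_delay_for(ROBOTS_TXT, "SomeOtherBot")
print("msnbot:", msnbot_delay)
print("SomeOtherBot:", other_delay)
```

This only verifies what a well-behaved crawler would read from the file. Bots that do not respect it would still need something like the bot trap mentioned earlier in the thread.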
