Re: [Tutor] What can I do if I'm banned from a website??
On Wed, Oct 10, 2012 at 4:35 PM, Benjamin Fishbein bfishbei...@gmail.com wrote:

> I've been scraping info from a website with a url program I wrote. But now I can't open their webpage, no matter which web browser I use. I think they've somehow blocked me.
> How can I get back in? Is it a temporary block? And can I get in with the same computer from a different wifi?
> ___
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

A few thoughts:

1. Try using a proxy.
2. Ask the webadmin (nicely) to unban you.
3. Use requests / urllib3 to use connection pooling.
4. See if the site has an API designed for data extraction.
Re: [Tutor] What can I do if I'm banned from a website??
How could someone know enough to write their own web-scraping program and NOT know that this is not about Python or how to get around this problem?
Re: [Tutor] What can I do if I'm banned from a website??
On 10/10/2012 21:35, Benjamin Fishbein wrote:

> I've been scraping info from a website with a url program I wrote. But now I can't open their webpage, no matter which web browser I use. I think they've somehow blocked me.
> How can I get back in? Is it a temporary block? And can I get in with the same computer from a different wifi?

Grovel? Bribery? Threaten violence? Don't break their Ts & Cs in the first place? And what has this got to do with Python?

--
Cheers.

Mark Lawrence.
Re: [Tutor] What can I do if I'm banned from a website??
Hi

On 10 October 2012 21:35, Benjamin Fishbein bfishbei...@gmail.com wrote:

> I've been scraping info from a website with a url program I wrote. But now I can't open their webpage, no matter which web browser I use. I think they've somehow blocked me.
> How can I get back in? Is it a temporary block? And can I get in with the same computer from a different wifi?

Hard to know for certain what they've done; perhaps they've blocked your IP. You can try connecting from another IP and see if that works.

2 points:

1) If you're going to be scraping websites, you should always play nice with the web-server -- throttle your requests (put some random delay between them) so they don't hammer the web-server too hard. Not doing this will enrage any webmaster. He'll be very quick to figure out why his website's being hammered, from where (the IP) and then block you. You'd probably do the same if you ran a website and you noticed some particular IP hammering your site.

2) You should ideally always respect websites' wishes regarding bots and scraping. If they don't want automated bots to be scraping them then you should really not scrape that site. And if you're going to disregard their wishes and scrape it anyway (not recommended), then all bets are off and you'll have to fly under the radar and ensure that your scraping app looks as much like a browser as possible (probably using modified headers that look like what a browser will send) and behaves as much like a human operator driving a browser as possible, or you'll find yourself blocked as you've experienced above.

Walter
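The throttling advice in point 1 can be sketched roughly as below. This is only an illustration, not code from the thread: the names `polite_fetch_all`, `fetch`, `USER_AGENT` and the 2 to 10 second delay range are all assumptions, and a sensible delay depends on the site.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Tuple
from urllib.request import Request, urlopen

# Hypothetical header value, not taken from the thread.
USER_AGENT = "example-scraper/0.1"


def fetch(url: str) -> bytes:
    """Download one URL, sending an explicit User-Agent header."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], bytes] = fetch,
    min_delay: float = 2.0,   # assumed lower bound, in seconds
    max_delay: float = 10.0,  # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs, pausing a random time between requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive in a tight, regular burst.
            sleep(random.uniform(min_delay, max_delay))
        yield url, fetch_one(url)
```

Passing `fetch_one` and `sleep` as parameters is just a convenience so the loop can be tried out with stand-ins before it is pointed at a real server.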
Re: [Tutor] What can I do if I'm banned from a website??
On 11/10/12 07:35, Benjamin Fishbein wrote:

> I've been scraping info from a website with a url program I wrote. But now I can't open their webpage, no matter which web browser I use. I think they've somehow blocked me.
> How can I get back in? Is it a temporary block?

How the hell would we know??? Ask the people running the web site.

If you have been breaking the terms and conditions of the web site, you could have broken the law (computer trespass). I don't say this because I approve of or agree with the law, but when you scrape websites with anything other than a browser, that's the chance you take.

> And can I get in with the same computer from a different wifi?

*rolls eyes*

You've been blocked once. You want to get blocked again?

A lot of this depends on what the data is, why it is put on the web in the first place, and what you intend doing with it.

Wait a week and see if the block is undone. Then:

* If the web site gives you an official API for fetching data, USE IT.

* If not, keep to their web site T&C. If the T&C allows scraping under conditions (usually something on the lines of limiting how fast you can scrape, or at what times), OBEY THOSE CONDITIONS and don't be selfish.

* If you think the webmaster will be reasonable, ask permission first. (I don't recommend that you volunteer the information that you were already blocked once.) If he's not a dick, he'll probably say yes, under conditions (again, usually to do with time and speed).

* If you insist on disregarding their T&C, don't be a dick about it. Always be an ethical scraper. If the police come knocking, at least you can say that you tried to avoid any harm from your actions. It could make the difference between jail and a good behaviour bond.

  - Make sure you download slowly: pause for at least a few seconds between each download, or even a minute or three.

  - Limit the rate that you download: you might be on high speed ADSL2, but the faster you slurp files from the website, the less bandwidth they have for others.

  - Use a cache so you aren't hitting the website again and again for the same files.

  - Obey robots.txt.

Consider using a random pause of between (say) 0 and 90 seconds between downloads to more accurately mimic a human using a browser.

Also consider changing your user-agent. Ethical scraping suggests putting your contact details in the user-agent string. Defensive scraping suggests mimicking Internet Explorer as much as possible.

More about ethical scraping:

http://stackoverflow.com/questions/4384493/how-can-i-ethically-and-legally-scrape-data-from-a-public-web-site

--
Steven
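Three of the suggestions above (obey robots.txt, use a cache, put contact details in the user-agent) might look something like the following with only the standard library. Treat it as a sketch under assumptions: `CachingFetcher`, `load_robots` and the `USER_AGENT` string are hypothetical names made up for illustration, and the cache is a plain in-memory dict that is lost when the program exits.

```python
from typing import Callable, Dict, Iterable, Optional
from urllib.robotparser import RobotFileParser

# Hypothetical identity: replace with your own bot name and contact details.
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def load_robots(lines: Iterable[str]) -> RobotFileParser:
    """Build a parser from the lines of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


class CachingFetcher:
    """Fetch each allowed URL at most once and refuse disallowed ones."""

    def __init__(self, robots: RobotFileParser, fetch_one: Callable[[str], bytes]):
        self.robots = robots
        self.fetch_one = fetch_one
        self.cache: Dict[str, bytes] = {}

    def get(self, url: str) -> Optional[bytes]:
        if not self.robots.can_fetch(USER_AGENT, url):
            return None  # robots.txt asks bots to stay away from this path
        if url not in self.cache:
            self.cache[url] = self.fetch_one(url)  # only hit the site on a miss
        return self.cache[url]
```

Against a live site, the parser would more likely be filled with `RobotFileParser.set_url()` followed by `read()`, and `fetch_one` would be a function that sends `USER_AGENT` in its request headers and sleeps between downloads.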