Re: [Tutor] What can I do if I'm banned from a website??

2012-10-10 Thread James Reynolds
On Wed, Oct 10, 2012 at 4:35 PM, Benjamin Fishbein bfishbei...@gmail.com wrote:

 I've been scraping info from a website with a url program I wrote. But now
 I can't open their webpage, no matter which web browser I use. I think
 they've somehow blocked me. How can I get back in? Is it a temporary block?
 And can I get in with the same computer from a different wifi?

 ___
 Tutor maillist  -  Tutor@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor


A few thoughts:

1. Try using a proxy.
2. Ask the webadmin (nicely) to unban you
3. Use requests / urllib3 to use connection pooling
4. See if the site has an API designed for data extraction.
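
A minimal sketch of point 3, assuming the third-party requests library is
installed. A requests.Session keeps connections open and reuses them through
urllib3's pool instead of opening a new one per request. The helper name
make_session and the pool size of 2 are illustrative choices, not anything
from the original post.

```python
import requests
from requests.adapters import HTTPAdapter


def make_session(pool_size=2):
    """Return a Session that reuses a small pool of connections per host."""
    session = requests.Session()
    # HTTPAdapter wraps urllib3's connection pool. A small pool keeps the
    # number of simultaneous connections to the site low.
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Typical use would be session = make_session() followed by repeated
session.get(url, timeout=10) calls, so that requests to the same host share
connections.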


Re: [Tutor] What can I do if I'm banned from a website??

2012-10-10 Thread c smith
how could someone know enough to write their own web-scraping program and
NOT know that this is not about python or how to get around this problem?


Re: [Tutor] What can I do if I'm banned from a website??

2012-10-10 Thread Mark Lawrence

On 10/10/2012 21:35, Benjamin Fishbein wrote:

I've been scraping info from a website with a url program I wrote. But now I 
can't open their webpage, no matter which web browser I use. I think they've 
somehow blocked me. How can I get back in? Is it a temporary block? And can I 
get in with the same computer from a different wifi?




Grovel?  Bribery?  Threaten violence?  Don't break their Ts & Cs in the
first place? And what has this got to do with Python?


--
Cheers.

Mark Lawrence.



Re: [Tutor] What can I do if I'm banned from a website??

2012-10-10 Thread Walter Prins
Hi

On 10 October 2012 21:35, Benjamin Fishbein bfishbei...@gmail.com wrote:
 I've been scraping info from a website with a url program I wrote. But now I 
 can't open their webpage, no matter which web browser I use. I think they've 
 somehow blocked me. How can I get back in? Is it a temporary block? And can I 
 get in with the same computer from a different wifi?

Hard to know for certain what they've done, perhaps they've blocked
your IP. You can try connecting from another IP and see if that works.

2 points:
1) If you're going to be scraping websites, you should always play
nice with the web-server -- throttle your requests (put some random
delay between them) so they don't hammer the web-server too hard.  Not
doing this will enrage any webmaster.  He'll be very quick to figure
out why his website's being hammered, from where (the IP) and then
block you.  You'd probably do the same if you ran a website and you
noticed some particular IP hammering your site.
2) You should ideally always respect a website's wishes regarding bots
and scraping.   If they don't want automated bots to be scraping them
then you should really not scrape that site.  And if you're going to
disregard their wishes and scrape it anyway (not recommended), then
all bets are off and you'll have to fly under the radar and ensure
that your scraping app looks as much like a browser as possible
(probably using modified headers that look like what a browser will
send) and behaves as much like a human operator driving a browser as
possible, or you'll find yourself blocked as you've experienced above.
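
The throttling in point 1 could be sketched roughly as follows. This is only
an illustration: the function name throttled and the 2 to 10 second range are
assumptions, and the right delay depends on the site.

```python
import random
import time


def throttled(urls, min_delay=2.0, max_delay=10.0, sleep=time.sleep):
    """Yield each URL, pausing a random interval before all but the first."""
    for index, url in enumerate(urls):
        if index > 0:
            # A random delay avoids hitting the server at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        yield url
```

A scraping loop would then read "for url in throttled(page_urls): ..." and do
the download inside the loop body. The sleep parameter is there only so the
pause can be replaced when testing.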

Walter


Re: [Tutor] What can I do if I'm banned from a website??

2012-10-10 Thread Steven D'Aprano

On 11/10/12 07:35, Benjamin Fishbein wrote:


I've been scraping info from a website with a url program I wrote. But
now I can't open their webpage, no matter which web browser I use. I
think they've somehow blocked me. How can I get back in? Is it a
temporary block?


How the hell would we know??? Ask the people running the web site.

If you have been breaking the terms and conditions of the web site, you
could have broken the law (computer trespass). I don't say this because
I approve of or agree with the law, but when you scrape websites with
anything other than a browser, that's the chance you take.



And can I get in with the same computer from a different wifi?


*rolls eyes*

You've been blocked once. You want to get blocked again?

A lot of this depends on what the data is, why it is put on the web in
the first place, and what you intend doing with it.

Wait a week and see if the block is undone. Then:

* If the web site gives you an official API for fetching data, USE IT.

* If not, keep to their web site T&C. If the T&C allows scraping under
  conditions (usually something on the lines of limiting how fast you can
  scrape, or at what times), OBEY THOSE CONDITIONS and don't be selfish.

* If you think the webmaster will be reasonable, ask permission first.
  (I don't recommend that you volunteer the information that you were
  already blocked once.) If he's not a dick, he'll probably say yes,
  under conditions (again, usually to do with time and speed).

* If you insist on disregarding their T&C, don't be a dick about it.
  Always be an ethical scraper. If the police come knocking, at least
  you can say that you tried to avoid any harm from your actions. It
  could make the difference between jail and a good behaviour bond.

  - Make sure you download slowly: pause for at least a few seconds
    between each download, or even a minute or three.

  - Limit the rate that you download: you might be on high speed ADSL2,
    but the faster you slurp files from the website, the less bandwidth
    they have for others.

  - Use a cache so you aren't hitting the website again and again for
    the same files.

  - Obey robots.txt.
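
As a rough sketch of the last two points (cache, robots.txt), the standard
library's urllib.robotparser can check a URL against the site's rules before
anything is downloaded. The names USER_AGENT, build_robot_rules and
CachingFetcher are hypothetical, and the in-memory dict is the simplest
possible cache, not a recommendation for large jobs.

```python
import urllib.robotparser

USER_AGENT = "my-scraper"  # hypothetical bot name


def build_robot_rules(robots_txt_lines):
    """Parse the lines of a robots.txt file into a rule checker."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


class CachingFetcher:
    """Fetch each allowed URL at most once and remember the result."""

    def __init__(self, rules, download, user_agent=USER_AGENT):
        self.rules = rules          # a parsed RobotFileParser
        self.download = download    # any callable taking a URL
        self.user_agent = user_agent
        self.cache = {}

    def fetch(self, url):
        # Refuse anything robots.txt disallows for this user agent.
        if not self.rules.can_fetch(self.user_agent, url):
            raise PermissionError("robots.txt disallows " + url)
        # Only hit the website when the page is not already cached.
        if url not in self.cache:
            self.cache[url] = self.download(url)
        return self.cache[url]
```

In real use the rules would come from the site's own robots.txt (for example
via RobotFileParser.set_url() and read()), and download would be whatever
function performs the actual, suitably delayed, request.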


Consider pausing for a random interval of (say) 0 to 90 seconds between
downloads to more accurately mimic a human using a browser. Also
consider changing your user-agent. Ethical scraping suggests putting
your contact details in the user-agent string. Defensive scraping
suggests mimicking Internet Explorer as much as possible.
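
For the ethical variant, a sketch using only the standard library might look
like this. The bot name and contact address in USER_AGENT are placeholders,
and build_request and human_pause are illustrative helper names.

```python
import random
import time
import urllib.request

# Hypothetical bot name and contact address; replace with your own.
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the scraper in its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def human_pause(low=0.0, high=90.0, sleep=time.sleep):
    """Sleep for a random time between low and high seconds; return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Each download would then be urllib.request.urlopen(build_request(url)),
with a call to human_pause() between one download and the next.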

More about ethical scraping:

http://stackoverflow.com/questions/4384493/how-can-i-ethically-and-legally-scrape-data-from-a-public-web-site



--
Steven