On Wed, May 18, 2011 at 7:32 PM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> ===> Thanks for your reply. I tried wget, which seems to be a very handy
> tool. However, it doesn't work on this particular site. I tried
> wget -e robots=off -r -nc --no-parent -l6 -A.pdf
> 'http://www.landelijkregisterkinderopvang.nl/'
> (the quotes are there because I originally used a deeper link that
> contains ampersands). I also tested it on python.org, where it does work.
> Adding -e robots=off didn't work either. Do you think this could be a
> protection from the administrator?
>
wget works by recursively following hyperlinks from the page you supply. The
page you entered leads only to a search form (which wget doesn't know how to
fill out) and nothing else, so wget never sees links to any of the pdf
documents and cannot retrieve them.

I think your best approach is the brute-force id generation you mentioned
earlier. Be polite about it: download only one pdf at a time, wait a second
or two after each download, and back off for a few seconds after several
consecutive failed attempts. Just don't flood the server.
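Something along these lines might do, as an untested sketch. The URL pattern
and the id range here are pure guesses on my part; you'd have to open one pdf
in your browser and copy its real address to find the actual pattern:

import time
import urllib.error
import urllib.request

# Hypothetical URL pattern -- discover the real one by inspecting the
# address of a single pdf in your browser.
URL_TEMPLATE = "http://www.landelijkregisterkinderopvang.nl/some/path?id=%d"

consecutive_failures = 0
for doc_id in range(1, 10001):  # the id range is a guess, too
    try:
        with urllib.request.urlopen(URL_TEMPLATE % doc_id) as response:
            data = response.read()
    except urllib.error.URLError:
        consecutive_failures += 1
        if consecutive_failures >= 4:
            time.sleep(5)  # back off after several failed attempts in a row
            consecutive_failures = 0
        continue
    consecutive_failures = 0
    with open("doc_%d.pdf" % doc_id, "wb") as f:
        f.write(data)
    time.sleep(2)  # pause between downloads; don't flood the server

The sleeps are the important part: a missing id costs the server little, but
thousands of rapid-fire requests will get your address blocked.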