On Wed, May 18, 2011 at 7:32 PM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> ===> Thanks for your reply. I tried wget, which seems to be a very handy
> tool. However, it doesn't work on this particular site. I tried
> wget -e robots=off -r -nc --no-parent -l6 -A.pdf
> 'http://www.landelijkregisterkinderopvang.nl/'
> (the quotes are there because I originally used a deeper link that
> contains ampersands). I also tested it on python.org, where it does work.
> Adding -e robots=off didn't work either. Do you think this could be a
> protection from the administrator?
>
wget works by recursively following hyperlinks from the page you supply. The
page you entered leads only to a search form (which wget doesn't know how to
fill out) and nothing else, so wget never sees links to any of the pdf
documents and cannot retrieve them.

I think your best approach is the brute-force id generation you mentioned
earlier. Be polite about it: download only one pdf at a time, wait a second
or two after each download, and back off for a few seconds after several
consecutive failed attempts. Just don't flood the server.
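Something along these lines might do, as an untested sketch. The URL pattern
and the id range here are pure guesses on my part; you'd have to open one pdf
in your browser and copy its real address to find the actual pattern:

import time
import urllib.error
import urllib.request

# Hypothetical URL pattern -- discover the real one by inspecting the
# address of a single pdf in your browser.
URL_TEMPLATE = "http://www.landelijkregisterkinderopvang.nl/some/path?id=%d"

consecutive_failures = 0
for doc_id in range(1, 10001):  # the id range is a guess, too
    try:
        with urllib.request.urlopen(URL_TEMPLATE % doc_id) as response:
            data = response.read()
    except urllib.error.URLError:
        consecutive_failures += 1
        if consecutive_failures >= 4:
            time.sleep(5)  # back off after several failed attempts in a row
            consecutive_failures = 0
        continue
    consecutive_failures = 0
    with open("doc_%d.pdf" % doc_id, "wb") as f:
        f.write(data)
    time.sleep(2)  # pause between downloads; don't flood the server

The sleeps are the important part: a missing id costs the server little, but
thousands of rapid-fire requests will get your address blocked.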