I'm a novice Python programmer, and I've been looking for a way to
collect archived web pages. I would like to use the data on Internet
Archive, via the "Wayback Machine". Look, for example, at
http://web.archive.org/web/*/http://www.python.org . I'd like to crawl
down the first few levels of links from each of the updated archived pages
(the ones with asterisks next to them). The site's robots.txt excludes
crawlers entirely, so a straightforward screen-scraping strategy doesn't
seem doable.
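For context, the link-extraction step I have in mind looks something like
this minimal sketch using only the standard library's html.parser (the
sample HTML below is made up for illustration; in real use the page would
be fetched with urllib.request from a timestamped snapshot URL such as
http://web.archive.org/web/<timestamp>/<url>):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Illustrative snippet, standing in for a page fetched from the archive.
sample = '<html><body><a href="/web/2006/http://www.python.org">archived</a></body></html>'
parser = LinkCollector()
parser.feed(sample)
print(parser.links)  # → ['/web/2006/http://www.python.org']
```

Repeating that on each collected link, a few levels deep, is the part I'm
unsure how to do politely given the robots.txt restrictions.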


Does anyone have any suggestions for a way to go about this
pythonically? 


Many thanks,

Nick


_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor