I'm a novice Python programmer looking for a way to collect archived web pages. I would like to use the data on the Internet Archive, via the Wayback Machine; see, for example, http://web.archive.org/web/*/http://www.python.org. I'd like to crawl down the first few levels of links of each of the updated archived pages (the ones with *'s next to them). The site's robots.txt excludes crawlers entirely, so a screen-scraping strategy doesn't seem doable.
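For what it's worth, one route that avoids screen-scraping altogether is the Wayback Machine's public CDX query API, which returns capture metadata (timestamps, original URLs) as JSON. A minimal sketch follows; the endpoint and record fields come from the public CDX API, while `cdx_query_url` and `list_captures` are just illustrative helper names:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX endpoint for listing Wayback Machine captures of a URL.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(target, limit=10):
    """Build a CDX API query URL asking for JSON output, capped at `limit` rows."""
    params = {"url": target, "output": "json", "limit": limit}
    return CDX_ENDPOINT + "?" + urlencode(params)

def list_captures(target, limit=10):
    """Fetch capture records for `target`.

    The JSON response is a list of rows; the first row is the header
    (urlkey, timestamp, original, mimetype, statuscode, digest, length).
    """
    with urlopen(cdx_query_url(target, limit)) as resp:
        rows = json.load(resp)
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]

if __name__ == "__main__":
    for cap in list_captures("python.org", limit=5):
        # Each capture can be replayed at:
        #   http://web.archive.org/web/<timestamp>/<original>
        print(cap["timestamp"], cap["original"])
```

From each capture's replay URL you could then fetch the archived page and follow its links a few levels down with an HTML parser.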
Does anyone have any suggestions for a way to go about this pythonically? Many thanks, Nick
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor