The situation: I'm trying to resume a large recursive download of a site with many files (-r -l 10 -c)
The problem: When resuming, wget issues a large number of HEAD requests for the files it has already downloaded. This triggers the upstream firewall, making the download impossible.

My initial idea was to parse wget's -o output, figure out which files still need to be downloaded, and then feed them via -i when continuing the download. This led me to the conclusion that I'd need two pieces of functionality:

(1) machine-parseable -o output, and
(2) a way to convert a partially downloaded directory structure into a list of links that still need downloading.

I could work around (1); the output of -o is just hard to parse. For (2), I could use lynx or w3m or something like that, but then I'm never sure that the links they produce are the same ones wget would produce. Therefore I'd love an option like `wget --extract-links ./index.html` that would just read an HTML file and print the list of links it contains, or at least an assertion that some other tool like urlscan extracts links exactly the same way wget does. (A rough sketch of what I mean is at the end of this post.)

There's a third idea that we discussed on IRC with darnir, namely having wget store its state while downloading. That would solve the original problem and would be pretty nice. However, I'd still like to have (1) and (2) done, because I'm also thinking of distributing this large download across a number of IP addresses, by running many instances of wget on many different servers (and writing a script that'd distribute the load).

Thoughts welcome :-)
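
P.S. For concreteness, here's a rough, purely illustrative sketch of what I mean by `--extract-links`: read one HTML file and print the links it contains, resolved against a base URL supplied by the caller. The file name, base URL, and tag/attribute list are my own placeholders, and there is no claim that this matches wget's actual link extraction, which is exactly the problem.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: print the links found in one HTML file,
resolved against a base URL given on the command line.  The tag/attribute
table below is a guess at what matters, not wget's real logic."""
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin

# attributes that typically carry links; wget follows more than these
LINK_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # resolve relative links against the page's original URL
                self.links.append(urljoin(self.base_url, value))

if __name__ == "__main__":
    # usage (hypothetical): extract-links.py ./index.html http://example.com/
    html_file, base_url = sys.argv[1], sys.argv[2]
    parser = LinkExtractor(base_url)
    with open(html_file, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    for link in parser.links:
        print(link)
```

The output of something like this could then be compared against the partially downloaded tree and the remainder fed back to wget with -i, but only if the extraction is guaranteed to match what wget itself would have followed.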