The situation: I'm trying to resume a large recursive download of a site with many files (-r -l 10 -c)
The problem: When resuming, wget issues a large number of HEAD requests for the files it has already downloaded. This triggers the upstream firewall, making the download impossible.

My initial idea was to parse wget's -o output, figure out which files still need to be downloaded, and then feed them via -i when continuing the download. This led me to the conclusion that I'd need two pieces of functionality:

(1) machine-parseable -o output, and
(2) a way to convert a partially downloaded directory structure into a list of links that still need downloading.

I could work around (1); the output of -o is just hard to parse. For (2), I could use lynx or w3m or something like that, but then I'm never sure that the links they produce are the same ones wget would produce. Therefore I'd love an option like `wget --extract-links ./index.html` that would just read an HTML file and print the list of links it contains, or at least an assertion that some other tool like urlscan extracts links exactly the same way wget does. (A rough sketch of what I mean is at the end of this post.)

There's a third idea that we discussed on IRC with darnir, namely having wget store its state while downloading. That would solve the original problem and would be pretty nice. However, I'd still like to have (1) and (2) done, because I'm also thinking of distributing this large download across a number of IP addresses, by running many instances of wget on many different servers (and writing a script that'd distribute the load).

Thoughts welcome :-)
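
P.S. For concreteness, here's a rough, purely illustrative sketch of what I mean by `--extract-links`: read one HTML file and print the links it contains, resolved against a base URL supplied by the caller. The file name, base URL, and tag/attribute list are my own placeholders, and there is no claim that this matches wget's actual link extraction, which is exactly the problem.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: print the links found in one HTML file,
resolved against a base URL given on the command line.  The tag/attribute
table below is a guess at what matters, not wget's real logic."""
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin

# attributes that typically carry links; wget follows more than these
LINK_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # resolve relative links against the page's original URL
                self.links.append(urljoin(self.base_url, value))

if __name__ == "__main__":
    # usage (hypothetical): extract-links.py ./index.html http://example.com/
    html_file, base_url = sys.argv[1], sys.argv[2]
    parser = LinkExtractor(base_url)
    with open(html_file, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    for link in parser.links:
        print(link)
```

The output of something like this could then be compared against the partially downloaded tree and the remainder fed back to wget with -i, but only if the extraction is guaranteed to match what wget itself would have followed.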