Hi,

On 07/12/2018 08:12 PM, Triston Line wrote:
> Hi Wget team,
>
> I am but a lowly user and linux sysadmin, however, after noticing the wget2
> project I have wondered about a feature that could be added to the new
> version.
>
> I approve of all the excellent new features already being added (especially
> the PFS, Shoutcast and scanning features), but has there been any
> consideration about continuing a "session" (Not a cookie session, a
> recursive session)? Perhaps retaining the last command in a backup/log file
> with the progress it last saved or if a script/command is interrupted and
> entered again in the same folder, wget will review the existing files
> before commencing the downloads and or link conversion depending on what
> stage of the "session" it was at.
-N/--timestamping nearly does what you need. If a page to download already
exists locally, wget2 (also newer versions of wget) adds the If-Modified-Since
HTTP header to the GET request. The server then only sends payload/data if it
has a newer version of that document, else it responds with 304 Not Modified.
That is ~400 bytes per page, so just ~400 kB per 1000 pages. Depending on the
server's power and your bandwidth, you can increase the number of parallel
connections with --max-threads.

> If that's possible that would help immensely. I "review" sites for my
> friends at UBC and we look at geographic performance on their apache and
> nginx servers, the only problem is they encounter minor errors from time to
> time while recursively downloading (server-side errors nothing to do with
> wget) so the session ends.

Some HTTP error responses, e.g. 404 or 5xx, will prevent wget from retrying
that page. Wget2 just recently got --retry-on-http-status to change this
behavior (see the docs for an example; also see --tries).

> The other example I have is while updating my recursive downloads, we
> encounter power-failures during winter storms and from time to time very
> large recursions are interrupted and it feels bad downloading a web portal
> your team made together consisting of roughly 25,000 or so web pages and at
> the 10,000th page mark your wget session ends at like 3am. (Worse than
> stepping on lego I promise).

See above (-N). 10,000 pages would then mean ~4 MB of extra download... plus a
few minutes. Let me know if you still think that this is a problem. A
'do-not-download-local-files-again' option wouldn't be too hard to implement.
But the -N option is perfect for syncing - it just downloads what has changed
since the last time.

Caveat: Some servers don't support the If-Modified-Since header, which is
pretty stupid and normally just a server-side configuration knob.

Regards, Tim
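
P.S. In case a concrete example helps, an invocation along these lines is what
I have in mind (the URL is just a placeholder, and please check the docs for
the exact value syntax of --retry-on-http-status):

  # Re-runnable mirror: -N sends If-Modified-Since for files that already
  # exist locally, so unchanged pages come back as cheap 304 responses.
  wget2 --recursive --timestamping --max-threads=5 \
        --retry-on-http-status=503 --tries=3 \
        https://example.org/portal/

Running the same command again in the same folder after an interruption
effectively continues the "session": only new or changed pages are downloaded
in full.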
