Hi,

On 07/12/2018 08:12 PM, Triston Line wrote:
> Hi Wget team,
>
> I am but a lowly user and linux sysadmin, however, after noticing the wget2
> project I have wondered about a feature that could be added to the new
> version.
>
> I approve of all the excellent new features already being added (especially
> the PFS, Shoutcast and scanning features), but has there been any
> consideration about continuing a "session" (Not a cookie session, a
> recursive session)? Perhaps retaining the last command in a backup/log file
> with the progress it last saved or if a script/command is interrupted and
> entered again in the same folder, wget will review the existing files
> before commencing the downloads and or link conversion depending on what
> stage of the "session" it was at.
-N/--timestamping nearly does what you need. If a page to download already
exists locally, wget2 (also newer versions of wget) adds the If-Modified-Since
HTTP header to the GET request. The server then only sends payload/data if it
has a newer version of that document, else it responds with 304 Not Modified.
That is ~400 bytes per page, so just ~400 kB per 1000 pages. Depending on the
server's power and your bandwidth, you can increase the number of parallel
connections with --max-threads.

> If that's possible that would help immensely. I "review" sites for my
> friends at UBC and we look at geographic performance on their apache and
> nginx servers, the only problem is they encounter minor errors from time to
> time while recursively downloading (server-side errors nothing to do with
> wget) so the session ends.

Some HTTP error responses, e.g. 404 or 5xx, will prevent wget from retrying
that page. Wget2 just recently got --retry-on-http-status to change this
behavior (see the docs for an example; also see --tries).

> The other example I have is while updating my recursive downloads, we
> encounter power-failures during winter storms and from time to time very
> large recursions are interrupted and it feels bad downloading a web portal
> your team made together consisting of roughly 25,000 or so web pages and at
> the 10,000th page mark your wget session ends at like 3am. (Worse than
> stepping on lego I promise).

See above (-N). 10,000 pages would then mean ~4 MB of extra download... plus a
few minutes. Let me know if you still think that this is a problem. A
'do-not-download-local-files-again' option wouldn't be too hard to implement.
But the -N option is perfect for syncing - it just downloads what has changed
since the last time.

Caveat: Some servers don't support the If-Modified-Since header, which is
pretty stupid and normally just a server-side configuration knob.

Regards, Tim
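
P.S. In case a concrete example helps, an invocation along these lines is what
I have in mind (the URL is just a placeholder, and please check the docs for
the exact value syntax of --retry-on-http-status):

  # Re-runnable mirror: -N sends If-Modified-Since for files that already
  # exist locally, so unchanged pages come back as cheap 304 responses.
  wget2 --recursive --timestamping --max-threads=5 \
        --retry-on-http-status=503 --tries=3 \
        https://example.org/portal/

Running the same command again in the same folder after an interruption
effectively continues the "session": only new or changed pages are downloaded
in full.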
