Re: [Bug-wget] wget2 Feature Suggestion - Triston Line

2018-07-13 Thread Triston Line
Hi Tim,

Excellent answer, thank you very much for this info. "-N" or
"--timestamping" sounds like a much better way to go. However, if I'm
converting links with wget (1), I think I've read somewhere (and noticed
myself) that two separate commands run in series can't pick up where the
previous one left off, because of the links from the previous
session/command instance? More precisely, I've read that the main reason
continuing from a fault is impossible is that link conversion for a mirror
can't be resumed, and the converted links are only valid for that session.
Sounds silly to me because, from my understanding, you're just rewriting
anchor/href tags, but there's probably a bit more to it.

I have used --max-threads in the past, and I've tried an xargs suggestion
from one of the Stack Exchange forums, so I do toy with those settings
while testing my friend's servers at UBC. With the government servers, on
the other hand, I might get in a bit of trouble if I'm loading them during
working hours (gosh knows I don't want to come in at some ungodly hour,
e.g. 3 am, with the network-services team to toy around with their
equipment at different sites, or run intranet backups across sites from my
local machine).

" The server then only sends payload/data if it has a newer version of that
document, else it responds with 304 Not Modified." This is 400 Bytes to
respond with the last modification date of a file? I'm aware FTP uses open
timestamps on files, but do most apache/nginx servers? I'll query the gov
about their policy but I somehow doubt it's complicated (Or maybe a
security auditor came in and messed it up, seems to be the way around here!
"It's all default or it's specialized to the point of gibberish").

4 MB of extra download is very little to me; I've filled a few TB on a
home server just looping through different thread counts and then
remote-form/cloud-application queries (with packages separate from wget).
Actually, I was working on one of our old FTP servers that held a backup of
our local reports to Environment Canada and NOAA; the wget log file was
well over 350 MB of text, and I can't remember how big the provincial and
federal legislative "backups" were, but holy crap, that log was longer than
the circumference of the Earth at 3-point font. I say "backups" because
there's a better way than mirroring the site, but "no, that would take too
much paperwork and this is a better workaround".

Thanks Tim, I will toy with these new options over the weekend; I was
actually wondering about updates to site mapping and site-probing. In the
meantime I have to make phone apps for emergency preparedness within
community health services *eyeroll* :P I really appreciate your work on
this package, by the way; as you can tell, wget has helped me in many
endeavors and has clearly improved how the government and much of society
operate :)

Triston



On Fri, Jul 13, 2018 at 2:34 AM, Tim Rühsen wrote:

> On 07/12/2018 08:12 PM, Triston Line wrote:
> > If that's possible that would help immensely. I "review" sites for my
> > friends at UBC and we look at geographic performance on their apache and
> > nginx servers, the only problem is they encounter minor errors from time
> to
> > time while recursively downloading (server-side errors nothing to do with
> > wget) so the session ends.
>
> Just forgot: Check out Wget2's --stats-site option. It gives you
> statistical information about all pages downloaded, including parent
> (linked from), status, size, compression, timing, encoding and a few
> more. You can visualize with graphviz or put the data into a database
> for easy analysis.
>
> Example:
> $ wget2 --stats-site=csv:site.csv -r -p https://www.google.com
> $ cat site.csv
> ID,ParentID,URL,Status,Link,Method,Size,SizeDecompressed,TransferTime,ResponseTime,Encoding,Verification
> 1,0,https://www.google.com/robots.txt,200,1,1,1842,6955,33,33,1,0
> 2,0,https://www.google.com,200,1,1,4637,10661,83,83,1,0
> 4,2,https://www.google.com/images/branding/product/ico/googleg_lodp.ico,200,1,1,1494,5430,32,31,1,0
> 5,2,https://www.google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png,200,1,1,5482,5482,36,36,0,0
> 3,2,https://www.google.com/images/nav_logo229.png,200,1,1,12263,12263,59,58,0,0
>
> Regards, Tim
>
>


Re: [Bug-wget] request to change retry default

2018-07-13 Thread Tim Rühsen
On 07/08/2018 02:59 AM, John Roman wrote:
> Greetings,
> I wish to discuss a formal change of the default retry for wget from 20
> to something more pragmatic such as two or three.
> 
> While I believe 20 retries may have been the correct default many years
> ago, it seems overkill for the modern "cloud based" internet, where most
> sites are backed by one or more load balancers.  Geolocatable A records
> further reduce the necessity for retries by providing a second or third
> option for browsers to try.  To a lesser extent, GTM and GSLB
> technologies (however maligned they may be) are also sufficient to
> properly handle failures for significant amounts of traffic.  BGP
> network technology for large hosting providers has further reduced the
> need to perform several retries to a site.  Finally, for better or
> worse, environments such as Kubernetes and other container orchestration
> tools seem to afford sites unlimited uptime, if the marketing is to be
> trusted.

Solution: just add 'tries = 3' to /etc/wgetrc or to ~/.wgetrc and never
worry about it again.
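
For example (example.com is just a placeholder; the same limit can also be
set per invocation with --tries):

# in /etc/wgetrc or ~/.wgetrc
tries = 3

$ wget --tries=3 -r https://example.com/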

But I do wonder a bit about your request... if 3 tries were always enough
to fetch a file safely, then it wouldn't matter whether tries is set to 20,
20,000 or even unlimited. Is there something you might have forgotten to
mention!?

Regards, Tim





Re: [Bug-wget] wget2 Feature Suggestion - Triston Line

2018-07-13 Thread Tim Rühsen
On 07/12/2018 08:12 PM, Triston Line wrote:
> If that's possible that would help immensely. I "review" sites for my
> friends at UBC and we look at geographic performance on their apache and
> nginx servers, the only problem is they encounter minor errors from time to
> time while recursively downloading (server-side errors nothing to do with
> wget) so the session ends.

Just forgot: Check out Wget2's --stats-site option. It gives you
statistical information about all pages downloaded, including parent
(linked from), status, size, compression, timing, encoding and a few
more. You can visualize with graphviz or put the data into a database
for easy analysis.

Example:
$ wget2 --stats-site=csv:site.csv -r -p https://www.google.com
$ cat site.csv
ID,ParentID,URL,Status,Link,Method,Size,SizeDecompressed,TransferTime,ResponseTime,Encoding,Verification
1,0,https://www.google.com/robots.txt,200,1,1,1842,6955,33,33,1,0
2,0,https://www.google.com,200,1,1,4637,10661,83,83,1,0
4,2,https://www.google.com/images/branding/product/ico/googleg_lodp.ico,200,1,1,1494,5430,32,31,1,0
5,2,https://www.google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png,200,1,1,5482,5482,36,36,0,0
3,2,https://www.google.com/images/nav_logo229.png,200,1,1,12263,12263,59,58,0,0
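
As a rough illustration of the "put the data into a database" idea above
(a sketch, assuming sqlite3 is installed; the table is created from the
CSV header row on import):

$ sqlite3 stats.db <<'EOF'
.mode csv
.import site.csv site
SELECT Status, COUNT(*) AS pages FROM site GROUP BY Status;
EOF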

Regards, Tim





Re: [Bug-wget] wget2 Feature Suggestion - Triston Line

2018-07-13 Thread Tim Rühsen
Hi,

On 07/12/2018 08:12 PM, Triston Line wrote:
> Hi Wget team,
> 
> I am but a lowly user and Linux sysadmin; however, after noticing the wget2
> project I have been wondering about a feature that could be added to the new
> version.
> 
> I approve of all the excellent new features already being added (especially
> the PFS, Shoutcast and scanning features), but has there been any
> consideration of continuing a "session" (not a cookie session, a
> recursive session)? Perhaps by retaining the last command in a backup/log
> file along with the progress it last saved, or, if a script/command is
> interrupted and entered again in the same folder, wget would review the
> existing files before resuming the downloads and/or link conversion,
> depending on what stage of the "session" it was at.

-N/--timestamping nearly does what you need. If a page to download
already exists locally, wget2 (also newer versions of wget) adds the
If-Modified-Since HTTP header to the GET request. The server then only
sends payload/data if it has a newer version of that document, else it
responds with 304 Not Modified.

That is ~400 bytes per page, so just 400k bytes per 1000 pages.
Depending on the server's power and your bandwidth, you can increase the
number of parallel connections with --max-threads.
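
For example, a rough sketch combining those options (example.com is just a
placeholder; tune --max-threads to what the server and your bandwidth can
handle):

$ wget2 -r -p -N --max-threads=10 https://example.com/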

> If that's possible that would help immensely. I "review" sites for my
> friends at UBC and we look at geographic performance on their apache and
> nginx servers, the only problem is they encounter minor errors from time to
> time while recursively downloading (server-side errors nothing to do with
> wget) so the session ends.

Some server errors, e.g. 404 or 5xx, will prevent wget from retrying that
page. Wget2 has just recently gained --retry-on-http-status to change this
behavior (see the docs for an example, and also --tries).
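
A hedged sketch of what that could look like (the status codes and exact
value syntax here are only illustrative - please check the wget2 docs):

$ wget2 -r --tries=3 --retry-on-http-status=429,503 https://example.com/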

> The other example I have is while updating my recursive downloads, we
> encounter power-failures during winter storms and from time to time very
> large recursions are interrupted and it feels bad downloading a web portal
> your team made together consisting of roughly 25,000 or so web pages and at
> the 10,000th page mark your wget session ends at like 3am. (Worse than
> stepping on lego I promise).

See above (-N). 10,000 pages would mean roughly 4 MB of extra download
then... plus a few minutes. Let me know if you still think that this is a
problem. A 'do-not-download-local-files-again' option wouldn't be too
hard to implement. But the -N option is perfect for syncing - it just
downloads what has changed since the last time.

Caveat: some servers don't support the If-Modified-Since header, which
is pretty stupid and normally just a server-side configuration knob.

Regards, Tim


