Regarding my difficulties with recursively retrieving
http://www.iana.org/assignments/index.html:  I discovered that one page
(http://www.iana.org/assignments/forces/forces.xhtml) is reachable
through no fewer than three different URLs:

http://www.iana.org/assignments/forces/forces.xhtml
http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml
http://www.iana.org/assignments/forces

The first is the proper URL for it, and the other two redirect to the
first URL.

There are several other occurrences of this situation.

I also discovered that if I specify --trust-server-names, wget
realizes that the redirection URL needs to be retrieved only once, and
that links to the other two URLs can be directed to that one file.
Without --trust-server-names, wget considers all three URLs to be
different, even though they redirect to the same URL, and dutifully
stores essentially the same content three times.
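
As a rough illustration of the difference (a Python sketch, not wget's
actual logic; the REDIRECTS table and the stored_keys helper are
hypothetical names made up for this example), keying stored files by
the final URL collapses the three links into one:

```python
# Hypothetical redirect table modelled on the three URLs above.
# Each requested URL maps to the URL the server finally serves.
FINAL = "http://www.iana.org/assignments/forces/forces.xhtml"
REDIRECTS = {
    "http://www.iana.org/assignments/forces/forces.xhtml": FINAL,
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml": FINAL,
    "http://www.iana.org/assignments/forces": FINAL,
}


def stored_keys(links, trust_server_names):
    """Return the set of URLs that would each get their own local copy.

    Assumption: without trusting the server, every distinct link is
    stored separately; with it, links are keyed by their final URL.
    """
    if trust_server_names:
        return {REDIRECTS.get(url, url) for url in links}
    return set(links)


if __name__ == "__main__":
    print(len(stored_keys(REDIRECTS, trust_server_names=False)))
    print(len(stored_keys(REDIRECTS, trust_server_names=True)))
```

Under these assumptions the first call yields three separate copies and
the second yields one.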

It turns out that this provides me with a much better mirror of the web
site.

I've attached a patch that improves the documentation of
--trust-server-names, to clarify that if -nd is not in effect, then the
file name is constructed from the entire redirection URL, not just the
last component.

(--trust-server-names is also mentioned in doc/metalink-standard.txt,
but that text does not seem to me to have the problem the patch
corrects.)
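
To make the file-name point concrete, here is a minimal Python sketch
of how I understand the naming to work.  It is an assumption-laden
illustration, not wget's code: local_name is a hypothetical helper, and
it assumes the default layout of a host directory followed by the URL
path, with index.html used for directory-like URLs.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(url, no_directories):
    """Sketch of the local file name derived from a (redirection) URL.

    no_directories=True stands in for -nd: only the last component is
    kept.  Otherwise the name is built from the host and the whole path.
    """
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        # Assumption: directory-like URLs are saved as index.html.
        path += "/index.html" if path == "" else "index.html"
    if no_directories:
        return posixpath.basename(path)
    return parts.netloc + path


if __name__ == "__main__":
    url = "http://www.iana.org/assignments/forces/forces.xhtml"
    print(local_name(url, no_directories=False))
    print(local_name(url, no_directories=True))
```

With -nd the sketch keeps only forces.xhtml; without it, the whole path
of the redirection URL ends up in the local name, which is the case the
old wording did not cover.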

Dale
From 740c68d4d820334362dc93ce5c31b9d742f12558 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Wed, 2 Nov 2016 12:14:46 -0400
Subject: [PATCH] Improve documentation of --trust-server-names.

---
 doc/wget.texi | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index 91219e5..3632fd1 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1700,9 +1700,11 @@ with a http status code that indicates error.
 @cindex Trust server names
 @item --trust-server-names
 
-If this is set to on, on a redirect the last component of the
-redirection URL will be used as the local file name.  By default it is
-used the last component in the original URL.
+If this is set, on a redirect, the local file name will be based
+on the redirection URL.  By default the local file name is based on
+the original URL.  When doing recursive retrieving this can be helpful
+because in many web sites redirected URLs correspond to an underlying
+file structure, while link URLs do not.
 
 @cindex authentication
 @item --auth-no-challenge
@@ -3261,8 +3263,8 @@ Turn on recognition of the (non-standard) @samp{Content-Disposition}
 HTTP header---if set to @samp{on}, the same as @samp{--content-disposition}.
 
 @item trust_server_names = on/off
-If set to on, use the last component of a redirection URL for the local
-file name.
+If set to on, construct the local file name from redirection URLs
+rather than original URLs.
 
 @item continue = on/off
 If set to on, force continuation of preexistent partially retrieved
-- 
1.8.3.1
