[Bug-wget] Filtering of page requisites

Dale R. Worley Wed, 12 Oct 2016 07:51:24 -0700

So I've run into another version of the problem:  I'm using
--page-requisites, and they're getting filtered in much the same way as
redirections.  However, the new fixes don't change that behavior.


The example case is that
    $ wget --mirror --convert-links --page-requisites --limit-rate=20k \
        --include-directories=/assignments \
        http://www.iana.org/assignments/index.html
does not fetch the CSS specified by
http://www.iana.org/assignments/index.html in
        <link rel="stylesheet" media="screen" href="../_css/2015.1/screen.css"/>
which is http://www.iana.org/_css/2015.1/screen.css.

It looks like requisite URLs are marked by setting the link_inline_p field
of struct urlpos.  If that flag is set and opt.page_requisites is set, then
test 4 of download_child (the --no-parent test) is suppressed.
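
To make that concrete, here is a minimal sketch of how I read that
suppression.  It is not the code in recur.c: the sketch_* structs and
functions are hypothetical stand-ins that carry only the fields named
above, and the directory comparison is a simplified assumption.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for wget's options and struct urlpos;
   only the fields discussed above are modelled. */
struct sketch_opts { bool no_parent; bool page_requisites; };
struct sketch_urlpos { bool link_inline_p; };

/* Hypothetical helper: true when dir equals parent_dir or lies below it. */
static bool sketch_subdir_p (const char *parent_dir, const char *dir)
{
  size_t n = strlen (parent_dir);
  if (strncmp (parent_dir, dir, n) != 0)
    return false;
  return n == 0 || dir[n] == '\0' || dir[n] == '/';
}

/* Returns true when the --no-parent test would reject the link.
   The test is skipped entirely for inline links (page requisites)
   when page_requisites is set. */
static bool sketch_no_parent_rejects (const struct sketch_opts *opt,
                                      const struct sketch_urlpos *upos,
                                      const char *start_dir, const char *dir)
{
  if (!opt->no_parent)
    return false;
  if (opt->page_requisites && upos->link_inline_p)
    return false;               /* test suppressed for requisites */
  return !sketch_subdir_p (start_dir, dir);
}
```

With both options set, a stylesheet in "_css/2015.1" passes this test
when it is an inline link, while an ordinary link to the same directory
is rejected.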

This change seems to add the same logic as is applied to redirections:

diff --git a/src/recur.c b/src/recur.c
index 1469e31..b1f9109 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -462,6 +462,12 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 
                   r = download_child (child, url_parsed, depth,
                                       start_url_parsed, blacklist, i);
+                 if (child->link_inline_p &&
+                     (r == WG_RR_LIST || r == WG_RR_REGEX))
+                   {
+                     DEBUGP (("Ignoring decision for page requisite, decided to load it.\n"));
+                     r = WG_RR_SUCCESS;
+                   }
                   if (r == WG_RR_SUCCESS)
                     {
                       ci = iri_new ();

and it has the expected effect: the requisites for index.html are
downloaded.
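
The decision the change makes can be isolated in a small standalone
sketch.  The sk_* names are hypothetical and the enum is only a subset
with arbitrary values, not wget's actual reject-reason type; the point
is that only the directory-list and regex rejections are overridden,
and only for inline links.

```c
#include <stdbool.h>

/* Hypothetical subset of reject reasons; SK_RR_OTHER stands for any
   rejection that the change leaves untouched. */
typedef enum
{
  SK_RR_SUCCESS,
  SK_RR_LIST,
  SK_RR_REGEX,
  SK_RR_OTHER
} sk_reject_reason;

/* Turn a list/regex rejection into success when the link is a page
   requisite (inline link); pass every other result through unchanged. */
static sk_reject_reason
sk_apply_requisite_override (bool link_inline_p, sk_reject_reason r)
{
  if (link_inline_p && (r == SK_RR_LIST || r == SK_RR_REGEX))
    return SK_RR_SUCCESS;
  return r;
}
```

In the example above, screen.css is an inline link rejected by the
--include-directories list, so it would map to success here, while an
ordinary hyperlink outside /assignments would stay rejected.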

I've attached a patch for this that includes an update to the manual
page, although that update doesn't mention the suppression of the
--no-parent test.

Dale

diff --git a/doc/wget.texi b/doc/wget.texi
index f42773e..04d1562 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -2289,7 +2289,11 @@ wget -p http://@var{site}/1.html
 @end example
 
 Note that Wget will behave as if @samp{-r} had been specified, but only
-that single page and its requisites will be downloaded.  Links from that
+that single page and its requisites will be downloaded.
+(As with @samp{-r}, the @samp{--include-directories},
+@samp{--exclude-directories}, @samp{--accept-regex}, and @samp{--reject-regex}
+tests are not applied to page requisites.)
+Links from that
 page to external documents will not be followed.  Actually, to download
 a single page and all its requisites (even if they exist on separate
 websites), and make sure the lot displays properly locally, this author
diff --git a/src/recur.c b/src/recur.c
index 1469e31..fdb1d2e 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -462,6 +462,12 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 
                   r = download_child (child, url_parsed, depth,
                                       start_url_parsed, blacklist, i);
+		  if (child->link_inline_p &&
+		      (r == WG_RR_LIST || r == WG_RR_REGEX))
+		    {
+		      DEBUGP (("Ignoring decision for page requisite, decided to load it.\n"));
+		      r = WG_RR_SUCCESS;
+		    }
                   if (r == WG_RR_SUCCESS)
                     {
                       ci = iri_new ();
