Re: [Bug-wget] Read error (Success)?

2018-11-20 Thread Dale R. Worley
"Tony Lewis"  writes:
> I'm getting the following error and don't understand what it's trying to
> tell me:
>
> Read error at byte 97430 (Success)
>
> What could the server be doing to cause wget to report an error with the
> details being "Success"?

My guess is that the server's response starts with a message including
the description "Success".

> For what it's worth, the page in question is coming from WordPress and the
> PHP script that's generating the page did in fact emit 97430 bytes of data.
> Could the server be leaving the port in some weird state?

An easy way to help us is to report exactly the wget command line you
used, so we can replicate what you did and see what happens.  (As
opposed to now, when we're stuck making guesses as to what might have
happened.)

Dale



Re: [Bug-wget] wget in a 'dynamic' pipe

2018-09-12 Thread Dale R. Worley
Paul Wagner  writes:
> That's what the OP thinks, too.  I attributed the slow startup to DNS 
> resolution.

Depending on your circumstances, one way to fix that is to set up a local
caching-only DNS server and direct ordinary processes to use it.  Then
the first lookup is expensive, but the caching server saves the
resolution and answers later queries very quickly.
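
For example, a minimal sketch using dnsmasq (one of several caching
resolvers; the upstream server and cache size below are illustrative):

    # Run a caching-only resolver on the loopback address, forwarding
    # cache misses upstream; --no-daemon keeps it in the foreground.
    dnsmasq --no-daemon --listen-address=127.0.0.1 --port=53 \
            --cache-size=1000 --server=8.8.8.8

Then point the resolver library at it, e.g. by listing
"nameserver 127.0.0.1" first in /etc/resolv.conf.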

Dale



Re: [Bug-wget] wget in a 'dynamic' pipe

2018-09-11 Thread Dale R. Worley
Tim Rühsen  writes:
> Thanks for the pointer to coproc, never heard of it ;-) (That means I
> never had a problem that needed coproc).
>
> Anyways, copy the script results in a file '[1]' with bash 4.4.23.

Yeah, I'm not surprised there are bugs in it.

> Also, wget -i - waits with downloading until stdin has been closed. How
> can you circumvent that ?

The more I think about the original problem, the more puzzled I am.  The
OP said that starting wget for each URL took a long time.  But my
experience is that starting processes is quite quick.  (I once modified
tar to compress each file individually with gzip before writing it to an
Exabyte tape.  On a much slower processor than modern processors, the
writing was not delayed by starting a process for each file written.)

I suspect the delay is not starting wget but establishing the initial
HTTP connection to the server.

Probably a better approach to the problem is to download the files in
batches on N consecutive URLs, where N is large enough that the HTTP
startup time is well less than the total download time.  Process each
batch with a separate invocation of wget, and exit the loop when an
attempted batch doesn't create any new downloaded files (or, the last
file in the batch doesn't exist), indicating there are no more files to
download.
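
A rough sketch of that approach, using the OP's URL pattern (the batch
size and the final existence test are only illustrative):

    i=1
    N=20
    while true
    do
        # Generate one batch of N URLs and hand it to a single wget run.
        for ((j = 0; j < N; j++))
        do
            echo "http://domain.com/path/segment_$((i++)).mp4"
        done | wget -i -
        # The last URL generated was segment_$((i-1)); if wget did not
        # create that file, there are no more segments to fetch.
        [[ -e "segment_$((i-1)).mp4" ]] || break
    done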

Dale



Re: [Bug-wget] wget in a 'dynamic' pipe

2018-09-10 Thread Dale R. Worley
Paul Wagner  writes:
> Now I tried
>
>{ i=1; while [[ $i != 100 ]]; do echo 
> "http://domain.com/path/segment_$((i++)).mp4"; done } | wget -O foo.mp4 
> -i -
>
> which works like a charm *as long as the 'generator process' is finite*, 
> i.e. the loop is actually programmed as in the example.  The problem is 
> that it would be much easier if I could let the loop run forever, let 
> wget get whatever is there and then fail after the counter extends to a 
> segment number not available anymore, which would in turn fail the whole 
> pipe.

Good God, this finally motivates me to learn about Bash coprocesses.

I think the answer is something like this:

coproc wget { wget -O foo.mp4 -i - ; }

i=1
while true
do
    rm -f foo.mp4
    echo "http://domain.com/path/segment_$((i++)).mp4" >&"${wget[1]}"
    sleep 5
    # The only way to test for non-existence of the URL is whether the
    # output file exists.
    [[ ! -e foo.mp4 ]] && break
    # Do whatever you already do to wait for foo.mp4 to be completed and
    # then use it.
done

# Close wget's input so it sees end-of-file.
exec {wget[1]}>&-
# Wait for it to finish.
wait "$wget_PID"

Dale



Re: [Bug-wget] Any explanation for the '-nc' returned value?

2018-07-29 Thread Dale R. Worley
Tim Rühsen  writes:
>-nd, even if -r or -p are in effect.)  When -nc is specified,
> this behavior is suppressed, and Wget will
>refuse to download newer copies of file.

Though strictly speaking, this doesn't say that wget will then exit with
error code 1.

Dale



Re: [Bug-wget] [curlsec] [USN-3464-1] Wget vulnerabilities

2017-12-31 Thread Dale R. Worley
Kristian Erik Hermansen  writes:
> I still contend that this is at least a bug, and potentially a
> security issue, but only when the headers are ones that should NEVER
> have multiple values.

I agree with others that it's not clear that there's a security issue
here.  It appears that wget/curl can be used to generate HTTP requests
(or pseudo-HTTP requests) that might exploit security problems in web
servers, but that's the web servers' problem, not wget's/curl's.

Certainly, making sure that wget/curl can't generate such requests
doesn't stop the black-hats from generating them by other means.

Dale



Re: [Bug-wget] wget from a subfolder of a site

2017-12-26 Thread Dale R. Worley
Remember to send the message to the bug-wget mailing list!

"ifffam ..."  writes:
> I have tried these two ways:
>
> wget -r --no-parent --include-directories 
> http://www.astrosurf.com/luxorion/menu-quantique.htm
>
> wget -r --include-directories 
> http://www.astrosurf.com/luxorion/menu-quantique.htm
>
> but I just got an error message (which translated would be 'URL lacking or 
> absent')

The manual page gives the format of the --include-directories option as
this:

   --include-directories=list

So you have to give the option a list of the directories that will be
included in the download, as well as the URL of the place to start.

The particular command I've used with this option is

wget --include-directories=/assignments \
http://www.iana.org/assignments/index.html

But you will likely have to experiment to discover a command that does
exactly what you want -- the manual page is not very clear about the
options that control which URLs are excluded from downloading.

Dale



Re: [Bug-wget] wget from a subfolder of a site

2017-12-24 Thread Dale R. Worley
"ifffam ..."  writes:
> Hello! I wanted to ask you how to download just a part of a site, I
> mean, starting from a subfolder. I have tried
>
> wget -r --no-parent http://www.astrosurf.com/luxorion/menu-quantique.htm
>
> but didn't work, I mean, it downloaded the whole
> www.astrosurf.com site (more than 6GB!). I
> also tried adding -np but it didn't work either.

There are various complexities, but the --include-directories option
can help.

The major limitation, of course, is that wget will only fetch files that
are pointed to by other files it downloads, so there's no guarantee
you're getting every file in the subfolder.

Dale



Re: [Bug-wget] Errors-only mode

2017-08-10 Thread Dale R. Worley
Tim Rühsen  writes:
> wget -o/dev/null URL
>
> You can check for errors via the $? (on Linux, there should be something 
> comparable on other systems).

Yes, but since I'm running it in crontab, I really do want to have
output if an error occurs.  Of course, I can rig that by testing $?, but
I'd rather see something like grep's behavior, where there is output to
stderr if and only if an error was detected.
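
For the crontab case, a small wrapper gets that behavior (a sketch only;
$url and the log location are placeholders):

    log=$(mktemp)
    if ! wget -o "$log" "$url"
    then
        # cron mails anything on stderr, so errors still surface.
        cat "$log" >&2
    fi
    rm -f "$log"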

Dale



[Bug-wget] Errors-only mode

2017-08-09 Thread Dale R. Worley
Is there a way of invoking wget that produces no output if the operation
is successful (let's assuming that I'm fetching exactly one URL) and
produces appropriate error messages if it is not?

I would have thought this was easy, but I can't figure out how to do
it.  Even redirecting stdout to null doesn't work, since wget sends a
final success message to stderr:

2017-08-09 22:00:39 URL:http://www.gocomics.com/calvinandhobbes [99643] -> "calvinandhobbes.4" [1]

Dale



Re: [Bug-wget] other hosts default

2017-07-07 Thread Dale R. Worley
> kalle  writes:
>> well. then the info-document should be changed, so that it doesn't make 
>> the impression on the reader, that  it would be so. at least, this is 
>> how I interpeted it. and I was puzzled quite a time with this problem, 
>> that it seemed to not fit as described.

The effective way to do this is propose as a patch the specific changes
in the documentation that you would like to see.

Dale



Re: [Bug-wget] other hosts default

2017-07-06 Thread Dale R. Worley
kalle  writes:
> This e-mail is written in simplified experimental phonetical english. 
> For remarks on understandibility you are welcome to give feedback to the 
> author. And : I hope you aren't too annoyed.

> from dhe info-dokiument abawt "wget" y anderstud, dhat with dhe 
> riikersiv (recursive) optshn it shud alsow dawnlowd al links to adha 
> hosts. naw y jast red dhe FAQ and dher id iz sed, dhat dhis iz not dhe 
> keis by difoot.

What is the question?  As you state it, the document says that with
"recursive", wget will download links to other hosts, and which means,
of course, that the default behavior of wget (that is, without any
options) is not to do so.

In regard to phonetical English:  A characteristic of the traditional
Boston (Mass., US -- where I live) dialects of English is that "er" at
the end of words is pronounced without any "r" effect, which is in
marked contrast to Standard American English.  And I notice that you
have written "other" in that way:  "adha".  (Assuming that "dh" is used
for "edh" (voiced dental fricative).)

Which leads to the major problem with phonetic transcriptions, they
reflect the dialect of the speaker.  Languages with broad usage tend to
have more uniformity in their formal written form than their spoken
forms.  (Arabic and Chinese show this effect particularly strongly,
where the spoken versions aren't even mutually understandable.)

Dale



Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-14 Thread Dale R. Worley
L A Walsh <w...@tlinx.org> writes:
> Dale R. Worley wrote:
>>  But of course, no [RFC3986-conforming] URL
>>  contains an embedded space because that's what it
>>  says in RFC 3986, which is "what *defines* what a
>>  URL *is*"[sic; should read "is one definition of
> a URL.
> ---
> Right, just like speed limit signs define
> what the maximum speed is.
>
> There is the "model" and there is reality.  To believe that
> the model replaces and/or dictates reality is not
> realistic and bordering on some mental pathology.
>
> I understand what you are saying Dale.  My dad was a lawyer,
> and life would be so much easier if specs, RFCs or other
> models of reality were the only thing we had to pay attention
> to.  But... to do so generally creates various levels of
> discomfort and/or headaches.

There's a reason why the Internet has advanced on the back of thousands
of anal-retentive standards documents.

There really are situations where DWIM (Do What I Mean) design makes
life worse.  It's plausible that in a web browser it's reasonable to
allow users to type in purported URLs that are invalid, and for the
browser to make its best guess as to what the user meant.  This is
because getting the guess wrong rarely causes troubles beyond showing
the user a page that they aren't interested in; the user can just retype
the right URL and get what they wanted.

But every such slackness introduces uncertainty.  If the user types
"http://www.example.com/ " (that is, with a trailing space), should it
be handled as "http://www.example.com/%20" (assuming the user wanted to
access a file whose name is a single space, and providing the URL that
does that) or "http://www.example.com/" (assuming that the space is a
cut-and-paste error and should be ignored)?

As long as this is being directly monitored by the user, this works
reasonably well.  But once the DWIM program starts being used as a
*part* of a system, things get hazardous.  People start building other
parts of the system assuming that the DWIM program doesn't hold them to
the rules.  And since the DWIM program's behavior in those
outside-the-box cases isn't clearly defined, there's no protection from
the situation where its guesses change, but the rest of the system
depends on *particular* guesses that it used to make.

In the particular case of wget, consider that portions of the URL that
the user enters are extracted and used in the HTTP request.  Again,
there's a strict specification of what constitutes a valid HTTP request.
If the user includes an invalid character in the URL, should wget simply
pass it through into the HTTP request, assuming that a well-built web
browser will Do What the User (probably) Meant?

And it should be remembered that there's a design principle of Unix
that's rarely mentioned:  People write a lot of shell scripts for Unix,
and the external interface of Unix commands is optimized for use within
shell scripts, not for being directly executed by users.  That's why
most of them provide no output whatever if their execution is
successful, and why most of them that do generate output provide no
"headers" -- that would get in the way of handing the output to another
program as input.  I've even seen an exercise in a Unix training book
asking the student to explain why the single header line in the output
of the "ps" command is undesirable.

Within that context, the point of wget is to fetch the contents of a URL
that is provided by something else that *should* know what a properly
formed URL is.

Dale



Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-13 Thread Dale R. Worley
L A Walsh  writes:
>> But of course, no URL contains an embedded space.
> ---
> Why not?

Because that's what it says in RFC 3986, which is what *defines* what a
URL *is*.

Now, someone can provide a string that contains spaces and claim it's a
URL, but it isn't.  The question is, What to do with it?  My preference
is to barf and tell the user that what they provided wasn't a proper
URL.

Beyond that, one might do some simple tidying up, such as removing
leading and trailing spaces.  That fix, by the way, is known to be safe,
*because a URL can't contain a space*, and so any trailing space can't
actually be part of the URL.

It gets uglier when there are invalid characters in the middle of the
URL, because simply deleting them is unlikely to produce the results the
user expected.

Dale



Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?

2017-06-12 Thread Dale R. Worley
L A Walsh  writes:
> W/cut+paste into target line, where URL is double-quoted.  More often
> than not, I find it safer to double-quote a URL than not, because, for
> example, shells react badly to embedded spaces, ampersands and
> question marks.

But of course, no URL contains an embedded space.

If you double-quote, most shells have four special characters that are
processed within double-quotes:  " \ $ `  Of the four, only $ can appear
in a URL.

If you single-quote, most shells have only one special character, which
ends the string: '  Unfortunately, it's allowed in URLs.

So there's no quoting character which you can just put before and after
a URL and be sure that your shell won't damage the URL.
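
That said, the usual workaround is a splice rather than a single quoting
character: single-quote the URL and replace any embedded single quote
with '\'' (close the quote, backslash-escape the quote character, reopen
the quote).  For example, with a hypothetical URL:

    wget 'http://example.com/it'\''s-here.html'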

Dale



Re: [Bug-wget] How to disable reading or settings in .wgetrc?

2017-06-01 Thread Dale R. Worley
Avinash Sonawane  writes:
> BTW if you want to ignore .wgetrc altogether (which seems to be
> indicated by the subject line) then you have:
>
> $ man wget
>
> "--config=FILE
>Specify the location of a startup file you wish to use."
>
> So in that case you need to use --config (without any value) on the
> command line.

Actually, it means you could use "--config=/dev/null".  The
documentation does not suggest that --config is valid without a value.
So we really should add:

--config
Do not use any startup file.

Dale



Re: [Bug-wget] Bugs/New features

2017-05-28 Thread Dale R. Worley
Glacial Man  writes:
> 1) The name of the downloaded file is "windows." (or "stable."),
> probably because are not available static links but server requests
> only. How can I indicate to wget (used with --timestamping option)
> that the local file name's must be "uTorrent.exe"? (as if I download
> it using a browser)

> Pratically, how can I indicate to wget (used with --timestamping
> option) the path of the local file?

You could probably use --output-document=name.  However, since wget
conceptually can download many files in a session, and --output-document
is defined to put all downloaded files into one file, there might be
unpleasant interactions with --timestamping or --no-if-modified-since.

> when I use the --timestamping option, I need to manage the
> different cases to know what's occurred and, for example, to write
> in a log file

wget won't do that, but you could do something like:

# May return error if uTorrent.exe doesn't exist.
cp uTorrent.exe uTorrent.exe.save 
wget ...
if ! cmp uTorrent.exe uTorrent.exe.save
then
... do whatever you do if it has changed
fi
rm uTorrent.exe.save

Or you could save the hash of it in a shell variable and compare that:

OLD_HASH=$( sha1sum uTorrent.exe )
wget ...
NEW_HASH=$( sha1sum uTorrent.exe )
if [ "$OLD_HASH" != "$NEW_HASH" ]
then
...

> 3) The same for the --spider option, I need to know if the file is:
>  - downloadable (local file not available)
>  - updatable (different timestamp)
>  - already up to date (same timestamp)

It looks like wget --spider only tells you if the file exists on the
server.
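
One partial workaround (a sketch; it relies on the server sending a
Last-Modified header) is to combine --spider with -S/--server-response
and inspect the headers yourself:

    wget --spider -S http://example.com/uTorrent.exe 2>&1 |
        grep -i 'Last-Modified:'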

Dale



Re: [Bug-wget] Too verbose version information

2017-04-18 Thread Dale R. Worley
Mojca Miklavec  writes:
> I find it a bit annoying that I end up with all of the following
> information after building wget:
[...]
> The first part is reasonable (even if I don't like the fact that the
> exact OS version is hardcoded; after security updates the binary will
> change only due to the increased minor version, so reproducibility is
> lost):

Mostly, --version is so that if someone wants to know if the
documentation they're reading matches the executable they're running,
they can tell.  And as Tim said, --version gives the information that a
developer needs for a bug report.

But you say "reproducibility is lost"...  It sounds like you have a need
that the current --version cannot be used to satisfy.  What is that
need?

Dale



Re: [Bug-wget] Vulnerability Report - CRLF Injection in Wget Host Part

2017-03-08 Thread Dale R. Worley
Ander Juaristi <ajuari...@gmx.es> writes:
> On 06/03/17 16:47, Dale R. Worley wrote:
>> Orange Tsai <orange.8...@gmail.com> writes:
>>> # This will work
>>> $ wget 'http://127.0.0.1%0d%0aCookie%3a hi%0a/'
>> 
>> Not even considering the effect on headers, it's surprising that wget
>> doesn't produce an immediate error, since
>> "127.0.0.1%0d%0aCookie%3a hi%0a" is syntactically invalid as a host
>> part.  Why doesn't wget's URL parser detect that?
>
> Simply because it first splits the URL into several parts according to
> the delimiters, and then decodes the percent-encoding.
>
> Additionally for the host part it also checks whether it's an IP address
> and the IDNA stuff, but yeah you raise a good point. Other than that the
> host part is treated similarly to the other parts.

Ah, I looked into RFC 3986, and the generic syntax *does* allow the host
part to contain %-escapes.  But in any case,
"127.0.0.1Cookie:hi" is not parsable as an IPv4
address.  (Always beware of parsing functions that stop when they see
the first invalid character!)

(Also, shouldn't the above example have ended "hi%0d%0a/"?)

Dale



Re: [Bug-wget] recursive_retrieve()

2017-03-06 Thread Dale R. Worley
Tim Rühsen  writes:
> Did you try wildcard matching ? (-A "*.pdf*")

That's a bit subtle, though.  The -A pattern apparently has to match
everything in the URL after the final /, *including* the query-part
("?..."), which strictly speaking isn't part of the file name.  But the
documentation of -A/-R (in 1.16) describes it as a pattern to match the
file name:

   -A acclist --accept acclist
   -R rejlist --reject rejlist
   Specify comma-separated lists of file name suffixes or patterns to
   accept or reject. Note that if any of the wildcard characters, *,
   ?, [ or ], appear in an element of acclist or rejlist, it will be
   treated as a pattern, rather than a suffix.  In this case, you have
   to enclose the pattern into quotes to prevent your shell from
   expanding it, like in -A "*.mp3" or -A '*.mp3'.

Then again, what does -A/-R match against in a URL
"http://example.com/file.pdf?a/b/c;?

It seems like this needs something like "Note that this is matched
against the entire part of the URL following the final slash; for URLs
containing queries, it may not be the final component of the path part
of the URL."

Dale



Re: [Bug-wget] Vulnerability Report - CRLF Injection in Wget Host Part

2017-03-06 Thread Dale R. Worley
Orange Tsai  writes:
> # This will work
> $ wget 'http://127.0.0.1%0d%0aCookie%3a hi%0a/'

Not even considering the effect on headers, it's surprising that wget
doesn't produce an immediate error, since
"127.0.0.1%0d%0aCookie%3a hi%0a" is syntactically invalid as a host
part.  Why doesn't wget's URL parser detect that?  I'm sure the new
patch is an improvement, but it's surprising that the old code didn't
detect that was an invalid URL anyway, since it contains characters that
aren't permitted in those locations.

Dale



Re: [Bug-wget] wget test results

2017-02-24 Thread Dale R. Worley
Alasdair Thomas  writes:
> FAIL: Test-auth-basic
> =
>
> Can't locate HTTP/Daemon.pm in @INC (@INC contains: . 
> /opt/local/lib/perl5/site_perl/5.16.3/darwin-thread-multi-2level 
> /opt/local/lib/perl5/site_perl/5.16.3 
> /opt/local/lib/perl5/vendor_perl/5.16.3/darwin-thread-multi-2level 
> /opt/local/lib/perl5/vendor_perl/5.16.3 
> /opt/local/lib/perl5/5.16.3/darwin-thread-multi-2level 
> /opt/local/lib/perl5/5.16.3 .) at HTTPServer.pm line 6.
> BEGIN failed--compilation aborted at HTTPServer.pm line 6.
> Compilation failed in require at HTTPTest.pm line 6.
> BEGIN failed--compilation aborted at HTTPTest.pm line 6.
> Compilation failed in require at ./Test-auth-basic.px line 6.
> BEGIN failed--compilation aborted at ./Test-auth-basic.px line 6.
> FAIL Test-auth-basic.px (exit status: 2)

As you can see, all the messages are about "Can't locate HTTP/Daemon.pm".

On my system, that file is /usr/share/perl5/vendor_perl/HTTP/Daemon.pm:

$ locate Daemon.pm
/usr/share/perl5/vendor_perl/HTTP/Daemon.pm
/usr/share/perl5/vendor_perl/Net/Daemon.pm
$

That file comes from the "perl-HTTP-Daemon" package:

$ rpm -q --file /usr/share/perl5/vendor_perl/HTTP/Daemon.pm
perl-HTTP-Daemon-6.01-5.fc19.noarch
$

So it looks like you need to acquire that file/package in the
appropriate way for your system.
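
For example, if it isn't packaged for your system, CPAN can install it
directly into your Perl library path:

    cpan HTTP::Daemon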

Dale



Re: [Bug-wget] Current wget release

2017-02-02 Thread Dale R. Worley
Tim Ruehsen <tim.rueh...@gmx.de> writes:
> thanks, just push patch #1.

I've included that patch below (which I think is the correct way to
submit it).  It applies to the current commit (4734e8d 2017-01-17
15:16:40 +0100 * NEWS: update).

> The second one is still not a candidate for a release. The patch changes some 
> basic / default behavior and we still got no opinions/discussion from anyone 
> testing it. Sorry, but from the old thread there are too many open points.

OK, I'll go back in the mailing list and rejoin the discussion.

Dale

From 740c68d4d820334362dc93ce5c31b9d742f12558 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Wed, 2 Nov 2016 12:14:46 -0400
Subject: [PATCH] Improve documentation of --trust-server-names.

---
 doc/wget.texi | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index 91219e5..3632fd1 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1700,9 +1700,11 @@ with a http status code that indicates error.
 @cindex Trust server names
 @item --trust-server-names
 
-If this is set to on, on a redirect the last component of the
-redirection URL will be used as the local file name.  By default it is
-used the last component in the original URL.
+If this is set, on a redirect, the local file name will be based
+on the redirection URL.  By default the local file name is is based on
+the original URL.  When doing recursive retrieving this can be helpful
+because in many web sites redirected URLs correspond to an underlying
+file structure, while link URLs do not.
 
 @cindex authentication
 @item --auth-no-challenge
@@ -3261,8 +3263,8 @@ Turn on recognition of the (non-standard) @samp{Content-Disposition}
 HTTP header---if set to @samp{on}, the same as @samp{--content-disposition}.
 
 @item trust_server_names = on/off
-If set to on, use the last component of a redirection URL for the local
-file name.
+If set to on, construct the local file name from redirection URLs
+rather than original URLs.
 
 @item continue = on/off
 If set to on, force continuation of preexistent partially retrieved
-- 
1.8.3.1




[Bug-wget] Current wget release

2017-01-31 Thread Dale R. Worley
Are the following two commits that I submitted in the current release?
I can't see any signs of them, even though I've gone through the
paperwork and they seemed to be accepted.

commit de020df92a797aa1b9ad339bb8a01df872ef4f23
Author: Dale R. Worley <wor...@ariadne.com>
Date:   Wed Nov 2 12:14:46 2016 -0400

Improve documentation of --trust-server-names.

commit 98c8d987b1be3e8924dee87e4e78e1e4f2b2eb9e
Author: Dale R. Worley <wor...@ariadne.com>
Date:   Sun Oct 16 14:44:15 2016 -0400

Amend redirection behavior

* doc/wget.text: Update documentation.  Fix errors and omissions.
* src/convert.h (struct urlpos): Add link_redirect_p flag to struct urlpos to
  indicate the URL resulted from a redirection.
* src/recur.c (download_child): Suppress --no-parent check for redirection
  URLs.
* src/recur.c (download_child): Suppress directory checks for redirection
  URLs and page requisites (if -p).
* src/recur.c (descend_redirect): Set link_redirect_p flag on struct urlpos
  for redirection URLs.  Remove old test for suppressing directory checks for
  redirection URLs.

Dale



Re: [Bug-wget] Unable to establish SSL connection error

2017-01-06 Thread Dale R. Worley
Raitis Misa  writes:
> I'm trying to download APOD with line - wget.exe -x -r -k -E -nc -e 
> robots=off --page-requisites --tries=2 --level=2 --timeout=20 
> --user-agent="Mozilla 1.5" --secure-protocol=TLSv1 
> --no-check-certificate http://apod.nasa.gov/apod/archivepix.html
>
> other --secure-protocol= options gives the same result as well as not 
> using --no-check-certificate .
>
> GNU Wget 1.11.4
> Microsoft Windows [Version 10.0.10586]

You have not told us what the results were and any error messages that
were output.

When I execute the similar command line (on Linux), I get this error:

Connecting to apod.nasa.gov (apod.nasa.gov)|2001:4d0:2310:150::22|:443... connected.
OpenSSL: error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
Unable to establish SSL connection.

If I leave off the --secure-protocol option, it works.  Similarly, if I
specify --secure-protocol=TLSv1_1 or --secure-protocol=TLSv1_2, it
works.  So I suspect that TLSv1 isn't supported by that server.

Dale



Re: [Bug-wget] Favicon is not downloaded (Suggestion for improvement)

2017-01-06 Thread Dale R. Worley
Tim Rühsen  writes:
> Dale, do you mind to open an issue for that at https://
> github.com/rockdaboot/wget2 ?
> IMO, it should go there first.

Done.

I've also added it to Gnu Savannah's issue tracker.

Dale



Re: [Bug-wget] Favicon is not downloaded (Suggestion for improvement)

2017-01-05 Thread Dale R. Worley
Павел Серегов  writes:
> Often not exist code for favicon (in index.html), but site have.
>
> My suggestion:
> If use wget -m, need make download  http://example.com/favicon.ico
>
> How do you like the idea?

The documentation for -m is:

   -m
   --mirror
   Turn on options suitable for mirroring.  This option turns on
   recursion and time-stamping, sets infinite recursion depth and
   keeps FTP directory listings.  It is currently equivalent to -r -N
   -l inf --no-remove-listing.

I suggest defining "--favicon" specifically to download
http(s)://<host>/favicon.ico, and then add --favicon to the specification
of --mirror.
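
Until something like that exists, a caller can fetch the icon explicitly
next to the mirror (example.com is a placeholder; -P drops the file into
the directory that -m created for the host):

    wget -m http://example.com/
    wget -P example.com http://example.com/favicon.ico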

Dale



Re: [Bug-wget] ot: clicking email links advice

2017-01-04 Thread Dale R. Worley
voy...@sbt.net.au writes:
> is there a way to run wget with that url and, tell it to 'press' one of
> the buttons?

Basically, yes, since an HTML "submit" operation causes an HTTP request
to be sent.  What you need to learn is the details of the correct HTTP
request, and then figure out how to make wget perform that request.
There are a lot of details; probably the best way to start is to learn
how HTML form submissions work, which can be found in many HTML
references.
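
As a sketch of the general shape for a form whose method is POST (the URL
and field names here are hypothetical; the real ones come from the form's
"action" attribute and its "input" elements):

    wget --post-data='choice=approve&comment=looks%20good' \
         'http://example.com/cgi-bin/handler'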

Dale



Re: [Bug-wget] Wget wall clock time is very high

2016-12-14 Thread Dale R. Worley
Debopam Bhattacherjee  writes:
> I try to download a webpage along with it dependencies using the following
> command:
> ...
> The total download time is 1.4 seconds while the wall clock time is 6.8
> seconds which is much higher. Chrome, in comparison downloads and renders
> everythin in 2-3 seconds.
>
> Why is the wall clock time so high and how can it be reduced?

I can think of three reasons.  1) Chrome might have some of the
dependencies in its cache and doesn't have to download them now.  2)
Chrome probably has more than one TCP connection open to
www.stanford.edu, and so is fetching files in parallel.  3)
www.stanford.edu has IPv6 addresses as well as an IPv4 address.
Browsers implement the "Happy Eyeballs" algorithm (RFC 6555), which
gives better performance if IPv6 connectivity to the server is poor.
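
Reason 3 is easy to test, since wget's -4/--inet4-only option forces
IPv4, so comparing the two runs isolates any IPv6 penalty
(www.stanford.edu is used as an illustrative URL here):

    time wget -4 -p http://www.stanford.edu/
    time wget -p http://www.stanford.edu/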

Dale



Re: [Bug-wget] Does -o work correctly now?

2016-11-09 Thread Dale R. Worley
Tim Rühsen  writes:
> Looks like commit dd5c549f6af8e1143e1a6ef66725eea4bcd9ad50 broke it.
> 
> Sadly, the test suite doesn't catch it.

"Wajda, Piotr"  writes:
> Patch sent, should be good now.

Does the patch include a test in the test suite to catch this problem?
If not, the job isn't done yet.  (And some day a code change will cause
the problem to reappear.)

Dale



Re: [Bug-wget] [PATCH] Respect -o parameter

2016-11-09 Thread Dale R. Worley
Tim Rühsen  writes:
> Thanks, Piotr ! Commit has been pushed.

Great, thanks!

Dale



[Bug-wget] Does -o work correctly now?

2016-11-08 Thread Dale R. Worley
I've been getting a script working, and as part of it, it appears that
-o does not work correctly.  Specifically -o is supposed to send all
"messages" to a specified log file rather than stderr.  But what I see
is that no messages are sent to the log file.

The command in question is:

wget --no-verbose \
-o ~/temp/log-file \
--mirror --trust-server-names --convert-links --page-requisites \
--include-directories=/assignments \
--limit-rate=20k \
http://www.iana.org/assignments/index.html

With the (now obsolete) wget distributed with my OS, GNU Wget 1.16.1, -o
behaves as documented.  With wget from commit 00ae9b4 (which I think is
the latest), which reports itself as GNU Wget 1.18.88-00ae-dirty, -o
seems to have no effect.

There are any number of mistakes I could have made in this test, but
since the symptom is so simple, I figured I'd ask if anyone else has
noticed whether -o works or does not in the latest commits.

Dale



[Bug-wget] Recursive retrieval

2016-11-02 Thread Dale R. Worley
In regard to my difficulties with recursively retrieving
http://www.iana.org/assignments/index.html:  I discovered that one URL
(http://www.iana.org/assignments/forces/forces.xhtml) is pointed to by
no less than three different URLs:

http://www.iana.org/assignments/forces/forces.xhtml
http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml
http://www.iana.org/assignments/forces

The first is the proper URL for it, and the second two are redirected to
the first URL.

There are several other occurrences of this situation.

And I discovered that if I specify --trust-server-names, then wget will
realize that the redirection URL can be retrieved once, and links to the
other two URLs can be directed to that one file.  Without
--trust-server-names, wget considers all three URLs to be different,
despite that they are redirected to the same URL, and dutifully stores
essentially the same content three times.  With --trust-server-names,
wget understands that all three URLs are the same.

It turns out that this provides me with a much better mirror of the web
site.

I've attached a patch that improves the documentation of
--trust-server-names, to clarify that if -nd is not in effect, then the
file name is constructed from the entire redirection URL, not just the
last component.

(--trust-server-names is also mentioned in doc/metalink-standard.txt,
but that text does not seem to me to have the problem the patch
corrects.)

Dale
From 740c68d4d820334362dc93ce5c31b9d742f12558 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Wed, 2 Nov 2016 12:14:46 -0400
Subject: [PATCH] Improve documentation of --trust-server-names.

---
 doc/wget.texi | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index 91219e5..3632fd1 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1700,9 +1700,11 @@ with a http status code that indicates error.
 @cindex Trust server names
 @item --trust-server-names
 
-If this is set to on, on a redirect the last component of the
-redirection URL will be used as the local file name.  By default it is
-used the last component in the original URL.
+If this is set, on a redirect, the local file name will be based
+on the redirection URL.  By default the local file name is based on
+the original URL.  When doing recursive retrieving this can be helpful
+because in many web sites redirected URLs correspond to an underlying
+file structure, while link URLs do not.
 
 @cindex authentication
 @item --auth-no-challenge
@@ -3261,8 +3263,8 @@ Turn on recognition of the (non-standard) @samp{Content-Disposition}
 HTTP header---if set to @samp{on}, the same as @samp{--content-disposition}.
 
 @item trust_server_names = on/off
-If set to on, use the last component of a redirection URL for the local
-file name.
+If set to on, construct the local file name from redirection URLs
+rather than original URLs.
 
 @item continue = on/off
 If set to on, force continuation of preexistent partially retrieved
-- 
1.8.3.1



Re: [Bug-wget] Info wget

2016-10-27 Thread Dale R. Worley
Daniele Dinaro  writes:
> the form is this:
>
>> http://www.armaholic.net/downloader.php?download_file=chili/addons/units/BTC-Militia-version-1.1.7z
>> ">
>> 
>> 
>> 
>> What is two plus two?
>> 
>> 
>> 
>> 
>
>
> I have write this command for my .bat file
>
>> %WGET%  --http-user=Mozilla/5.0 --save-cookies=cookies.txt
>> --keep-session-cookies --header="Content-Type:
>> application/x-www-form-urlencoded" -c --no-check-certificate
>> --post-data=--post-data="captcha=I am a human^!=3910CD8F" -c
>> http://www.armaholic.net/downloader.php?download_file=chili/addons/units/BTC-Militia-version-1.1.7z

Here are suggestions that I can think of:

In many cases, web sites that use "form method=post" will also accept
"form method=get", that is, using "?" to add values to the URL.  It is
much easier to use wget to do that:

wget 
'http://www.armaholic.net/downloader.php?download_file=chili/addons/units/BTC-Militia-version-1.1.7z?x=3910CD8F=4=I%20am%20a%20human!submit=Click%20to%20download%20%3dBTC%3d%20Militia'

Check that you have included all of the field values that you need to
include.  It appears to me that your wget command does not provide
values for 'super' and 'submit' fields.  (Yes, the submit button is a
field whose value is transmitted, that's how the server knows which
submit button in the form was pressed.)

Check that you have properly encoded the values of the fields.  I don't
know the details of the rules myself, but I see that the MIME type is
x-www-form-urlencoded, which suggests that any character that is special
in a URL must be represented with %hh.  In this case, spaces and "="
appear in your values.

I see that your command includes:

>> --post-data=--post-data="captcha=I am a human^!=3910CD8F" -c

But there shouldn't be two "--post-data" items; the second one is part
of the *value* of the post-data option!

Is there a simpler form that you can use for practice?  For instance,
can you write a wget to fetch a Google search?

There is a "^" before the "!".  If this isn't needed as an escape
character in .bat files, it should be removed, as it isn't part of the
value you want.

Given that the fetch is not HTTPS, you can use wireshark, tcpdump, or
other networking monitoring tools to examine the packets when you click
the submit button on your browser.  That will show exactly how your
browser makes the request, and you can copy that.  Similarly, you can
see how wget sends the request, and adjust your command line
appropriately.  Or use "wget --debug" to see the request that wget
sends.

Dale



Re: [Bug-wget] Filtering for requisites and redirections

2016-10-16 Thread Dale R. Worley
Tim Ruehsen <tim.rueh...@gmx.de> writes:
> could you create local commits (maybe you already have) and attach the output 
> of 'git format-patch -1' (-1 = last one commit, -2 = last two commits, ...) ?

I've cleaned up the documentation changes and provided a proper commit
message.

Dale
From 14fe0982e02ee4c10b241f9e7a29fb3e5164c6d5 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Sun, 16 Oct 2016 14:44:15 -0400
Subject: [PATCH] Amend redirection behavior

* doc/wget.text: Update documentation.  Fix errors and omissions.
* src/convert.h (struct urlpos): Add link_redirect_p flag to struct urlpos to
  indicate the URL resulted from a redirection.
* src/recur.c (download_child): Suppress --no-parent check for redirection
  URLs.
* src/recur.c (download_child): Suppress directory checks for redirection
  URLs and page requisites (if -p).
* src/recur.c (descend_redirect): Set link_redirect_p flag on struct urlpos
  for redirection URLs.  Remove old test for suppressing directory checks for
  redirection URLs.
---
 doc/wget.texi | 41 +
 src/convert.h |  1 +
 src/recur.c   | 53 -
 3 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index f42773e..91219e5 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -2357,6 +2357,11 @@ your shell from expanding it, like in @samp{-A "*.mp3"} or @samp{-A '*.mp3'}.
 @itemx --reject-regex @var{urlregex}
 Specify a regular expression to accept or reject the complete URL.
 
+@strong{Note} that the effect of @samp{--accept-regex} and
+@samp{--reject-regex}  is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @item --regex-type @var{regextype}
 Specify the regular expression type.  Possible types are @samp{posix} or
 @samp{pcre}.  Note that to be able to use @samp{pcre} type, wget has to be
@@ -2431,18 +2436,32 @@ Specify a comma-separated list of directories you wish to follow when
 downloading (@pxref{Directory-Based Limits}).  Elements
 of @var{list} may contain wildcards.
 
+@strong{Note} that the effect of @samp{--include-directories} and
+@samp{--exclude-directories} is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @item -X @var{list}
 @itemx --exclude-directories=@var{list}
 Specify a comma-separated list of directories you wish to exclude from
 download (@pxref{Directory-Based Limits}).  Elements of
 @var{list} may contain wildcards.
 
+@strong{Note} that the effect of @samp{--include-directories} and
+@samp{--exclude-directories} is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @item -np
 @item --no-parent
-Do not ever ascend to the parent directory when retrieving recursively.
+Do not ascend to the parent directory when retrieving recursively.
 This is a useful option, since it guarantees that only the files
 @emph{below} a certain hierarchy will be downloaded.
 @xref{Directory-Based Limits}, for more details.
+
+@strong{Note} that the effect of @samp{--no-parent} is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
 @end table
 
 @c man end
@@ -2689,6 +2708,11 @@ comma-separated list, and given as an argument to @samp{-A}.
 The argument to @samp{--accept-regex} option is a regular expression which
 is matched against the complete URL.
 
+@strong{Note} that the effect of @samp{--accept-regex} and
+@samp{--reject-regex}  is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @cindex reject wildcards
 @cindex reject suffixes
 @cindex wildcards, reject
@@ -2709,9 +2733,14 @@ Analogously, to download all files except the ones beginning with
 expansion by the shell.
 @end table
 
-The argument to @samp{--accept-regex} option is a regular expression which
+The argument to @samp{--reject-regex} option is a regular expression which
 is matched against the complete URL.
 
+@strong{Note} that the effect of @samp{--accept-regex} and
+@samp{--reject-regex}  is suppressed for
+fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @noindent
 The @samp{-A} and @samp{-R} options may be combined to achieve even
 better fine-tuning of which files to retrieve.  E.g. @samp{wget -A
@@ -2778,12 +2807,16 @@ Wget offers three different options to deal with this requirement.  Each
 option description lists a short name, a long name, and the equivalent
 command in @file{.wgetrc}.
 
+@strong{Note} that the effect of all of these options is suppressed
+for fetching redirection URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @cin

Re: [Bug-wget] Filtering for requisites and redirections

2016-10-14 Thread Dale R. Worley
Tim Ruehsen  writes:
> could you create local commits (maybe you already have) and attach the output 
> of 'git format-patch -1' (-1 = last one commit, -2 = last two commits, ...) ?

I take that as a go-ahead for this approach to the issue.

Dale



Re: [Bug-wget] Filtering for requisites and redirections

2016-10-14 Thread Dale R. Worley
Tim Ruehsen  writes:
>> Perhaps we do not want to have --no-parent suppressed by
>> --page-requisites.  It seems that --no-parent is intended as a security
>> measure, and the existing code (as well as this proposal) violate its
>> fundamental premise.
>
> --no-parent seems to be intended as a bandwidth limiter together with -r. 
> When 
> talking about security, what realistic scenario do you have in mind ?
>
> Anyways, we definitely don't want to change the default behavior.

What I see in the manual page (admittedly, an old one, 1.16.1) is:

   -np
   --no-parent
   Do not ever ascend to the parent directory when retrieving
   recursively.  This is a useful option, since it guarantees that
   only the files below a certain hierarchy will be downloaded.

In the Info page, I see more:

In 2.11, "Recursive Accept/Reject Options":
'-np'
'--no-parent'
 Do not ever ascend to the parent directory when retrieving
 recursively.  This is a useful option, since it guarantees that
 only the files _below_ a certain hierarchy will be downloaded.
 *Note Directory-Based Limits::, for more details.
In 4.3, "Directory-Based Limits":
'-np'
'--no-parent'
'no_parent = on'
 The simplest, and often very useful way of limiting directories is
 disallowing retrieval of the links that refer to the hierarchy
 "above" than the beginning directory, i.e.  disallowing ascent to
 the parent directory/directories.

 The '--no-parent' option (short '-np') is useful in this case.
 Using it guarantees that you will never leave the existing
 hierarchy.  Supposing you issue Wget with:

  wget -r --no-parent http://somehost/~luzer/my-archive/

 You may rest assured that none of the references to
 '/~his-girls-homepage/' or '/~luzer/all-my-mpegs/' will be
 followed.  Only the archive you are interested in will be
 downloaded.  Essentially, '--no-parent' is similar to
 '-I/~luzer/my-archive', only it handles redirections in a more
 intelligent fashion.

 *Note* that, for HTTP (and HTTPS), the trailing slash is very
 important to '--no-parent'.  HTTP has no concept of a
 "directory"--Wget relies on you to indicate what's a directory and
 what isn't.  In 'http://foo/bar/', Wget will consider 'bar' to be a
 directory, while in 'http://foo/bar' (no trailing slash), 'bar'
 will be considered a filename (so '--no-parent' would be
 meaningless, as its parent is '/').

The text "You may rest assured that none of the references to
'/~his-girls-homepage/' or '/~luzer/all-my-mpegs/' will be
followed." suggests that --no-parent can be relied upon as a type of
security feature.

I am not personally deeply concerned about this.  But I want to see the
issue discussed on the mailing list, as the current default behavior
differs from the documentation in a way that might be important.

Dale



[Bug-wget] wget.texi

2016-10-13 Thread Dale R. Worley
I see in the current wget.info file:

'-np'
'--no-parent'
'no_parent = on'
 ...  Essentially, '--no-parent' is similar to
 '-I/~luzer/my-archive', only it handles redirections in a more
 intelligent fashion.

However, I don't see how the current code handles redirections any
differently than does --include.  Perhaps I am missing something?

Dale



[Bug-wget] Filtering for requisites and redirections

2016-10-13 Thread Dale R. Worley
If --page-requisites is specified along with --no-parent, then requisite
files will be downloaded even if their URLs would normally be suppressed
by --no-parent.  This is implemented by a test in section 4 of
download_child in recur.c, and a flag in struct urlpos, link_inline_p,
which says that the *context* of that URL is as a page requisite.

This suggests that the exceptional processing we want to implement for
redirections might be more systematically implemented by using the above
processing as a model, and not by testing the value returned by
download_child.  This involves adding a flag link_redirect_p to struct
urlpos; this flag functions as an alternative to the additional argument
to download_child that I previously suggested.

In addition, this approach avoids the problem of ensuring that
download_child returns the correct value if a URL fails more than one
test, e.g., --accept-regex and robots, because any tests that are to be
ignored in the context are not executed and do not affect the return
value.

It also suggests that we may want to define that --no-parent does not
apply to redirections, in the same way that it does not apply to page
requisites when --page-requisite is set.

I've also updated the TEXI file to describe the functional changes, and
also the previously-undocumented behavior of --page-requisites
overriding --no-parent.  The changes are in the attached diff.

However, looking at the documentation for --no-parent:

   -np
   --no-parent
   Do not ever ascend to the parent directory when retrieving
   recursively.  This is a useful option, since it guarantees that
   only the files below a certain hierarchy will be downloaded.

   Note that the effect of --no-parent is suppressed for fetching
   redirected URLs and for fetching page requisite URLs if
   --page-requisites is specified.

Perhaps we do not want to have --no-parent suppressed by
--page-requisites.  It seems that --no-parent is intended as a security
measure, and the existing code (as well as this proposal) violate its
fundamental premise.

Dale
diff --git a/doc/wget.texi b/doc/wget.texi
index f42773e..2990408 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -2357,6 +2357,11 @@ your shell from expanding it, like in @samp{-A "*.mp3"} or @samp{-A '*.mp3'}.
 @itemx --reject-regex @var{urlregex}
 Specify a regular expression to accept or reject the complete URL.
 
+@strong{Note} that the effect of @samp{--accept-regex} and
+@samp{--reject-regex}  is suppressed for
+fetching redirected URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @item --regex-type @var{regextype}
 Specify the regular expression type.  Possible types are @samp{posix} or
 @samp{pcre}.  Note that to be able to use @samp{pcre} type, wget has to be
@@ -2437,12 +2442,21 @@ Specify a comma-separated list of directories you wish to exclude from
 download (@pxref{Directory-Based Limits}).  Elements of
 @var{list} may contain wildcards.
 
+@strong{Note} that the effect of @samp{--include-directories} and
+@samp{--exclude-directories} is suppressed for
+fetching redirected URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
+
 @item -np
 @item --no-parent
 Do not ever ascend to the parent directory when retrieving recursively.
 This is a useful option, since it guarantees that only the files
 @emph{below} a certain hierarchy will be downloaded.
 @xref{Directory-Based Limits}, for more details.
+
+@strong{Note} that the effect of @samp{--no-parent} is suppressed for
+fetching redirected URLs and for fetching page requisite URLs if
+@samp{--page-requisites} is specified.
 @end table
 
 @c man end
diff --git a/src/convert.h b/src/convert.h
index e3ff6f0..af0ab79 100644
--- a/src/convert.h
+++ b/src/convert.h
@@ -72,6 +72,7 @@ struct urlpos {
   unsigned int link_noquote_html_p :1; /* from HTML, but doesn't need " */
   unsigned int link_expect_html :1; /* expected to contain HTML */
   unsigned int link_expect_css  :1; /* expected to contain CSS */
+  unsigned int link_redirect_p  :1; /* the url comes from a redirection */
 
   unsigned int link_refresh_p   :1; /* link was received from
 */
diff --git a/src/recur.c b/src/recur.c
index 1469e31..7bbcd44 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -651,13 +651,14 @@ download_child (const struct urlpos *upos, struct url *parent, int depth,
 
  If we descended to a different host or changed the scheme, ignore
  opt.no_parent.  Also ignore it for documents needed to display
- the parent page when in -p mode.  */
+ the parent page when in -p mode or redirections.  */
   if (opt.no_parent
   && schemes_are_similar_p (u->scheme, start_url_parsed->scheme)
   && 0 == strcasecmp (u->host, start_url_parsed->host)
   && (u->scheme != start_url_parsed->scheme
   || u->port == start_url_parsed->port)
-  && 

[Bug-wget] Filtering of page requisites

2016-10-12 Thread Dale R. Worley
So I've run into another version of the problem:  I'm using
--page-requisites, and they're getting filtered in much the same way as
redirections.  However, the new fixes don't change that behavior.

The example case is that
$ wget --mirror --convert-links --page-requisites --limit-rate=20k \
--include-directories=/assignments \
http://www.iana.org/assignments/index.html
does not fetch the CSS specified by
http://www.iana.org/assignments/index.html in

which is http://www.iana.org/_css/2015.1/screen.css.

It looks like requisite URLs are flagged with link_inline_p of struct
urlpos true.  If that flag is set and opt.page_requisites is set, then
test 4 of download_child is suppressed (which is the --no-parent test).

This change seems to add the same logic as is applied to redirections:

diff --git a/src/recur.c b/src/recur.c
index 1469e31..b1f9109 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -462,6 +462,12 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 
   r = download_child (child, url_parsed, depth,
   start_url_parsed, blacklist, i);
+ if (child->link_inline_p &&
+ (r == WG_RR_LIST || r == WG_RR_REGEX))
+   {
+ DEBUGP (("Ignoring decision for page requisite, decided to load it.\n"));
+ r = WG_RR_SUCCESS;
+   }
   if (r == WG_RR_SUCCESS)
 {
   ci = iri_new ();

and it has the expected effect, the requisites for index.html are
downloaded.

I've attached a patch for this that includes an update to the manual page.
Although the update to the manual page doesn't mention the suppression
of the --no-parent test.

Dale
diff --git a/doc/wget.texi b/doc/wget.texi
index f42773e..04d1562 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -2289,7 +2289,11 @@ wget -p http://@var{site}/1.html
 @end example
 
 Note that Wget will behave as if @samp{-r} had been specified, but only
-that single page and its requisites will be downloaded.  Links from that
+that single page and its requisites will be downloaded.
+(As with @samp{-r}, the @samp{--include-directories},
+@samp{--exclude-directories}, @samp{--accept-regex}, and @samp{--reject-regex}
+tests are not applied to page requisites.)
+Links from that
 page to external documents will not be followed.  Actually, to download
 a single page and all its requisites (even if they exist on separate
 websites), and make sure the lot displays properly locally, this author
diff --git a/src/recur.c b/src/recur.c
index 1469e31..fdb1d2e 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -462,6 +462,12 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 
   r = download_child (child, url_parsed, depth,
   start_url_parsed, blacklist, i);
+		  if (child->link_inline_p &&
+		  (r == WG_RR_LIST || r == WG_RR_REGEX))
+		{
+		  DEBUGP (("Ignoring decision for page requisite, decided to load it.\n"));
+		  r = WG_RR_SUCCESS;
+		}
   if (r == WG_RR_SUCCESS)
 {
   ci = iri_new ();


[Bug-wget] Wget redirection behavior

2016-10-10 Thread Dale R. Worley
I've built and tested the 3d1d5b3 commit, which is the current head, or
near it.

In regard to my most important problem, that code will do what I need,
which is to download the IANA assignments files:

$ wget -r --mirror --convert-links --page-requisites \
--include-directories=/assignments \
http://www.iana.org/assignments/index.html

In particular, the file http://www.iana.org/assignments/index.html
(which redirects to http://www.iana.org/assignments!) will be
downloaded.

Looking at your code changes, redirections are exempted from the tests
which cause download_child to return WG_RR_LIST or WG_RR_REGEX, which
are the tests based on the options
   --include-directories=list
   --exclude-directories=list
   --accept-regex urlregex
   --reject-regex urlregex
and no others.  These are the tests implemented by section 5 of
download_child.  (Am I correct here?)  In particular, the tests --accept
and --reject *are* applied.

It would be helpful if the manual page documented that tests are applied
to redirections, and which ones.  One way would be to add this text at
the start of the section "Recursive Accept/Reject Options":

   Recursive Accept/Reject Options
   Note that if an HTTP request receives a redirection response, the
   redirect URL is subjected to the same tests as any
   recursively-fetched URL, with the exception that the
   --include-directories, --exclude-directories, --accept-regex, and
   --reject-regex tests are not applied.

   -A acclist --accept acclist
   -R rejlist --reject rejlist
   ...

I see no reason to try at this time to figure out what options might be
needed to adjust this behavior, since I have only the one use case.

But looking at the organization of the code, it seems that we require
that download_child should return WG_RR_LIST or WG_RR_REGEX only if that
is the *only* reason that download_child would reject the URL.  E.g., if
the URL fails both the --accept-regex and the robots test,
download_child must return WG_RR_ROBOTS, not WG_RR_REGEX.  Otherwise,
redirections that fail both the --accept-regex and robots tests will be
followed, while redirections that fail only the robots test will not be
followed!  And that requires that the tests of section 5 be at the end
of download_child, and they aren't now.

So it seems to me that download_child needs to be reordered, and its
interface needs to document that the tests that redirections are
exempted from must remain at the end.

Alternatively, download_child could be provided with an additional
boolean argument telling whether the section 5 tests should be applied.

An independent item:  I notice that the tarball comes with *no* build
instructions whatsoever.  I have some memory that I've tangled with this
before, and that the correct behavior is to run "./bootstrap".  In any
case, that worked for me.  IIRC, a file named "INSTALL" cannot be put
into the tarball because it would conflict with the INSTALL link that
will be added by ./bootstrap.  But it would be useful to the newcomer if
a file named "INSTALL-tarball" contained simple contents like:

In the tarball distribution, the INSTALL file is absent because it
is created by the bootstrap process.

If the INSTALL file is absent, first run the bootstrap script to
create it:

$ ./bootstrap

Then follow the build instructions in the INSTALL file.

Thanks for all your help with this problem!

Dale



Re: [Bug-wget] [PATCH] Patch to change behavior with redirects under --recurse.

2016-10-07 Thread Dale R. Worley
Tim Ruehsen  writes:
> the changes in recur.c are not acceptable. They circumvent too many checks 
> like host-spanning, excludes and even --https-only.

I suppose it depends on what you consider the semantics to be.
Generally, I look at it if I've specified to download http://x/y/z and
http://x/y/z redirects to http://a/b/c, if http://x/y/z passes the tests
I've specified, then the page should be downloaded; the fact that it's
redirected to http://a/b/c is incidental.  Most checks *should* be
circumvented.

I guess I'd make exceptions for --https-only, which is presumably
placing a requirement on *how* the pages should be fetched, and probably
the robots check, as that's a policy statement by the server.

Dale



Re: [Bug-wget] [WARNING - NOT VIRUS SCANNED] Re: [PATCH] Patch to change behavior with redirects under --recurse.

2016-10-07 Thread Dale R. Worley
Tim Ruehsen  writes:
> Here is a less invasive patch for review & discussion.
>
> WDYT ?

It looks OK to me, but I'm not very familiar with the code.  I assume
that it passes all the tests we've written, which covers the cases that
I care about.

Dale



Re: [Bug-wget] [PATCH] Patch to change behavior with redirects under --recurse.

2016-10-07 Thread Dale R. Worley
Tim Ruehsen  writes:
> the changes in recur.c are not acceptable. They circumvent too many checks 
> like host-spanning, excludes and even --https-only.
>
> Maybe leaving descend_redirect() and checking the returned reject reason 
> could 
> solve your issue. I'll have a closer look at it soon.

Looking at the files I have saved, I certainly made a mistake in
"0004-Patch-to-change-behavior-with-redirects-under-recurs.patch".  This
section was for testing purposes and I should not have sent it:

diff --git a/src/recur.c b/src/recur.c
index 2b17e72..fe0d012 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -360,6 +360,7 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 {
  reject_reason r = descend_redirect (redirected, url_parsed,
                                      depth, start_url_parsed, blacklist, i);
+  r = WG_RR_SUCCESS;
   if (r == WG_RR_SUCCESS)
 {
   /* Make sure that the old pre-redirect form gets

Dale



Re: [Bug-wget] bug #45790: wget prints it's progress even when background

2016-09-29 Thread Dale R. Worley
Piotr  writes:
> I would like to avoid forcing users to hack like this ;).
> Wget should print to std* when in fg and print to wget.log when in bg, no 
> matter how user gets there.
> I don't think getpgrp() == tcgetpgrp(STDOUT_FILENO) is heavy and should 
> probaby be ok to check it when printing lines.

That makes sense, though I'd be careful to check for errors returned
from tcgetpgrp().
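
For instance (a standalone sketch, not wget code), the check might look
something like this, treating "no controlling terminal on stdout" as
background:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Return true only when we are the foreground process group of the
   terminal on stdout.  Failures of tcgetpgrp() (for example ENOTTY
   when stdout is a file or a pipe) count as "not foreground".  */
static bool
running_in_foreground (void)
{
  pid_t tpgrp = tcgetpgrp (STDOUT_FILENO);

  if (tpgrp == -1)
    {
      if (errno != ENOTTY)
        perror ("tcgetpgrp");
      return false;
    }
  return tpgrp == getpgrp ();
}

int
main (void)
{
  if (running_in_foreground ())
    printf ("foreground: print progress to the terminal\n");
  else
    printf ("background: write progress to the log file\n");
  return 0;
}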

Dale



Re: [Bug-wget] bug #45790: wget prints it's progress even when background

2016-09-28 Thread Dale R. Worley
"Wajda, Piotr"  writes:
> The case with stopping wget is obvious. CTRL+Z and bg should make wget 
> write to file and I can catch bg with SIGCONT.
> But I wonder what to do when after CTRL+Z and bg, user runs fg. In this 
> case there's no signal between bg anf fg,

Though the user could, instead of just "fg", do "fg", then Ctrl-Z, then
"fg" again.  The second "fg" would cause a SIGCONT, and wget could at
that point check that it had been foregrounded.  Not elegant, but fairly
simple.
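
A rough standalone sketch (not wget code) of re-checking the foreground
status from a SIGCONT handler:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t in_foreground = 1;

static void
handle_sigcont (int sig)
{
  pid_t tpgrp;

  (void) sig;
  /* Re-check whether we are the terminal's foreground process group
     now that we have been continued.  */
  tpgrp = tcgetpgrp (STDOUT_FILENO);
  in_foreground = (tpgrp != -1 && tpgrp == getpgrp ());
}

int
main (void)
{
  struct sigaction sa;

  memset (&sa, 0, sizeof sa);
  sa.sa_handler = handle_sigcont;
  sigemptyset (&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  sigaction (SIGCONT, &sa, NULL);

  for (;;)
    {
      sleep (5);
      /* A real progress printer would switch between the terminal and
         the log file here; this sketch only reports the state.  */
      printf (in_foreground ? "in foreground\n" : "in background\n");
      fflush (stdout);
    }
}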

Dale



Re: [Bug-wget] Still Failing: darnir/wget#84 (master - 796e30d)

2016-09-04 Thread Dale R. Worley
Travis CI <bui...@travis-ci.org> writes:
> Build Update for darnir/wget
> -
>
> Build: #84
> Status: Still Failing
>
> Duration: 8 minutes and 4 seconds
> Commit: 796e30d (master)
> Author: Dale R. Worley
> Message: Add tests for recursion and redirection.
>
> * testenv/Test-recursive-basic.py: New file. Test basic recursion
> * testenv/Test-recursive-include.py: New File. Recursion test with
> include directories
> * testenv/Test-redirect.py: New File. Basic redirection tests
> * testenv/Makefile.am: Add new tests to makefile
>
> View the changeset: 
> https://github.com/darnir/wget/compare/0787d7253edf...796e30dcea1e
>
> View the full build log and details: 
> https://travis-ci.org/darnir/wget/builds/157133145
>
> --
>
> You can configure recipients for build notifications in your .travis.yml 
> file. See https://docs.travis-ci.com/user/notifications

This is strange.  All of the Build jobs listed on
https://travis-ci.org/darnir/wget/builds/157133145 have the following
failure messages at the end of the log file, which show that basically
all of the tests in testenv failed, but give no details of how they
failed.  This is different from the preceding tests, which produce
verbose output showing the details of each test running.

make[4]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/tests'
make[3]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/tests'
Making check in util
make[3]: Entering directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/util'
make[3]: Nothing to be done for `check'.
make[3]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/util'
Making check in testenv
make[3]: Entering directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
make  check-TESTS
make[4]: Entering directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
make[5]: Entering directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
FAIL: Test-504.py
FAIL: Test-auth-basic-fail.py
FAIL: Test-auth-basic.py
FAIL: Test-auth-both.py
FAIL: Test-auth-digest.py
FAIL: Test-auth-no-challenge.py
FAIL: Test-auth-no-challenge-url.py
FAIL: Test-auth-retcode.py
FAIL: Test-auth-with-content-disposition.py
FAIL: Test-c-full.py
FAIL: Test-condget.py
FAIL: Test-Content-disposition-2.py
FAIL: Test-Content-disposition.py
FAIL: Test--convert-links--content-on-error.py
FAIL: Test-cookie-401.py
FAIL: Test-cookie-domain-mismatch.py
FAIL: Test-cookie-expires.py
FAIL: Test-cookie.py
SKIP: Test-hsts.py
FAIL: Test-Head.py
SKIP: Test--https.py
SKIP: Test--https-crl.py
SKIP: Test-pinnedpubkey-der-https.py
FAIL: Test-O.py
SKIP: Test-pinnedpubkey-der-no-check-https.py
PASS: Test-missing-scheme-retval.py
SKIP: Test-pinnedpubkey-hash-https.py
SKIP: Test-pinnedpubkey-hash-no-check-fail-https.py
SKIP: Test-pinnedpubkey-pem-fail-https.py
SKIP: Test-pinnedpubkey-pem-https.py
FAIL: Test-Post.py
FAIL: Test-recursive-basic.py
FAIL: Test-recursive-include.py
FAIL: Test-redirect.py
FAIL: Test-redirect-crash.py
FAIL: Test--rejected-log.py
FAIL: Test-reserved-chars.py
FAIL: Test--spider-r.py
=
28 of 29 tests failed
(9 tests were not run)
See testenv/test-suite.log
Please report to bug-wget@gnu.org
=
make[5]: *** [test-suite.log] Error 1
make[5]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
make[4]: *** [check-TESTS] Error 2
make[4]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
make[3]: *** [check-am] Error 2
make[3]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build/testenv'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build'
make[1]: *** [check] Error 2
make[1]: Leaving directory 
`/home/travis/build/darnir/wget/wget-UNKNOWN/_build'
make: *** [distcheck] Error 1


travis_time:end:0e69321d:start=1472831592173225314,finish=1472831865050343254,duration=272877117940
The command "./contrib/travis-ci $SSL" exited with 2.

Done. Your build exited with 1.

Dale



[Bug-wget] [PATCH] Patch to change behavior with redirects under --recurse.

2016-08-24 Thread Dale R. Worley
This is the change that I'm interested in.  I don't expect this to be
put into the distribution without a lot of discussion.

This version changes the behavior of --recurse:  If a file is
downloaded, it will be scanned for links to follow.  This differs from
the current behavior, in which the URL from which the contents were
obtained (after any redirections) is further checked to see if that URL
passes the recursion limitations.

This patch also includes a test to verify the new behavior.

I worry that this is a substantial change of behavior.  OTOH, the
current behavior seems to be very unintuitive.  And the fact that there
is no test for this behavior suggests that people have not been
depending on it.

Comments?

Dale
>From fe409b1447da28f2c5677ee2d8114e83a19c75f1 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Tue, 23 Aug 2016 23:07:23 -0400
Subject: [PATCH] Patch to change behavior with redirects under --recurse.  Add
 test for this change.

---
 src/recur.c| 55 ++--
 testenv/Makefile.am|  1 +
 testenv/Test-recursive-redirect.py | 64 ++
 3 files changed, 68 insertions(+), 52 deletions(-)
 create mode 100644 testenv/Test-recursive-redirect.py

diff --git a/src/recur.c b/src/recur.c
index 2b17e72..72059b4 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -191,8 +191,6 @@ typedef enum
 
 static reject_reason download_child (const struct urlpos *, struct url *, int,
   struct url *, struct hash_table *, struct iri *);
-static reject_reason descend_redirect (const char *, struct url *, int,
-  struct url *, struct hash_table *, struct iri *);
 static void write_reject_log_header (FILE *);
 static void write_reject_log_reason (FILE *, reject_reason,
   const struct url *, const struct url *);
@@ -358,19 +356,9 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
  want to follow it.  */
   if (descend)
 {
-  reject_reason r = descend_redirect (redirected, url_parsed,
-depth, start_url_parsed, blacklist, i);
-  if (r == WG_RR_SUCCESS)
-{
-  /* Make sure that the old pre-redirect form gets
- blacklisted. */
-  blacklist_add (blacklist, url);
-}
-  else
-{
-  write_reject_log_reason (rejectedlog, r, url_parsed, start_url_parsed);
-  descend = false;
-}
+		  /* Make sure that the old pre-redirect form gets
+			 blacklisted. */
+		  blacklist_add (blacklist, url);
 }
 
   xfree (url);
@@ -774,43 +762,6 @@ download_child (const struct urlpos *upos, struct url *parent, int depth,
   return reason;
 }
 
-/* This function determines whether we will consider downloading the
-   children of a URL whose download resulted in a redirection,
-   possibly to another host, etc.  It is needed very rarely, and thus
-   it is merely a simple-minded wrapper around download_child.  */
-
-static reject_reason
-descend_redirect (const char *redirected, struct url *orig_parsed, int depth,
-struct url *start_url_parsed, struct hash_table *blacklist,
-struct iri *iri)
-{
-  struct url *new_parsed;
-  struct urlpos *upos;
-  reject_reason reason;
-
-  assert (orig_parsed != NULL);
-
-  new_parsed = url_parse (redirected, NULL, NULL, false);
-  assert (new_parsed != NULL);
-
-  upos = xnew0 (struct urlpos);
-  upos->url = new_parsed;
-
-  reason = download_child (upos, orig_parsed, depth,
-  start_url_parsed, blacklist, iri);
-
-  if (reason == WG_RR_SUCCESS)
-blacklist_add (blacklist, upos->url->url);
-  else
-DEBUGP (("Redirection \"%s\" failed the test.\n", redirected));
-
-  url_free (new_parsed);
-  xfree (upos);
-
-  return reason;
-}
-
-
 /* This function writes the rejected log header. */
 static void
 write_reject_log_header (FILE *f)
diff --git a/testenv/Makefile.am b/testenv/Makefile.am
index deef18e..036b91c 100644
--- a/testenv/Makefile.am
+++ b/testenv/Makefile.am
@@ -75,6 +75,7 @@ if HAVE_PYTHON3
 Test-Post.py\
 Test-recursive-basic.py \
 Test-recursive-include.py   \
+Test-recursive-redirect.py  \
 Test-redirect.py\
 Test-redirect-crash.py  \
 Test--rejected-log.py   \
diff --git a/testenv/Test-recursive-redirect.py b/testenv/Test-recursive-redirect.py
new file 

[Bug-wget] TEST_NAME in Python tests?

2016-08-24 Thread Dale R. Worley
In the file testenv/README is:

Next, is the const variable, TEST_NAME that defines the name of the Test.

Both, the HTTPTest and FTPTest modules have the same prototype:
{
name,
pre_hook,
test_options,
post_hook,
protocols
}
name should be a string, and is usually passed to the TEST_NAME variable,

Remember to always name the Test correctly using the TEST_NAME variable. 
This
is essential since a directory with the Test Name is created and this can
cause synchronization problems when the Parallel Test Harness is used.
One can use the following command on Unix systems to check for TEST_NAME
clashes:
$ grep -r -h "TEST_NAME =" | cut -c13- | uniq -c -d

However, *none* of the tests in testenv has a TEST_NAME parameter.

In addition, the HTTPTest class does not have a "name" argument for its
constructor.  (See testenv/test/http_test.py, line 15.)

What should be done about this?

Dale



[Bug-wget] [PATCH] Add tests for recursion and redirection.

2016-08-24 Thread Dale R. Worley
This adds basic tests for --recursive and for handling 301 (redirection)
responses.

Dale
>From 552cc72fd0957420c7354f3619799aef38788c5e Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Tue, 23 Aug 2016 18:09:16 -0400
Subject: [PATCH 3/4] Add tests for recursion and redirection.

---
 testenv/Makefile.am   |  3 +++
 testenv/Test-recursive-basic.py   | 57 +++
 testenv/Test-recursive-include.py | 56 ++
 testenv/Test-redirect.py  | 57 +++
 4 files changed, 173 insertions(+)
 create mode 100755 testenv/Test-recursive-basic.py
 create mode 100755 testenv/Test-recursive-include.py
 create mode 100755 testenv/Test-redirect.py

diff --git a/testenv/Makefile.am b/testenv/Makefile.am
index faf86a9..deef18e 100644
--- a/testenv/Makefile.am
+++ b/testenv/Makefile.am
@@ -73,6 +73,9 @@ if HAVE_PYTHON3
 Test-pinnedpubkey-pem-fail-https.py \
 Test-pinnedpubkey-pem-https.py  \
 Test-Post.py\
+Test-recursive-basic.py \
+Test-recursive-include.py   \
+Test-redirect.py\
 Test-redirect-crash.py  \
 Test--rejected-log.py   \
 Test-reserved-chars.py  \
diff --git a/testenv/Test-recursive-basic.py b/testenv/Test-recursive-basic.py
new file mode 100755
index 000..f425ea2
--- /dev/null
+++ b/testenv/Test-recursive-basic.py
@@ -0,0 +1,57 @@
+#!/usr/bin/env python3
+from sys import exit
+from test.http_test import HTTPTest
+from test.base_test import HTTP, HTTPS
+from misc.wget_file import WgetFile
+
+"""
+Basic test of --recursive.
+"""
+# File Definitions ###
+File1 = """
+text
+text
+"""
+File2 = "With lemon or cream?"
+File3 = "Surely you're joking Mr. Feynman"
+
+File1_File = WgetFile ("a/File1.html", File1)
+File2_File = WgetFile ("a/File2.html", File2)
+File3_File = WgetFile ("b/File3.html", File3)
+
+WGET_OPTIONS = "--recursive --no-host-directories"
+WGET_URLS = [["a/File1.html"]]
+
+Servers = [HTTP]
+
+Files = [[File1_File, File2_File, File3_File]]
+Existing_Files = []
+
+ExpectedReturnCode = 0
+ExpectedDownloadedFiles = [File1_File, File2_File, File3_File]
+Request_List = [["GET /a/File1.html",
+ "GET /a/File2.html",
+ "GET /b/File3.html"]]
+
+ Pre and Post Test Hooks #
+pre_test = {
+"ServerFiles"   : Files,
+"LocalFiles": Existing_Files
+}
+test_options = {
+"WgetCommands"  : WGET_OPTIONS,
+"Urls"  : WGET_URLS
+}
+post_test = {
+"ExpectedFiles" : ExpectedDownloadedFiles,
+"ExpectedRetcode"   : ExpectedReturnCode
+}
+
+err = HTTPTest (
+pre_hook=pre_test,
+test_params=test_options,
+post_hook=post_test,
+protocols=Servers
+).begin ()
+
+exit (err)
diff --git a/testenv/Test-recursive-include.py b/testenv/Test-recursive-include.py
new file mode 100755
index 000..1fe33cd
--- /dev/null
+++ b/testenv/Test-recursive-include.py
@@ -0,0 +1,56 @@
+#!/usr/bin/env python3
+from sys import exit
+from test.http_test import HTTPTest
+from test.base_test import HTTP, HTTPS
+from misc.wget_file import WgetFile
+
+"""
+Basic test of --recursive.
+"""
+# File Definitions ###
+File1 = """
+text
+text
+"""
+File2 = "With lemon or cream?"
+File3 = "Surely you're joking Mr. Feynman"
+
+File1_File = WgetFile ("a/File1.html", File1)
+File2_File = WgetFile ("a/File2.html", File2)
+File3_File = WgetFile ("b/File3.html", File3)
+
+WGET_OPTIONS = "--recursive --no-host-directories --include-directories=a"
+WGET_URLS = [["a/File1.html"]]
+
+Servers = [HTTP]
+
+Files = [[File1_File, File2_File, File3_File]]
+Existing_Files = []
+
+ExpectedReturnCode = 0
+ExpectedDownloadedFiles = [File1_File, File2_File]
+Request_List = [["GET /a/File1.html",
+ "GET /a/File2.html"]]
+
+ Pre and Post Test Hooks #
+pre_test = {
+"ServerFiles"   : Files,
+"LocalFiles": Existing_Files
+}
+test_options = {
+"WgetCommands"  : WGET_OPTIONS,
+"Urls"  : WGET_URLS
+}
+post_test = {
+"ExpectedFiles" : 

[Bug-wget] [PATCH] Sort test names into order.

2016-08-24 Thread Dale R. Worley
This patch sorts the test names in the list in testenv/Makefile.am.

Dale
>From e116595d66710c47dbee156de3cef4c6b2733fb1 Mon Sep 17 00:00:00 2001
From: "Dale R. Worley" <wor...@ariadne.com>
Date: Tue, 23 Aug 2016 18:05:11 -0400
Subject: [PATCH 2/4] Sort test names into order.

---
 testenv/Makefile.am | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/testenv/Makefile.am b/testenv/Makefile.am
index 33ce61b..faf86a9 100644
--- a/testenv/Makefile.am
+++ b/testenv/Makefile.am
@@ -42,7 +42,8 @@ if WITH_SSL
 endif
 
 if HAVE_PYTHON3
-  TESTS = Test-auth-basic-fail.py   \
+  TESTS = Test-504.py   \
+Test-auth-basic-fail.py \
 Test-auth-basic.py  \
 Test-auth-both.py   \
 Test-auth-digest.py \
@@ -51,6 +52,7 @@ if HAVE_PYTHON3
 Test-auth-retcode.py\
 Test-auth-with-content-disposition.py   \
 Test-c-full.py  \
+Test-condget.py \
 Test-Content-disposition-2.py   \
 Test-Content-disposition.py \
 Test--convert-links--content-on-error.py\
@@ -59,24 +61,22 @@ if HAVE_PYTHON3
 Test-cookie-expires.py  \
 Test-cookie.py  \
 Test-Head.py\
+Test-hsts.py\
 Test--https.py  \
 Test--https-crl.py  \
 Test-missing-scheme-retval.py   \
+Test-O.py   \
 Test-pinnedpubkey-der-https.py  \
 Test-pinnedpubkey-der-no-check-https.py \
 Test-pinnedpubkey-hash-https.py \
 Test-pinnedpubkey-hash-no-check-fail-https.py   \
 Test-pinnedpubkey-pem-fail-https.py \
 Test-pinnedpubkey-pem-https.py  \
-Test-hsts.py\
-Test-O.py   \
 Test-Post.py\
-Test-504.py \
-Test--spider-r.py   \
-Test--rejected-log.py   \
 Test-redirect-crash.py  \
+Test--rejected-log.py   \
 Test-reserved-chars.py  \
-Test-condget.py \
+Test--spider-r.py   \
 $(METALINK_TESTS)
 
 endif
-- 
2.10.0.rc0.17.gd63263a



Re: [Bug-wget] How remove all after ? or @ (REAL PROBLEM)

2016-08-23 Thread Dale R. Worley
wor...@alum.mit.edu (Dale R. Worley) writes:
> Павел Серегов <siaro...@gmail.com> writes:
>> file on server http://site.com/style.css?v1000
>> downloaded file style.css@v1000
>>
>> How remove @v1000
>> I want result: style.css (without @v1000)
>
> The easiest way is to specify the output file name you want with
> "--output-file=style.css".

Correction, that should be "--output-document=style.css" (for a single file).

Павел Серегов <siaro...@gmail.com> writes:
> The file is not one. I'm downloading the whole site.
> wget -m -E -o wget_log.txt http://www.store-discount.ru/

The problem is that the URL contains a query part ("?v1000"), and wget
needs to record that in the file name.  In general, it is possible that
wget will download both http://site.com/style.css?v1000 and
http://site.com/style.css?v2000, and it needs separate file names for
both.

wget does not provide a facility to adjust the names of the downloaded
files, except in a few particular ways.  You will probably have to first
download all the files and then run a program to rename the downloaded
files to the names you want.
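
For example, a crude standalone renamer along these lines (not part of
wget, and it only handles the simple case of a trailing "@..." suffix;
the -E case discussed below needs more care) could be run over the
downloaded files afterwards:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (int argc, char **argv)
{
  int i;

  /* For each file name given on the command line, strip everything
     from the first '@' on and rename the file accordingly.  */
  for (i = 1; i < argc; i++)
    {
      const char *old_name = argv[i];
      const char *at = strchr (old_name, '@');
      size_t len;
      char *new_name;

      if (at == NULL || at == old_name)
        continue;

      len = (size_t) (at - old_name);
      new_name = malloc (len + 1);
      if (new_name == NULL)
        return 1;
      memcpy (new_name, old_name, len);
      new_name[len] = '\0';

      if (rename (old_name, new_name) != 0)
        perror (old_name);
      else
        printf ("%s -> %s\n", old_name, new_name);
      free (new_name);
    }
  return 0;
}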

> =
> Next problem "duplicate". if use "-E"
> 
> without:
> jquery.fancybox.css@1471536869
> 
> with "-E"
> jquery.fancybox@1471536869.css
> =

That is the behavior that is prescribed for -E.  Similarly, you will
probably have to run a separate program to rename the files.

Dale



Re: [Bug-wget] How remove all after ? or @ (REAL PROBLEM)

2016-08-22 Thread Dale R. Worley
Павел Серегов  writes:
> file on server http://site.com/style.css?v1000
> downloaded file style.css@v1000
>
> How remove @v1000
> I want result: style.css (without @v1000)

The easiest way is to specify the output file name you want with
"--output-file=style.css".

Dale



Re: [Bug-wget] Wget tests

2016-08-21 Thread Dale R. Worley
wor...@alum.mit.edu (Dale R. Worley) writes:
> Tim Rühsen <tim.rueh...@gmx.de> writes:
>> Somethin went wrong... try again:
>
> I will investigate why this happened.

Ugh.  I had checked out a very old version of wget that didn't have
Makefile.am, etc.

After I fixed that, I discovered that older versions of Git do not
handle submodules correctly if the path in $PWD uses symbolic links.  So
I've retrieved a newer version of Git to be able to work on wget.

And there is at least one significant error in testenv/Test-Proto.py...

Dale



Re: [Bug-wget] Wget tests

2016-08-17 Thread Dale R. Worley
Tim Rühsen  writes:
> Somethin went wrong... try again:

Yes, you are correct.  I repeated the "git clone", and the Makefile.am
files are present in the new clone.  E.g., the tests/Makefile.am clearly
lists the Perl tests, and the Perl test files are present (e.g.,
tests/Test-auth-basic.px).

Looking at my old working directory, tests/Makefile.am is missing but
tests/Makefile is present.  (It also contains the list of tests at
"PX_TESTS =", but that is impossible to find unless you know the
variable name to look for.)  But the *.px files are missing.

I will investigate why this happened.

It appears that there is only one test for redirection behavior:
testenv/Test-redirect-crash.py.

Thanks for your assistance!

Dale



Re: [Bug-wget] Wget tests

2016-08-16 Thread Dale R. Worley
Tim Rühsen  writes:
> We use standard automake tests (search the internet if you are interested in 
> details).
>
> We have the 'legacy' tests/ directory with tests written in Perl. And the 
> 'current' testenv/ directory with tests written in Python.
>
> See tests/Makefile.am resp. testenv/Makefile.am for the list of executed 
> tests.

That points to part of the problem:  I don't have those files in the Git
repository I downloaded from savannah.gnu.edu.  ("git clone
git://git.savannah.gnu.org/wget.git")

Dale



[Bug-wget] Wget tests

2016-08-15 Thread Dale R. Worley
Can someone give me a hint how the wget tests work?  The test
directories seem to contain no high-level documentation.  As far as I
can tell, the pairs of files *.{trs,log} either are or correspond to the
various tests, but I can't find the file(s) that specify what the test
invocations of wget are, nor what files the test HTTP server sees.

Thanks,

Dale



[Bug-wget] Scanning redirected files for links

2016-08-13 Thread Dale R. Worley
The behavior that I disagree with (i.e., not scanning a fetched file for
links if the ultimate URL does not pass the recursion tests) seems to
have been introduced in one commit from 2001-11-25:

$ git cat-file -p f6921edc
tree 76275b7fc2acbf9b66415cc17788755b1500b178
parent 2c41d783c62f1252701b8cb5a8adbcf8efbf0275
author hniksic  1006737108 -0800
committer hniksic  1006737108 -0800

[svn] Be careful whether we want to descend into results of redirection.
Published in .

The complete diff is below.

It doesn't appear that there are *any* tests that test the behavior of
redirections; at least, no test contains the word "redirect".

So I would like to discuss:  Should this behavior be changed
unconditionally, or should it be controlled by an option?

Dale



$ git diff f6921edc{^,}
diff --git a/src/ChangeLog b/src/ChangeLog
index c051e02..3a29317 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2001-11-26  Hrvoje Niksic  
+
+   * recur.c (descend_redirect_p): New function.
+   (retrieve_tree): Make sure redirections are not blindly followed.
+
 2001-11-04  Alan Eldridge  
 
* config.h.in: added HAVE_RANDOM.
diff --git a/src/recur.c b/src/recur.c
index 3bcae52..8e71383 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -152,6 +152,9 @@ url_dequeue (struct url_queue *queue,
 
 static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int,
  struct url *, struct hash_table *));
+static int descend_redirect_p PARAMS ((const char *, const char *, int,
+  struct url *, struct hash_table *));
+
 
 /* Retrieve a part of the web beginning with START_URL.  This used to
be called "recursive retrieval", because the old function was
@@ -224,14 +227,25 @@ retrieve_tree (const char *start_url)
status = retrieve_url (url, &file, &redirected, NULL, &dt);
opt.recursive = oldrec;
 
+   if (file && status == RETROK
+   && (dt & RETROKF) && (dt & TEXTHTML))
+ descend = 1;
+
if (redirected)
  {
+   /* We have been redirected, possibly to another host, or
+  different path, or wherever.  Check whether we really
+  want to follow it.  */
+   if (descend)
+ {
+   if (!descend_redirect_p (redirected, url, depth,
+start_url_parsed, blacklist))
+ descend = 0;
+ }
+
xfree (url);
url = redirected;
  }
-   if (file && status == RETROK
-   && (dt & RETROKF) && (dt & TEXTHTML))
- descend = 1;
   }
 
   if (descend
@@ -307,7 +321,8 @@ retrieve_tree (const char *start_url)
   opt.delete_after ? "--delete-after" :
   "recursive rejection criteria"));
  logprintf (LOG_VERBOSE,
-(opt.delete_after ? _("Removing %s.\n")
+(opt.delete_after
+ ? _("Removing %s.\n")
  : _("Removing %s since it should be rejected.\n")),
 file);
  if (unlink (file))
@@ -525,6 +540,43 @@ descend_url_p (const struct urlpos *upos, struct url *parent, int depth,
 
   return 0;
 }
+
+/* This function determines whether we should descend the children of
+   the URL whose download resulted in a redirection, possibly to
+   another host, etc.  It is needed very rarely, and thus it is merely
+   a simple-minded wrapper around descend_url_p.  */
+
+static int
+descend_redirect_p (const char *redirected, const char *original, int depth,
+   struct url *start_url_parsed, struct hash_table *blacklist)
+{
+  struct url *orig_parsed, *new_parsed;
+  struct urlpos *upos;
+  int success;
+
+  orig_parsed = url_parse (original, NULL);
+  assert (orig_parsed != NULL);
+
+  new_parsed = url_parse (redirected, NULL);
+  assert (new_parsed != NULL);
+
+  upos = xmalloc (sizeof (struct urlpos));
+  memset (upos, 0, sizeof (*upos));
+  upos->url = new_parsed;
+
+  success = descend_url_p (upos, orig_parsed, depth,
+  start_url_parsed, blacklist);
+
+  url_free (orig_parsed);
+  url_free (new_parsed);
+  xfree (upos);
+
+  if (!success)
+DEBUGP (("Redirection \"%s\" failed the test.\n", redirected));
+
+  return success;
+}
+
 
 /* Register that URL has been successfully downloaded to FILE. */
 
@@ -572,32 +624,21 @@ register_html (const char *url, const char *file)
   downloaded_html_files = slist_prepend (downloaded_html_files, file);
 }
 
-/* convert_links() is called from recursive_retrieve() after we're
-   done with an HTML file.  This call to convert_links is not complete
-   because it converts only the downloaded files, and Wget cannot know
-   which files will be downloaded afterwards.  So, if we have file
-   fileone.html 

[Bug-wget] Additional ideas for wget

2016-08-11 Thread Dale R. Worley
(1) The description of the options that limit recursion needs to be
written better so that it is clear how various combinations of options
interact.

(2) I'd like to be able to make wget annotate each page with the URL
from which it was obtained, so if I later look at the file, I know its
origin.  With HTML files, it seems like it would be workable to append
to each file a trailing comment along the lines of
"\n<!-- Retrieved from <URL> -->\n".
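
Roughly like this (a standalone sketch of just the annotation step, not
a proposed patch; in wget itself it would presumably happen during the
HTML post-processing):

#include <stdio.h>

/* Append a trailing HTML comment naming the source URL to a saved page. */
static int
annotate_with_origin (const char *path, const char *url)
{
  FILE *f = fopen (path, "a");

  if (f == NULL)
    {
      perror (path);
      return -1;
    }
  fprintf (f, "\n<!-- Retrieved from %s -->\n", url);
  return fclose (f);
}

int
main (int argc, char **argv)
{
  if (argc != 3)
    {
      fprintf (stderr, "usage: %s FILE URL\n", argv[0]);
      return 2;
    }
  return annotate_with_origin (argv[1], argv[2]) != 0;
}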

Comments?

Dale



[Bug-wget] Recursion problem with wget

2016-08-11 Thread Dale R. Worley
In regard to my problem, http://savannah.gnu.org/bugs/?48708

The behavior stems from (what seems to me to be) an oddity in how wget
handles recursion:  If a page is fetched from a URL, but that fetch
involves no HTTP redirection, then the embedded links are tested against
the recursion criteria to see if they should be fetched.  But if the
page fetch involves redirection, the page is fetched, but if the
ultimate URL of the redirection does not itself pass the recursion
criteria, the links in the page are not considered, even if they pass
the recursion criteria.

My preferred behavior is that all pages that are retrieved are scanned
for embedded links in any case.

The behavior can be "corrected" straightforwardly:

diff --git a/src/recur.c b/src/recur.c
index 2b17e72..91cc585 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -360,6 +361,7 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
 {
   reject_reason r = descend_redirect (redirected, url_parsed,
                                       depth, start_url_parsed, blacklist, i);
+  r = WG_RR_SUCCESS;
   if (r == WG_RR_SUCCESS)
 {
   /* Make sure that the old pre-redirect form gets

This, of course, isn't the proper and final fix.

It seems to me that making this change in the code would change its
behavior sufficiently that we would have to worry about backward
compatibility.  Ideally, I'd like the new default behavior to be my
preferred behavior, and use an option to restore the previous behavior.
But it might be necessary to use an option to enable my preferred
behavior to prevent disruption.

Interestingly, "make check" *succeeds* with the above code change, so
the test suite is *not* testing for this behavior.

Comments?

Dale



Re: [Bug-wget] What ought to be a simple use of wget

2016-08-05 Thread Dale R. Worley
I've tried some further experiments.  One thing I realized was that the
"protocol" directory contains only two files, one of which I wanted, so
I could get very close to the ideal with

$ wget -r --include-directories=/assignments,/protocols \
    http://www.iana.org/protocols/index.html

Unfortunately, the web site is constructed to confound that, because
protocols/index.html redirects *also* -- to
http://www.iana.org/protocols!  (Even though that is also the name of a
directory.)  There's no way to retrieve the root HTML file without wget
considering its "directory" to be "http://www.iana.org/".

So there's no nice solution without either revising the web site or
changing wget's behavior.

Dale



Re: [Bug-wget] What ought to be a simple use of wget

2016-08-03 Thread Dale R. Worley
Ander Juaristi  writes:
> I'm seeing it always redirects to www.iana.org/protocols
>
> Would -A protocols work for you?
>
> e.g
> wget mirror --convert-links --no-parent --page-requisites -A
> protocols http://www.iana.org/protocols

That gets a lot of messages like:

   Removing www.iana.org/assignments/ancp/ancp.xhtml since it should be rejected.

As far as I can tell, that's because the file name doesn't end in
"protocols".

Dale



Re: [Bug-wget] What ought to be a simple use of wget

2016-08-03 Thread Dale R. Worley
Tim Rühsen  writes:
> If you have a look at 'man wget'/--page-requisites, the stuff is explained 
> quite well. To me it looks like you are missing --level 2.
>
> If --level 2 is not what you want. you could make your point clear by making 
> up a small document tree as an example.

I definitely don't want --level 2, because that limits how many links
the recursion can traverse.  If all the links are within the
/assignments/ directory, wget should follow an unlimited number.

Here's an outline of what I want retrieved, based on Matthew White's
listing:

www.iana.org/
Some or all of these files are OK, since they're likely page requisites:
www.iana.org/_css/
www.iana.org/_css/2015.1/
www.iana.org/_css/2015.1/print.css
www.iana.org/_css/2015.1/screen.css
www.iana.org/_img/
www.iana.org/_img/2011.1/
www.iana.org/_img/2011.1/icons/
...
www.iana.org/_js/
www.iana.org/_js/2013.1/
www.iana.org/_js/2013.1/iana.js
www.iana.org/_js/2013.1/jquery.js
Nothing in these directories:
www.iana.org/about/
www.iana.org/abuse/
Lots and lots of files in this directory:
www.iana.org/assignments/
www.iana.org/assignments/_6lowpan-parameters/
www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml.html
www.iana.org/assignments/_support/
www.iana.org/assignments/_support/iana-registry.css
www.iana.org/assignments/_support/jquery.js
www.iana.org/assignments/_support/sort.js
www.iana.org/assignments/aaa-parameters/
www.iana.org/assignments/aaa-parameters/aaa-parameters-1.csv
www.iana.org/assignments/aaa-parameters/aaa-parameters.txt
www.iana.org/assignments/aaa-parameters/aaa-parameters.xhtml.html
www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
www.iana.org/assignments/abfab-parameters/
www.iana.org/assignments/abfab-parameters/abfab-parameters.txt
www.iana.org/assignments/abfab-parameters/abfab-parameters.xhtml.html
www.iana.org/assignments/abfab-parameters/abfab-parameters.xml
www.iana.org/assignments/abfab-parameters/urn-parameters.csv
...
Nothing in these directories:
www.iana.org/dnssec/
www.iana.org/domains/
www.iana.org/go/
www.iana.org/help/
www.iana.org/numbers/
www.iana.org/procedures/
www.iana.org/protocols/
www.iana.org/reports/

Dale



Re: [Bug-wget] What ought to be a simple use of wget

2016-08-03 Thread Dale R. Worley
Matthew White  writes:
> wget --recursive   \
>  --page-requisites \
>  --convert-links   \
>  --domains="www.iana.org"  \
>  --reject "robots.txt","reports","contact" \
>  --exclude-directories="/go,/assignments,/_img,/_js,/_css,/domains,/performance,/about,/protocols,/procedures,/dnssec,/reports,/help,/abuse,/numbers,/reviews,/time-zones,/2000,/2001" \
>  http://www.iana.org/assignments/index.html

True, using --exclude-directories I can isolate what I want, but as you
note, that requires actually knowing all of the children of the root in
advance.  Whereas it seems to me that there should be a straightforward
way of instructing wget to exclude "everything but X".

> wget --recursive  \
>  --no-clobber \
>  --page-requisites\
>  --adjust-extension   \
>  --convert-links  \
>  --span-hosts \
>  --domains="www.iana.org" \
>  http://www.iana.org/assignments/index.html

As you said, that command returned lots of things that aren't in
http://www.iana.org/assignments.

Dale