Re: [Bug-wget] GSoC project proposals, speed up

2015-03-12 Thread Darshit Shah
Hi Laura,

>
> It is great to see so many students taking an interest in Wget. When I
> went through the list of proposed projects, yours really caught my eye:
> C, protocols, unambiguous and certainly useful, oh yes! :)

Thanks a lot for those kind words. Trust me when I say, we're
overwhelmed with the response we've seen for GSoC this year. I can
only hope the same level of enthusiasm remains throughout the period
and we can convert a few of these students into regular Wget
developers.

>
> I have been exploring the speed up ideas, namely the if-modified-since
> headers and the TCP Fast Open implementation and I would like to make sure
> I am walking in the right direction, both with the approach and the
> assumptions.
>
> *if-modified-since*
>
> The idea is to reduce the amount of requests to obtain modified documents,
> moving from the current three steps (HEAD, last-modified check and GET) to
> the new conditional header. This should include a better handling of the
> possible responses as well, like HTTP_STATUS_NOT_MODIFIED, that seems to be
> defined but not treated. Plus a new argument (e.g. --if-modified), config,
> tests...

You're mostly on track with this one. I'm not sure if we want a
--if-modified switch. Instead, the use of the if-modified-since header
should be enabled in all cases where it is relevant. However, this
does not mean that the old timestamp checking is no longer used. A lot
of websites do not support if-modified-since and we do not wish to
change the default behavior of Wget. Hence, both checks should
exist, with if-modified-since taking priority.
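
To make the two-step idea concrete, here is a minimal sketch of building
the conditional header from a local file's modification time and reacting
to the status code. The names format_if_modified_since and
ims_action_for_status are hypothetical helpers for illustration, not
existing Wget functions, and the real integration with the timestamping
code would surely look different.

```c
#include <time.h>

/* Hypothetical helper (not a Wget function): write an If-Modified-Since
   header line for MTIME into BUF.  Returns 0 on success, -1 on failure.  */
static int
format_if_modified_since (time_t mtime, char *buf, size_t bufsize)
{
  struct tm *gmt = gmtime (&mtime);
  if (gmt == NULL)
    return -1;
  /* HTTP dates look like "Sun, 06 Nov 1994 08:49:37 GMT".  %a and %b are
     locale-dependent, so this assumes the default "C" locale.  */
  if (strftime (buf, bufsize,
                "If-Modified-Since: %a, %d %b %Y %H:%M:%S GMT\r\n", gmt) == 0)
    return -1;
  return 0;
}

/* Hypothetical dispatch on the response status of the conditional GET.  */
enum ims_action { IMS_KEEP_LOCAL, IMS_DOWNLOAD, IMS_OTHER };

static enum ims_action
ims_action_for_status (int status)
{
  if (status == 304)
    return IMS_KEEP_LOCAL;  /* Not Modified: the local copy is current.  */
  if (status == 200)
    return IMS_DOWNLOAD;    /* Modified, or the server ignored the header;
                               the old timestamp check can still apply.  */
  return IMS_OTHER;         /* Redirects, errors, etc. go the usual way.  */
}
```

A server that does not understand the header simply answers 200, which is
why the existing Last-Modified comparison still has to stay around.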

>
> It is a nice improvement for particular applications, e.g. efficient
> updates for caches or time-saving crawlers, and for an overall bandwidth
> reduction.
>
> *TCP Fast Open*
>
> On the other hand, TFO has a wider application, since it lives at a
> lower level. TFO allows servers to start sending their responses directly
> after the SYN/ACK message, without waiting for the final ACK of the
> three-way handshake. It is based on the exchange of a secure token/cookie
> during the first connection, and saves one RTT on each connection after that.
>
> Particularly for HTTP requests, with short data flows, the overall impact
> can be very high (the RFC estimates it at up to 40%, which sounds like a
> rather optimistic best case). Taking some measurements after the
> implementation to verify it will complement the project nicely.
>
> Linux contains the full implementation of TFO and, since 3.13, it is
> enabled by default. Most other common operating systems (Windows,
> Mac OS) don't support it, but others are considering it (FreeBSD), maybe
> by summer...
>
> For this task, I see I should introduce the MSG_FASTOPEN flag to the calls,
> moving from connect() to sendmsg()/sendto(). Should this become a default or
> should it be configurable? It sounds like the kind of thing that could
> live in your .wgetrc, but I honestly don't find any reason to force
> conventional TCP. The fallback to conventional TCP should just happen
> automatically if the remote server doesn't support TFO.

I think we would like TFO to be the default option when it is
supported by the server. The first thing Wget needs to do is identify
whether the server supports TFO and then calibrate the remaining requests
accordingly. You're right though, there's no reason to force
conventional TCP if both ends support TFO. You must also try to
account for the situation where Wget is interacting with a proxy that
supports TFO, but the actual end of the connection doesn't. These are
corner cases and shouldn't be your major focus right now, but it helps
to keep them in mind.
>
> I would love to hear your ideas and comments to improve upon my proposal
> draft. In the meantime, I will start reading the codebase and try fixing
> small bugs, as already suggested in the list.

I'm assuming you've seen the GitHub wiki page for GSoC '15? You've
started out on the right track towards your proposal. Eventually it
will require a more detailed discussion and a timeline. But we will
work on those things later.
>
> Many thanks in advance,
>
> Laura



-- 
Thanking You,
Darshit Shah



[Bug-wget] [PATCH] Bug 40426 follow-up

2015-03-12 Thread Ander Juaristi

This is my first attempt to fix 40426. I'm expecting to continue to work on it 
until it's fixed, unless someone has objections.

Source of the problem:

When -r and -O - are specified together, Wget hangs. Well, it really doesn't 
hang, it's actually waiting for input. When Wget downloads a file it saves it 
to the disk first and then reads it again from there to parse it and get more 
URLs, if any. But when '-O -' was specified, Wget reads from stdin. This is 
done in wget_read_file(). As a funny game, when this happens, type some HTML, 
and then hit Ctrl-D or any other key sequence equivalent to EOF, and you'll 
effectively trick Wget into inserting arbitrary HTML.

Solution:

A patch was already proposed that simply disallowed setting '-r' and '-O -' 
together. But that would void a clause in the documentation that states that 
"wget -O file http://foo" is intended to work like 
"wget -O - http://foo > file". So, my approach is to maintain that.

Initially I thought of redirecting stdout to stdin, but that would be an ugly 
hack that would probably make things difficult in the future. What's more, 
there are options that rely on stdin, such as '--input-file=-'.

So, what I did is write to a regular file, /tmp/.wget.stdout, instead of to 
stdout when '-O -' is passed, and dump everything at the end. This, apart from 
fixing the bug, maintains the documented behavior.
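
The "buffer in a regular file, dump at the end" idea can be sketched in 
isolation as follows. This is not the code from the patch: dump_stream is a 
hypothetical helper, and the usage note assumes an anonymous tmpfile() rather 
than the fixed /tmp/.wget.stdout path.

```c
#include <stdio.h>

/* Hypothetical helper: copy everything written so far to SRC (a seekable
   read/write stream) into DST.  Returns 0 on success, -1 on error.  */
static int
dump_stream (FILE *src, FILE *dst)
{
  char buf[4096];
  size_t n;

  /* Flush pending writes and go back to the start before reading.  */
  if (fflush (src) != 0 || fseek (src, 0L, SEEK_SET) != 0)
    return -1;
  while ((n = fread (buf, 1, sizeof buf, src)) > 0)
    if (fwrite (buf, 1, n, dst) != n)
      return -1;
  return ferror (src) ? -1 : 0;
}
```

With this, the download would be written to a stream obtained from tmpfile(), 
parsed from there, and finished with dump_stream (tmp, stdout). A tmpfile() 
stream has no name, though, so code that reopens the file by path would need 
a different arrangement.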

Although this is a good workaround for the short term, IMO the best approach is 
to keep everything in memory and write stuff out in the end. Something similar 
was already reported at 20714.

I don't expect this patch to be definitive. This is just a follow-up to my work. 
I'm looking forward to your feedback as I surely haven't taken into account all 
the consequences.

So far, the following are still to be done:
  - /tmp/.wget.stdout won't work on Windows, so port it.
  - FOPEN_OPT_ARGS is not taken into account.
  - I only took into account HTTP. The same workaround should be applied to
    FTP too.

Here goes:

diff --git a/src/http.c b/src/http.c
index b7020ef..bc7c1e9 100644
--- a/src/http.c
+++ b/src/http.c
@@ -3080,7 +3080,7 @@ http_loop (struct url *u, struct url *original_url, char **newloc,
 
   /* Set LOCAL_FILE parameter. */
   if (local_file && opt.output_document)
-*local_file = HYPHENP (opt.output_document) ? NULL : xstrdup (opt.output_document);
+*local_file = HYPHENP (opt.output_document) ? xstrdup (TMP_OUTFILE) : xstrdup (opt.output_document);
 
   /* Reset NEWLOC parameter. */
   *newloc = NULL;
@@ -3101,7 +3101,7 @@ http_loop (struct url *u, struct url *original_url, char **newloc,
 
   if (opt.output_document)
 {
-  hstat.local_file = xstrdup (opt.output_document);
+  hstat.local_file = HYPHENP (opt.output_document) ? xstrdup (TMP_OUTFILE) : xstrdup (opt.output_document);
   got_name = true;
 }
   else if (!opt.content_disposition)
diff --git a/src/main.c b/src/main.c
index b23967b..1bf7565 100644
--- a/src/main.c
+++ b/src/main.c
@@ -1582,7 +1582,17 @@ for details.\n\n"));
 #ifdef WINDOWS
   _setmode (_fileno (stdout), _O_BINARY);
 #endif
-  output_stream = stdout;
+  // TODO We should take care of FOPEN_OPT_ARGS
+  output_stream = fopen (TMP_OUTFILE, "wb+");
+  if (output_stream == NULL)
+{
+  perror (TMP_OUTFILE);
+  exit (WGET_EXIT_GENERIC_ERROR);
+}
+  /*
+   * We know it's a regular file. No need to check.
+   */
+  output_stream_regular = true;
 }
   else
 {
@@ -1762,8 +1772,24 @@ outputting to a regular file.\n"));
   if (opt.convert_links && !opt.delete_after)
 convert_all_links ();
 
+  /* If output file is stdout (`-O -' was specified), obey.
+ Print stuff out.  */
+  if (HYPHENP (opt.output_document) && output_stream && (total_downloaded_bytes > 0))
+{
+  struct file_memory *fm = wget_read_file (TMP_OUTFILE);
+  if (fm)
+{
+  write (fileno (stdout), fm->content, fm->length);
+  wget_read_file_free (fm);
+}
+}
+
   cleanup ();
 
+  /* Delete the temporary output file  */
+  if (HYPHENP (opt.output_document) && unlink (TMP_OUTFILE))
+logprintf (LOG_NOTQUIET, "Temporary output file was not deleted. Should be done by hand. %s\n", strerror (errno));
+
   exit (get_exit_status ());
 }
 
diff --git a/src/wget.h b/src/wget.h
index 8d2b0f1..9d40255 100644
--- a/src/wget.h
+++ b/src/wget.h
@@ -126,6 +126,12 @@ as that of the covered work.  */
 
 #define DEBUGP(args) do { IF_DEBUG { debug_logprintf args; } } while (0)
 
+/* If we're outputting to stdout ('-O -' was specified), we'd rather
+   write to a temporary file and output everything in the end, since
+   Wget still needs the downloaded files to be regular in order to
+   parse them.  */
+#define TMP_OUTFILE "/tmp/.wget.stdout"
+
 /* Pick an integer type large enough for file sizes, content lengths,
and such.  Because tod

[Bug-wget] Working on Wget for GSoC 2015

2015-03-12 Thread Ameya Marathe
Hi,

I'm Ameya, a final year engineering student at College of Engineering,
Pune, India. I started looking forward to contributing to some GNU project
the day I completed the Linux From Scratch project. Wget is of special
interest to me since I've written some shell scripts to download files from
websites whose filenames follow some logic (like college roll numbers :P ).
I would like to work on wget as my GSoC project. To get familiar with the
source, I've started looking into bug #35011 after a quick build & test of
wget code. Will try to resolve it asap.
Looking forward to a fruitful and learning experience.

Ameya.
(ameyamarathe18)


[Bug-wget] [Patch] fix bug #39175 Header value length limited with 256

2015-03-12 Thread Miquel Llobet
Increased the header buffer to 8 KB, as there are no limits to the size of
field names, values or the headers themselves. While the current value is big
enough, other projects such as Apache [1] or nginx have limits of 4-8 KB.

If we want to allow for arbitrary size headers we should use
resp_header_strdup instead of resp_header_copy, but this new value should
be enough.
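
For comparison, the arbitrary-size alternative amounts to copying each
header value into heap memory sized to fit, instead of into a fixed
array. Below is a minimal sketch of that idea; header_value_dup is a
hypothetical stand-in and is not the actual resp_header_strdup from
http.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a freshly allocated, NUL-terminated copy
   of the header value in [BEG, END).  The caller frees the result.
   Returns NULL if memory cannot be allocated.  */
static char *
header_value_dup (const char *beg, const char *end)
{
  size_t len = (size_t) (end - beg);
  char *copy = malloc (len + 1);
  if (copy == NULL)
    return NULL;
  memcpy (copy, beg, len);
  copy[len] = '\0';
  return copy;
}
```

The trade-off is one allocation and one free per header read, versus a
fixed buffer that silently truncates anything longer than its size.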

--- src/http.c.orig 2015-03-12 21:50:03.0 +0100
+++ src/http.c 2015-03-12 21:04:08.0 +0100
@@ -1695,7 +1695,7 @@

   char *head;
   struct response *resp;
-  char hdrval[512];
+  char hdrval[8190];
   char *message;

   /* Declare WARC variables. */

[1]: https://httpd.apache.org/docs/2.2/mod/core.html#limitrequestfieldsize

Miquel Llobet


[Bug-wget] [Patch] fix bug #44516, -o- log to stdout

2015-03-12 Thread Miquel Llobet
Wget now correctly interprets -o- as logging to stdout instead of to a
file named '-'.
I just check for a hyphen in log_init; I didn't see any caveats to this.

--- src/log.c.origin 2015-03-13 01:32:27.0 +0100
+++ src/log.c 2015-03-13 01:44:31.0 +0100
@@ -598,11 +598,18 @@
 {
   if (file)
 {
-  logfp = fopen (file, appendp ? "a" : "w");
-  if (!logfp)
+  if (HYPHENP (file))
 {
-  fprintf (stderr, "%s: %s: %s\n", exec_name, file, strerror (errno));
-  exit (WGET_EXIT_GENERIC_ERROR);
+logfp = stdout;
+}
+  else
+{
+  logfp = fopen (file, appendp ? "a" : "w");
+  if (!logfp)
+{
+  fprintf (stderr, "%s: %s: %s\n", exec_name, file, strerror (errno));
+  exit (WGET_EXIT_GENERIC_ERROR);
+}
 }
 }
   else

Miquel Llobet


Re: [Bug-wget] [Patch] fix bug #44516, -o- log to stdout

2015-03-12 Thread Miquel Llobet
Removed braces from the second if statement, as per GNU's coding standards.

--- src/log.c.origin 2015-03-13 01:32:27.0 +0100
+++ src/log.c 2015-03-13 02:28:25.0 +0100
@@ -598,11 +598,16 @@
 {
   if (file)
 {
-  logfp = fopen (file, appendp ? "a" : "w");
-  if (!logfp)
+  if (HYPHENP (file))
+logfp = stdout;
+  else
 {
-  fprintf (stderr, "%s: %s: %s\n", exec_name, file, strerror (errno));
-  exit (WGET_EXIT_GENERIC_ERROR);
+  logfp = fopen (file, appendp ? "a" : "w");
+  if (!logfp)
+{
+  fprintf (stderr, "%s: %s: %s\n", exec_name, file, strerror (errno));
+  exit (WGET_EXIT_GENERIC_ERROR);
+}
 }
 }
   else


Miquel Llobet





[Bug-wget] [Patch] fix bug #40426, wget hangs with -r and -O -

2015-03-12 Thread Miquel Llobet
When wget is called with -r or -p it will look for resource tags in the
output file, and since -O- redirects to stdout, the program hangs, waiting
for input. The same happens with pipes or FIFO files.

My proposed solution is to disallow calling wget with '-p' or '-r' and
output redirection to either stdout or a FIFO file.
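
As a standalone illustration of that check, here is a minimal sketch. It
differs from the patch below in two ways that are my own assumptions about
what is safer: it tests the file type with the POSIX S_ISFIFO macro instead
of masking st_mode with S_IFIFO, and it checks the return value of stat so
that a not-yet-existing output file is never judged from an uninitialized
struct. output_is_stdout_or_fifo is a hypothetical name.

```c
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: return 1 if PATH means stdout ("-") or names an
   existing FIFO, 0 otherwise.  */
static int
output_is_stdout_or_fifo (const char *path)
{
  struct stat st;

  if (strcmp (path, "-") == 0)
    return 1;
  /* If stat fails the file most likely does not exist yet, in which case
     it will be created as a regular file.  */
  if (stat (path, &st) != 0)
    return 0;
  return S_ISFIFO (st.st_mode) ? 1 : 0;
}
```

The caller would then refuse to continue when opt.recursive or
opt.page_requisites is set and this returns 1.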

--- src/main.c.origin 2015-03-13 03:59:39.0 +0100
+++ src/main.c 2015-03-13 04:10:59.0 +0100
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "exits.h"
 #include "utils.h"
@@ -1335,6 +1336,18 @@
  opt.output_document);
   exit (WGET_EXIT_GENERIC_ERROR);
}
+  if (opt.recursive || opt.page_requisites)
+  {
+struct stat status;
+stat (opt.output_document, &status);
+if (HYPHENP (opt.output_document) || status.st_mode & S_IFIFO)
+  {
+fprintf (stderr, _("Can't do \
+recursive download with output redirected to stdout, pipe or FIFO file\n"));
+print_usage (1);
+exit (WGET_EXIT_GENERIC_ERROR);
+  }
+  }
 }

   if (opt.warc_filename != 0)


Miquel Llobet