[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file

2015-03-31 Thread Tim Ruehsen
Follow-up Comment #3, bug #44674 (project wget):

Just open a second console and start
  nc -l -p <port>

Start wget in your first console
  http_proxy=localhost:<port> wget http://www.example.com

nc will now dump everything that Wget sends. You could even generate an answer
(e.g. with copy & paste).

Wget just adds a Proxy-Connection header which will not be sent on non-proxy
connections.
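
If nc is not available on the system, a rough equivalent of `nc -l -p <port>`
can be sketched in a few lines of C (the loopback address and the port 8080
below are arbitrary choices for this sketch, not anything Wget requires):

/* dump_request.c -- rough stand-in for `nc -l -p <port>'.
   Listens on 127.0.0.1:8080, accepts one connection and writes whatever
   the client sends to stdout.  It never sends a reply, so the client
   will eventually time out; that is fine for just looking at the request. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main (void)
{
  int srv = socket (AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in addr;

  memset (&addr, 0, sizeof addr);
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  addr.sin_port = htons (8080);   /* assumed port, match it in http_proxy */

  if (srv < 0
      || bind (srv, (struct sockaddr *) &addr, sizeof addr) < 0
      || listen (srv, 1) < 0)
    {
      perror ("listen");
      return 1;
    }

  int cli = accept (srv, NULL, NULL);   /* wait for Wget to connect */
  if (cli < 0)
    {
      perror ("accept");
      return 1;
    }

  /* Copy the raw HTTP request to stdout, byte for byte. */
  char buf[4096];
  ssize_t n;
  while ((n = read (cli, buf, sizeof buf)) > 0)
    fwrite (buf, 1, (size_t) n, stdout);

  close (cli);
  close (srv);
  return 0;
}

Compile and run it, then start Wget with http_proxy=localhost:8080 as above.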


___

Reply to this item at:

  http://savannah.gnu.org/bugs/?44674

___
  Message sent via/by Savannah
  http://savannah.gnu.org/




[Bug-wget] Wget 1.16.3 v. VMS

2015-03-31 Thread Steven M. Schweda
   I don't know how long this has been true, but I recently noticed that
some recursive HTTP fetch operations were failing (on VMS) because the
URLs contained a "?", and the code in src/url.c (et al.) treated that
as a problem in file names only on Windows.  For example (1.16.3):

ALP $ wgo --user-agent=mozilla "http://www.google.com/search?source=hp&q=fred"
--2015-03-31 23:52:13--  http://www.google.com/search?source=hp&q=fred
Resolving www.google.com... 74.125.198.99, 74.125.198.103,
74.125.198.104, ...
Connecting to www.google.com|74.125.198.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
search!source=hp&q=fred: i/o error

Cannot write to 'search!source=hp&q=fred' (error 0).

(Interestingly, in 1.16.1, that last error message was more informative:
Cannot write to 'search!source=hp&q=fred' (i/o error).
but I haven't investigated.)

   Adding a VMS option to the restrict_files_os stuff, and treating VMS
like Windows for FN_QUERY_SEP and FN_QUERY_SEP_STR seems to solve the
problem (at least on an ODS5 volume):

ALP $ wgx --user-agent=mozilla "http://www.google.com/search?source=hp&q=fred"
--2015-03-31 23:39:35--  http://www.google.com/search?source=hp&q=fred
Resolving www.google.com... 74.125.198.147, 74.125.198.99, 74.125.198.103, ...
Connecting to www.google.com|74.125.198.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'search@source=hp&q=fred'

search@source=hp&q= [ <=> ]  37.78K   174KB/s   in 0.2s

2015-03-31 23:39:36 (174 KB/s) - 'search@source=hp&q=fred' saved [38691]

ALP $ dire search*
[...]
search^@source^=hp^&q^=fred.;1


   I haven't looked at the documentation, but the following code
changes seem plausible to me:

diff -ru wget-1_16_3a_vms/src/init.c wget-1_16_3/src/init.c
--- wget-1_16_3a_vms/src/init.c 2015-01-30 17:25:57 -0600
+++ wget-1_16_3/src/init.c  2015-03-31 22:46:59 -0500
@@ -397,6 +397,8 @@
   /* The default for file name restriction defaults to the OS type. */
 #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
   opt.restrict_files_os = restrict_windows;
+#elif defined(__VMS)
+  opt.restrict_files_os = restrict_vms;
 #else
   opt.restrict_files_os = restrict_unix;
 #endif
@@ -1481,6 +1483,8 @@
 
   if (VAL_IS (unix))
 restrict_os = restrict_unix;
+  else if (VAL_IS (vms))
+restrict_os = restrict_vms;
   else if (VAL_IS (windows))
 restrict_os = restrict_windows;
   else if (VAL_IS (lowercase))
@@ -1495,7 +1499,7 @@
 {
   fprintf (stderr, _("\
 %s: %s: Invalid restriction %s,\n\
-use [unix|windows],[lowercase|uppercase],[nocontrol],[ascii].\n"),
+use [unix|vms|windows],[lowercase|uppercase],[nocontrol],[ascii].\n"),
exec_name, com, quote (val));
   return false;
 }
diff -ru wget-1_16_3a_vms/src/options.h wget-1_16_3/src/options.h
--- wget-1_16_3a_vms/src/options.h  2015-01-30 17:25:57 -0600
+++ wget-1_16_3/src/options.h   2015-03-31 22:37:59 -0500
@@ -239,6 +239,7 @@
 
   enum {
 restrict_unix,
+restrict_vms,
 restrict_windows
   } restrict_files_os;  /* file name restriction ruleset. */
   bool restrict_files_ctrl; /* non-zero if control chars in URLs
diff -ru wget-1_16_3a_vms/src/url.c wget-1_16_3/src/url.c
--- wget-1_16_3a_vms/src/url.c  2015-02-23 09:10:22 -0600
+++ wget-1_16_3/src/url.c   2015-03-31 23:09:48 -0500
@@ -1328,8 +1328,9 @@
 
 enum {
   filechr_not_unix= 1,  /* unusable on Unix, / and \0 */
-  filechr_not_windows = 2,  /* unusable on Windows, one of \|/<>?:*" */
-  filechr_control = 4   /* a control character, e.g. 0-31 */
+  filechr_not_vms = 2,  /* unusable on VMS (ODS5), 0x00-0x1F * ? */
+  filechr_not_windows = 4,  /* unusable on Windows, one of \|/<>?:*" */
+  filechr_control = 8   /* a control character, e.g. 0-31 */
 };
 
 #define FILE_CHAR_TEST(c, mask) \
@@ -1338,11 +1339,14 @@
 
 /* Shorthands for the table: */
 #define U filechr_not_unix
+#define V filechr_not_vms
 #define W filechr_not_windows
 #define C filechr_control
 
+#define UVWC U|V|W|C
 #define UW U|W
-#define UWC U|W|C
+#define VC V|C
+#define VW V|W
 
 /* Table of characters unsafe under various conditions (see above).
 
@@ -1353,22 +1357,22 @@
 
 static const unsigned char filechr_table[256] =
 {
-UWC,  C,  C,  C,   C,  C,  C,  C,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* BS  HT  LF  VT   FF  CR  SO  SI  */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* DLE DC1 DC2 DC3  DC4 NAK SYN ETB */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* CAN EM  SUB ESC  FS  GS  RS  US  */
-  0,  0,  W,  0,   0,  0,  0,  0,   /* SP  !   "   #    $   %   &   '   */
-  0,  0,  W,  0,   0,  0,  0, UW,   /* (   )   *   +    ,   -   .   /   */
-  0,  0,  0,  0,   0,  0,  0,  0,   /* 0   1   2   3    4   5   6   7   */
-  0,  0,  W,  0,   W,  0,  W,  W,   /* 8   9   :   ;    <   =   >   ?   */
- 

[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file

2015-03-31 Thread INVALID.NOREPLY
Follow-up Comment #6, bug #44674 (project wget):

Also --debug doesn't show full FORM bodies.





[Bug-wget] [Patch] fix bug #44628 not honoring RFC 6266 in --content-disposition

2015-03-31 Thread Miquel Llobet
The problem was that parse_content_disposition treated the values
"filename" and "filename*" as the same and concatenated both values. RFC
6266 states that if both are present, "filename*" should take precedence.

Now, the patch is tricky, because in this case:

attachment; filename*=A.ext; filename*0=hello; filename*1=world.ext,

the final filename should be A.ext.

But modify_param_name in http.c truncates any name containing '*' at the
first occurrence of '*', which makes it hard to differentiate the two cases.
Instead of fixing that and dealing with lots of edge cases, I just check for
the name being "filename", ending with '*', and the next character not being
a digit. Do you think this is fine, or is it better to get modify_param_name
to do its job properly?
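
To make the intended precedence explicit, here is a minimal sketch of the
rule (the struct and the helpers below are invented for illustration; they
are not the names used in http.c):

#include <stdio.h>
#include <string.h>

struct param { const char *name; const char *value; };

static const char *
find_param (const struct param *p, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (p[i].name, name) == 0)
      return p[i].value;
  return NULL;
}

/* RFC 6266: when both are present, the extended "filename*" parameter
   (assumed to be already decoded here) wins over plain "filename".  */
static const char *
choose_filename (const struct param *p, size_t n)
{
  const char *ext = find_param (p, n, "filename*");
  return ext ? ext : find_param (p, n, "filename");
}

int
main (void)
{
  struct param p[] = {
    { "filename",  "plain.ext" },
    { "filename*", "A.ext"     },
  };
  printf ("%s\n", choose_filename (p, 2));   /* prints A.ext */
  return 0;
}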

I also added more unit tests to account for the edge cases.

Cheers,
Miquel


0001-Fixed-44628-honoring-RFC-6266-content-disposition.patch
Description: Binary data


Re: [Bug-wget] gethttp cleanup

2015-03-31 Thread Giuseppe Scrivano
Hi Hubert,

Hubert Tarasiuk hubert.taras...@gmail.com writes:

 I have identified a potential drawback with the function
 `establish_connection`.

 [Patch #3]
 On error, it would free the `req` variable, but it never zeroed
 `*req_ref`. As a matter of fact, it only wrote to `req_ref` on
 successful exit (when it did not actually change).
 I suggest that this function never frees the `req` variable, because it
 never allocates it. (As opposed to `connreq`.)
 Instead, the caller (`gethttp`) releases the `req` when an error occurs.

 [Patch #4]
 My second idea is to change semantics of `resp_free` and `request_free`,
 so that they are similar to `xfree`, i.e.:
 1) it is safe to call them with a NULL pointer
 2) they ensure that the pointer is set to NULL after the call
 In order to achieve (2), a pointer to a pointer has to be passed. 
 (Please note that this patch depends on the previous one.)

thanks for your patches, I agree with you and the code looks nicer now.

I am going to push them tomorrow if nobody complains before then (with the
following fix amended to your last patch).

diff --git a/src/http.c b/src/http.c
index 5338d20..9994d13 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2506,7 +2506,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
using_ssl, inhibit_keep_alive, sock);
 if (err != RETROK)
   {
-request_free (req);
+request_free (&req);
 return err;
   }
   }
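
For anyone following along, the xfree-like semantics Hubert describes boil
down to this pattern (an illustrative sketch with made-up names, not the
actual wget definitions):

#include <stdlib.h>

struct request { char *method; /* ... */ };

/* NULL-safe free that also resets the caller's pointer, so a second
   call becomes a no-op instead of a double free.  */
static void
request_free_sketch (struct request **req_ref)
{
  struct request *req = *req_ref;
  if (!req)
    return;                  /* (1) safe to call with a NULL pointer */
  free (req->method);
  free (req);
  *req_ref = NULL;           /* (2) caller's pointer is zeroed */
}

int
main (void)
{
  struct request *req = calloc (1, sizeof *req);
  request_free_sketch (&req);   /* frees and sets req to NULL */
  request_free_sketch (&req);   /* now a harmless no-op */
  return 0;
}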


Thanks for your contribution!

Giuseppe



Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Tim Rühsen
Hi Steven,

On Tuesday, 31 March 2015 at 18:11:58, Stephen Wells wrote:
 Dear all - I am currently trying to use wget to obtain mp3 files from the
 Google Translate TTS system. In principle this can be done using:
 
 wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
 
 where TL is a two-letter language code (en, fr, de, and so on).
 
 However I am meeting a serious error when I try to send Russian strings
 (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
 Cygwin) and the file system will display the cyrillic strings no problem.
 If I provide a command like this:
 
 http://translate.google.com/translate_tts?tl=ru&q=мазать
 
 wget incorrectly processes the Cyrillic characters _before_ sending the
 http request, so what it actually requests is:
 
 http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This seems to be the correct behavior of a web client.
The URL in the GET request is transmitted UTF-8 encoded and percent escaping
is performed for chars > 127 (not mentioning control chars here).
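
As a rough illustration of that escaping step (a simplified sketch, not
wget's actual url_escape code): every byte outside the unreserved ASCII set,
which includes every byte of a UTF-8 encoded Cyrillic letter, comes out as
%XX.

#include <ctype.h>
#include <stdio.h>

static void
print_percent_escaped (const char *s)
{
  for (const unsigned char *p = (const unsigned char *) s; *p; p++)
    {
      if (isalnum (*p) || *p == '-' || *p == '.' || *p == '_' || *p == '~')
        putchar (*p);
      else
        printf ("%%%02X", *p);   /* escape everything else, incl. > 127 */
    }
  putchar ('\n');
}

int
main (void)
{
  /* UTF-8 bytes of "мазать" */
  print_percent_escaped ("\xD0\xBC\xD0\xB0\xD0\xB7\xD0\xB0\xD1\x82\xD1\x8C");
  /* prints %D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C */
  return 0;
}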

 This of course produces a string of gibberish in the resulting mp3 file!

This is something different. If you are talking about the file name, well
there is --restrict-file-names=nocontrol. Did you give it a try?

 Is there any way to make wget actually send the string it is given, instead
 of mangling it on the way out? This is really blocking me.

From what you write, I am unsure if you are talking about the resulting file 
name or about HTTP URL encoding in a GET request.

Regards, Tim




[Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
Dear all - I am currently trying to use wget to obtain mp3 files from the
Google Translate TTS system. In principle this can be done using:

wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"

where TL is a two-letter language code (en, fr, de, and so on).

However, I am running into a serious error when I try to send Russian strings
(tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
Cygwin) and the file system displays the Cyrillic strings with no problem.
If I provide a command like this:

http://translate.google.com/translate_tts?tl=ru&q=мазать

wget incorrectly processes the Cyrillic characters _before_ sending the
http request, so what it actually requests is:


http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This of course produces a string of gibberish in the resulting mp3 file!

Is there any way to make wget actually send the string it is given, instead
of mangling it on the way out? This is really blocking me.

Cheers,
Stephen


[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file

2015-03-31 Thread INVALID.NOREPLY
Follow-up Comment #5, bug #44674 (project wget):

Tim: OK, but please could somebody make sure that example is given near the
--debug option section of the man page.

Also, it would be good if there were a built-in way to do it, for cases where
it is inconvenient to install other programs or to do the extra input/output
work of starting them and waiting on them on a given system.

Anonymous: the --debug part of the man page doesn't say clearly what it will
give, and --debug might not be compiled in. In fact --debug gives more than
just the request, and it requires actually attempting the request, with no
--dry-run safety mechanism, before going out onto the net...





[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file

2015-03-31 Thread anonymous
Follow-up Comment #4, bug #44674 (project wget):

You can use the --debug flag to show the HTTP request and response headers,
including when the traffic is encrypted with SSL.





Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
Hi Tim,

Sorry for the ambiguity. To be more specific, the file name is fine: in the
shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 .
The audio within the file consists of the Google robot voice reading the
string of percent-escaped characters literally, not reading the Russian
word.

I will try Random Coder's suggestion of a more complete user agent string -
 apparently http://whatsmyuseragent.com/ is a handy way to find out what
your browser claims to be :)

On Tue, Mar 31, 2015 at 9:50 PM, Tim Rühsen tim.rueh...@gmx.de wrote:

 [...]



Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
THANK YOU Random Coder! That did the trick. Apparently my earlier attempts
were unsuccessful because the problem I was trying to solve was not the
problem I actually had :)

Specifically I went to whatsmyuseragent.com and my browser id'd as
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2272.101 Safari/537.36". I put that, in quotes, instead of
just "Mozilla" as the argument of the -U option, and now I get back an mp3
file with proper Russian audio in it. Much victory.


Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Ángel González

On 01/04/15 00:16, Stephen Wells wrote:

[...]


I remember Google had a parameter for the encoding. It may be worth
explicitly stating that it's UTF-8; otherwise Google may be falling back to
something based on the User-Agent.





Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Random Coder
On Tue, Mar 31, 2015 at 10:11 AM, Stephen Wells sawells.2...@gmail.com wrote:
 Dear all - I am currently trying to use wget to obtain mp3 files from the
 Google Translate TTS system. In principle this can be done using:

 wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"

 ...

 http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

 This of course produces a string of gibberish in the resulting mp3 file!


That URL is correct; it's what you'll see a browser send across the
wire for the same string.  Google is producing gibberish because of
some User-Agent sniffing that they appear to be doing.

If you change the user agent to something that's more complete, like
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36" instead of just "Mozilla", it should
work correctly.