[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file
Follow-up Comment #3, bug #44674 (project wget):

Just open a second console and start

  nc -l -p

Start wget in your first console

  http_proxy=localhost: wget http://www.example.com

nc will now dump everything that Wget sends. You could even generate an answer (e.g. with copy and paste).

Wget just adds a Proxy-Connection header which will not be sent on non-proxy connections.

___
Reply to this item at: http://savannah.gnu.org/bugs/?44674
___
Message sent via/by Savannah http://savannah.gnu.org/
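For systems where installing nc is inconvenient, the same idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not something Wget provides: the function names are hypothetical, and it captures only the request line and headers, not a request body.

```python
import socket


def read_request_head(conn, max_bytes=65536):
    """Read from a connected socket until the blank line that ends the HTTP headers.

    Stops early if the peer closes the connection or max_bytes is reached.
    """
    data = b""
    while b"\r\n\r\n" not in data and len(data) < max_bytes:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed the connection
            break
        data += chunk
    return data


def dump_one_request(port, host="127.0.0.1"):
    """Listen on host:port, accept one connection and return what it sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen(1)
        conn, _addr = listener.accept()
        with conn:
            return read_request_head(conn)
```

Calling `print(dump_one_request(8080).decode("latin-1"))` in one console and pointing `http_proxy` at the same port in the other should print the request. The port 8080 is an arbitrary choice for the example. Because the sketch closes the connection without answering, Wget will report a failed request.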
[Bug-wget] Wget 1.16.3 v. VMS
I don't know how long this has been true, but I recently noticed that some recursive HTTP fetch operations were failing (on VMS) because the URLs contained a "?", and the code in src/url.c (et al.) thought that this was a problem in file names only on Windows. For example (1.16.3):

ALP $ wgo --user-agent=mozilla "http://www.google.com/search?source=hp&q=fred"
--2015-03-31 23:52:13-- http://www.google.com/search?source=hp&q=fred
Resolving www.google.com... 74.125.198.99, 74.125.198.103, 74.125.198.104, ...
Connecting to www.google.com|74.125.198.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
search!source=hp&q=fred: i/o error
Cannot write to 'search!source=hp&q=fred' (error 0).

(Interestingly, in 1.16.1, that last error message was more informative:

Cannot write to 'search!source=hp&q=fred' (i/o error).

but I haven't investigated.)

Adding a VMS option to the restrict_files_os stuff, and treating VMS like Windows for FN_QUERY_SEP and FN_QUERY_SEP_STR seems to solve the problem (at least on an ODS5 volume):

ALP $ wgx --user-agent=mozilla "http://www.google.com/search?source=hp&q=fred"
--2015-03-31 23:39:35-- http://www.google.com/search?source=hp&q=fred
Resolving www.google.com... 74.125.198.147, 74.125.198.99, 74.125.198.103, ...
Connecting to www.google.com|74.125.198.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'search@source=hp&q=fred'

search@source=hp&q= [ = ] 37.78K 174KB/s in 0.2s

2015-03-31 23:39:36 (174 KB/s) - 'search@source=hp&q=fred' saved [38691]

ALP $ dire search*
[...]
search^@source^=hp^&q^=fred.;1

I haven't looked at the documentation, but the following code changes seem plausible to me:

diff -ru wget-1_16_3a_vms/src/init.c wget-1_16_3/src/init.c
--- wget-1_16_3a_vms/src/init.c 2015-01-30 17:25:57 -0600
+++ wget-1_16_3/src/init.c 2015-03-31 22:46:59 -0500
@@ -397,6 +397,8 @@
   /* The default for file name restriction defaults to the OS type. */
 #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
   opt.restrict_files_os = restrict_windows;
+#elif defined(__VMS)
+  opt.restrict_files_os = restrict_vms;
 #else
   opt.restrict_files_os = restrict_unix;
 #endif
@@ -1481,6 +1483,8 @@
       if (VAL_IS (unix))
         restrict_os = restrict_unix;
+      else if (VAL_IS (vms))
+        restrict_os = restrict_vms;
       else if (VAL_IS (windows))
         restrict_os = restrict_windows;
       else if (VAL_IS (lowercase))
@@ -1495,7 +1499,7 @@
         {
           fprintf (stderr, _("\
 %s: %s: Invalid restriction %s,\n\
-use [unix|windows],[lowercase|uppercase],[nocontrol],[ascii].\n"),
+use [unix|vms|windows],[lowercase|uppercase],[nocontrol],[ascii].\n"),
                    exec_name, com, quote (val));
           return false;
         }
diff -ru wget-1_16_3a_vms/src/options.h wget-1_16_3/src/options.h
--- wget-1_16_3a_vms/src/options.h 2015-01-30 17:25:57 -0600
+++ wget-1_16_3/src/options.h 2015-03-31 22:37:59 -0500
@@ -239,6 +239,7 @@
   enum {
     restrict_unix,
+    restrict_vms,
     restrict_windows
   } restrict_files_os;          /* file name restriction ruleset. */
   bool restrict_files_ctrl;     /* non-zero if control chars in URLs
diff -ru wget-1_16_3a_vms/src/url.c wget-1_16_3/src/url.c
--- wget-1_16_3a_vms/src/url.c 2015-02-23 09:10:22 -0600
+++ wget-1_16_3/src/url.c 2015-03-31 23:09:48 -0500
@@ -1328,8 +1328,9 @@
 enum {
   filechr_not_unix    = 1,  /* unusable on Unix, / and \0 */
-  filechr_not_windows = 2,  /* unusable on Windows, one of \|/<>?:*" */
-  filechr_control     = 4   /* a control character, e.g. 0-31 */
+  filechr_not_vms     = 2,  /* unusable on VMS (ODS5), 0x00-0x1F * ? */
+  filechr_not_windows = 4,  /* unusable on Windows, one of \|/<>?:*" */
+  filechr_control     = 8   /* a control character, e.g. 0-31 */
 };
 #define FILE_CHAR_TEST(c, mask) \
@@ -1338,11 +1339,14 @@
 /* Shorthands for the table: */
 #define U filechr_not_unix
+#define V filechr_not_vms
 #define W filechr_not_windows
 #define C filechr_control
+#define UVWC U|V|W|C
 #define UW U|W
-#define UWC U|W|C
+#define VC V|C
+#define VW V|W
 /* Table of characters unsafe under various conditions (see above).
@@ -1353,22 +1357,22 @@
 static const unsigned char filechr_table[256] =
 {
-UWC,  C,  C,  C,   C,  C,  C,  C,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* BS  HT  LF  VT   FF  CR  SO  SI  */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* DLE DC1 DC2 DC3  DC4 NAK SYN ETB */
-  C,  C,  C,  C,   C,  C,  C,  C,   /* CAN EM  SUB ESC  FS  GS  RS  US  */
-  0,  0,  W,  0,   0,  0,  0,  0,   /* SP  !   "   #    $   %   &   '   */
-  0,  0,  W,  0,   0,  0,  0, UW,   /* (   )   *   +    ,   -   .   /   */
-  0,  0,  0,  0,   0,  0,  0,  0,   /* 0   1   2   3    4   5   6   7   */
-  0,  0,  W,  0,   W,  0,  W,  W,   /* 8   9   :   ;    <   =   >   ?   */
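The bit-flag table that the patch extends can be illustrated with a small Python sketch. It is a simplified model, not the Wget code: the flag values follow the patched enum, the VMS set follows the patch comment (0x00-0x1F, "*" and "?"), the Windows set follows the W entries visible in the table, and only control characters 0-31 are marked. The names `build_table` and `is_unsafe` are hypothetical.

```python
# Bit flags mirroring the enum in the quoted patch.
NOT_UNIX = 1
NOT_VMS = 2
NOT_WINDOWS = 4
CONTROL = 8


def build_table():
    """Return a 256-entry list; each entry is the OR of the flags for that byte."""
    table = [0] * 256
    for code in range(0x20):            # control characters 0x00-0x1F
        table[code] |= CONTROL | NOT_VMS
    for ch in "*?":                     # VMS (ODS5) set from the patch comment
        table[ord(ch)] |= NOT_VMS
    for ch in '\\|/<>?:*"':             # Windows set (assumed from the W entries)
        table[ord(ch)] |= NOT_WINDOWS
    table[ord("/")] |= NOT_UNIX
    table[0] |= NOT_UNIX | NOT_WINDOWS  # NUL is unusable everywhere (UVWC)
    return table


FILECHR_TABLE = build_table()


def is_unsafe(byte, mask):
    """True if the byte value carries any of the flags in mask."""
    return bool(FILECHR_TABLE[byte] & mask)
```

With this layout, supporting another operating system costs one more bit per entry, which is why the patch renumbers filechr_not_windows and filechr_control rather than adding a second table.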
[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file
Follow-up Comment #6, bug #44674 (project wget): Also --debug doesn't show full FORM bodies.
[Bug-wget] [Patch] fix bug #44628 not honoring RFC 6266 in --content-disposition
The problem was that parse_content_disposition treated the values filename and filename* as the same and concatenated both values. RFC 6266 states that if both are present, filename* should take precedence.

Now, the patch is tricky, because in this case:

attachment; filename*=A.ext; filename*0=hello; filename*1=world.ext

the final filename should be A.ext. But modify_param_name in http.c changes names containing '*' so that they end at the first occurrence of '*', which makes it hard to differentiate the two cases. Instead of fixing this and dealing with lots of edge cases, I just check that the name is filename, that it ends with '*', and that the next character is not a digit.

Do you think this is fine, or is it best to get modify_param_name to do its job better?

I also added more unit tests to account for the edge cases.

Cheers, Miquel

0001-Fixed-44628-honoring-RFC-6266-content-disposition.patch
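The precedence rule can be sketched as follows. This is a minimal Python model under stated assumptions, not the parse_content_disposition code: the header is assumed to be already split into (name, value) pairs, `pick_filename` is a hypothetical name, values are returned undecoded, and ranking continuations above the plain filename is a choice made for this sketch only.

```python
import re


def pick_filename(params):
    """Choose a file name from (name, value) pairs of a Content-Disposition header."""
    extended = None   # filename*   (preferred when present, per RFC 6266)
    plain = None      # filename
    parts = {}        # filename*0, filename*1, ... (continuations)
    for name, value in params:
        name = name.lower()
        if name == "filename*":
            extended = value
        elif name == "filename":
            plain = value
        else:
            # A digit right after the '*' marks a continuation, not filename*.
            match = re.fullmatch(r"filename\*(\d+)\*?", name)
            if match:
                parts[int(match.group(1))] = value
    if extended is not None:
        return extended
    if parts:
        return "".join(parts[i] for i in sorted(parts))
    return plain
```

For the header quoted above, `pick_filename([("filename*", "A.ext"), ("filename*0", "hello"), ("filename*1", "world.ext")])` selects the filename* value and ignores the continuations.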
Re: [Bug-wget] gethttp cleanup
Hi Hubert,

Hubert Tarasiuk hubert.taras...@gmail.com writes:

> I have identified a potential drawback with the function `establish_connection`. [Patch #3]
> On error, it would free the `req` variable, but it never zeroed `*req_ref`. As a matter of fact, it only wrote to `req_ref` on successful exit (when it did not actually change).
> I suggest that this function never frees the `req` variable, because it never allocates it. (As opposed to `connreq`.) Instead, the caller (`gethttp`) releases `req` when an error occurred.
>
> [Patch #4] My second idea is to change the semantics of `resp_free` and `request_free`, so that they are similar to `xfree`, i.e.:
> 1) it is safe to call them with a NULL pointer
> 2) they ensure that the pointer is set to NULL after the call
> In order to achieve (2), a pointer to a pointer has to be passed. (Please note that this patch depends on the previous one.)

Thanks for your patches. I agree with you, and the code looks nicer now. I am going to push them tomorrow if nobody complains before then (with the following fix amended to your last patch).

diff --git a/src/http.c b/src/http.c
index 5338d20..9994d13 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2506,7 +2506,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
                                   using_ssl, inhibit_keep_alive, sock);
       if (err != RETROK)
         {
-          request_free (req);
+          request_free (&req);
           return err;
         }
     }

Thanks for your contribution!

Giuseppe
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Hi Stephen,

On Tuesday, 31 March 2015, 18:11:58, Stephen Wells wrote:

> Dear all - I am currently trying to use wget to obtain mp3 files from the Google Translate TTS system. In principle this can be done using:
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
> where TL is a two-letter language code (en, fr, de and so on).
> However I am meeting a serious error when I try to send Russian strings (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under Cygwin) and the file system will display the Cyrillic strings no problem. If I provide a command like this:
> http://translate.google.com/translate_tts?tl=ru&q=мазать
> wget incorrectly processes the Cyrillic characters _before_ sending the http request, so what it actually requests is:
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This seems to be the correct behavior of a web client. The URL in the GET request is transmitted UTF-8 encoded, and percent escaping is performed for chars > 127 (not mentioning control chars here).

> This of course produces a string of gibberish in the resulting mp3 file!

This is something different. If you are talking about the file name, well, there is --restrict-file-names=nocontrol. Did you give it a try?

> Is there any way to make wget actually send the string it is given, instead of mangling it on the way out? This is really blocking me.

From what you write, I am unsure whether you are talking about the resulting file name or about HTTP URL encoding in a GET request.

Regards, Tim
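The escaping described in this reply can be reproduced with Python's standard library, which may help to confirm that the requested URL is simply the UTF-8 bytes of the word, percent-escaped. The helper name `escape_query_value` is made up for this sketch.

```python
from urllib.parse import quote, unquote


def escape_query_value(text):
    """UTF-8 encode the text and percent-escape every byte that is not URL-safe."""
    return quote(text, safe="")


encoded = escape_query_value("мазать")
print(encoded)           # %D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
print(unquote(encoded))  # мазать
```

Each Cyrillic letter occupies two bytes in UTF-8, so the six letters become twelve escaped bytes, and decoding restores the original word unchanged.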
[Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Dear all - I am currently trying to use wget to obtain mp3 files from the Google Translate TTS system. In principle this can be done using:

wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"

where TL is a two-letter language code (en, fr, de and so on).

However I am meeting a serious error when I try to send Russian strings (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under Cygwin) and the file system will display the Cyrillic strings no problem. If I provide a command like this:

http://translate.google.com/translate_tts?tl=ru&q=мазать

wget incorrectly processes the Cyrillic characters _before_ sending the http request, so what it actually requests is:

http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This of course produces a string of gibberish in the resulting mp3 file!

Is there any way to make wget actually send the string it is given, instead of mangling it on the way out? This is really blocking me.

Cheers, Stephen
[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file
Follow-up Comment #5, bug #44674 (project wget):

Tim: OK, could somebody please make sure that this example is given near the --debug option section of the man page. It would also be good to have a built-in way to do it, in case it is inconvenient on a given system to install other programs or to start and wait on extra input/output jobs.

Anonymous: the --debug part of the man page doesn't say clearly what it will give; also, --debug might not be compiled in. And in fact --debug gives more than just the request, and --debug requires one to attempt the request without any --dry-run safety mechanism before going on to the net...
[Bug-wget] [bug #44674] Add an option that will send the HTTP request to stderr or a file
Follow-up Comment #4, bug #44674 (project wget): You can use the --debug flag to show the HTTP request and response headers, including when the traffic is encrypted with SSL.
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Hi Tim, Sorry for the ambiguity. To be more specific, the file name is fine: in the shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 . The audio within the file consists of the Google robot voice reading the string of percent-escaped characters literally, not reading the Russian word. I will try Random Coder's suggestion of a more complete user agent string - apparently http://whatsmyuseragent.com/ is a handy way to find out what your browser claims to be :)
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
THANK YOU Random Coder! That did the trick. Apparently my earlier attempts were unsuccessful because the problem I was trying to solve was not the problem I actually had :) Specifically I went to whatsmyuseragent.com and my browser id'd as Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36 . I put that, in quotes, instead of just Mozilla as the argument of the -U option, and now I get back an mp3 file with proper Russian audio in it. Much victory.
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
On 01/04/15 00:16, Stephen Wells wrote:

> Hi Tim, Sorry for the ambiguity. To be more specific, the file name is fine: in the shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 . The audio within the file consists of the Google robot voice reading the string of percent-escaped characters literally, not reading the Russian word. I will try Random Coder's suggestion of a more complete user agent string - apparently http://whatsmyuseragent.com/ is a handy way to find out what your browser claims to be :)

I remember Google had a parameter for the encoding. It may be worth explicitly stating that it's UTF-8; Google may be using a fallback based on the User-Agent.
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
On Tue, Mar 31, 2015 at 10:11 AM, Stephen Wells sawells.2...@gmail.com wrote:

> Dear all - I am currently trying to use wget to obtain mp3 files from the Google Translate TTS system. In principle this can be done using:
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
> ...
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
> This of course produces a string of gibberish in the resulting mp3 file!

That URL is correct; it's what you'll see a browser send across the wire for the same string. Google is producing gibberish because of some User-Agent sniffing that they appear to be doing. If you change the user agent to something more complete, like "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" instead of just "Mozilla", it should work correctly.
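For anyone scripting this outside of wget, the same fix can be sketched in Python: leave the percent-escaped URL as it is and send the fuller User-Agent string quoted above. The function name `build_tts_request` is made up for this sketch, the parameter names `tl` and `q` are taken from the URLs in this thread, and whether the service still behaves this way is not something the sketch can guarantee.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The fuller browser identification suggested in this thread.
FULL_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
)


def build_tts_request(text, lang):
    """Build (but do not send) a request with an escaped query and a full User-Agent."""
    query = urlencode({"tl": lang, "q": text})  # UTF-8 plus percent-escaping
    url = "http://translate.google.com/translate_tts?" + query
    return Request(url, headers={"User-Agent": FULL_USER_AGENT})


request = build_tts_request("мазать", "ru")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual fetch. The sketch stops before that step so that it can be run without network access.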