Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Tim Rühsen
Hi Steven,

On Tuesday, 31 March 2015, 18:11:58, Stephen Wells wrote:
> Dear all - I am currently trying to use wget to obtain mp3 files from the
> Google Translate TTS system. In principle this can be done using:
>
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
>
> where TL is a two-letter language code (en, fr, de and so on).
>
> However I am meeting a serious error when I try to send Russian strings
> (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
> Cygwin) and the file system will display the Cyrillic strings no problem.
> If I provide a command like this:
>
> http://translate.google.com/translate_tts?tl=ru&q=мазать
>
> wget incorrectly processes the Cyrillic characters _before_ sending the
> http request, so what it actually requests is:
>
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This seems to be the correct behavior for a web client.
The URL in the GET request is transmitted UTF-8 encoded, and percent-escaping
is performed for chars > 127 (not mentioning control chars here).
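
As a quick check of the point above, the following minimal sketch reproduces the escaping outside of wget. It uses Python's standard `urllib.parse` module (Python is an illustrative choice here, not something used in the thread) to UTF-8 encode and percent-escape the word from the report, and then decodes it again to show that nothing is lost.

```python
from urllib.parse import quote, unquote

# The Russian word from the report.
word = "мазать"

# quote() encodes the string as UTF-8 by default and percent-escapes
# every byte outside the unreserved ASCII set.
escaped = quote(word)
print(escaped)

# Decoding the escaped form gives back the original word, so the
# escaping itself does not damage the text.
print(unquote(escaped) == word)
```

The escaped form is byte-for-byte the query value wget was reported to send, which suggests the problem lies on the server side rather than in the encoding.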

> This of course produces a string of gibberish in the resulting mp3 file!

This is something different. If you are talking about the file name, well,
there is --restrict-file-names=nocontrol. Did you give it a try?

> Is there any way to make wget actually send the string it is given, instead
> of mangling it on the way out? This is really blocking me.

From what you write, I am unsure if you are talking about the resulting file 
name or about HTTP URL encoding in a GET request.

Regards, Tim



[Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
Dear all - I am currently trying to use wget to obtain mp3 files from the
Google Translate TTS system. In principle this can be done using:

wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"

where TL is a two-letter language code (en, fr, de and so on).

However I am meeting a serious error when I try to send Russian strings
(tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
Cygwin) and the file system will display the Cyrillic strings no problem.
If I provide a command like this:

http://translate.google.com/translate_tts?tl=ru&q=мазать

wget incorrectly processes the Cyrillic characters _before_ sending the
http request, so what it actually requests is:


http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This of course produces a string of gibberish in the resulting mp3 file!

Is there any way to make wget actually send the string it is given, instead
of mangling it on the way out? This is really blocking me.

Cheers,
Stephen


Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
Hi Tim,

Sorry for the ambiguity. To be more specific, the file name is fine: in the
shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 .
The audio within the file consists of the Google robot voice reading the
string of percent-escaped characters literally, not reading the Russian
word.

I will try Random Coder's suggestion of a more complete user agent string -
apparently http://whatsmyuseragent.com/ is a handy way to find out what
your browser claims to be :)


Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Stephen Wells
THANK YOU Random Coder! That did the trick. Apparently my earlier attempts
were unsuccessful because the problem I was trying to solve was not the
problem I actually had :)

Specifically I went to whatsmyuseragent.com and my browser id'd as
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2272.101 Safari/537.36". I put that, in quotes, instead of
just Mozilla as the argument of the -U option, and now I get back an mp3
file with proper Russian audio in it. Much victory.


Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Ángel González

On 01/04/15 00:16, Stephen Wells wrote:

> Hi Tim,
>
> Sorry for the ambiguity. To be more specific, the file name is fine: in the
> shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3 .
> The audio within the file consists of the Google robot voice reading the
> string of percent-escaped characters literally, not reading the Russian
> word.
>
> I will try Random Coder's suggestion of a more complete user agent string -
> apparently http://whatsmyuseragent.com/ is a handy way to find out what
> your browser claims to be :)


I remember Google had a parameter for the encoding. It may be worth
explicitly stating that it's UTF-8; it may be using a fallback based on
the User-Agent.
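
A minimal sketch of what stating the encoding explicitly could look like. The parameter name `ie` (input encoding) is an assumption: the message above does not name the parameter, and nothing in this thread confirms that the TTS endpoint honours it. The helper `tts_url` is hypothetical and only builds the URL; it does not contact the server.

```python
from urllib.parse import urlencode

BASE = "http://translate.google.com/translate_tts"

def tts_url(text, lang, input_encoding="UTF-8"):
    """Build a TTS URL that declares the query encoding explicitly.

    The "ie" parameter name is an assumption, not confirmed by the thread.
    """
    params = {"ie": input_encoding, "tl": lang, "q": text}
    # urlencode() UTF-8 encodes and percent-escapes each value and joins
    # the pairs with "&".
    return BASE + "?" + urlencode(params)

url = tts_url("мазать", "ru")
print(url)
```

If the parameter is recognised, the server no longer needs to guess the encoding from the User-Agent; if it is not, the extra parameter is expected to be ignored.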





Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?

2015-03-31 Thread Random Coder
On Tue, Mar 31, 2015 at 10:11 AM, Stephen Wells sawells.2...@gmail.com wrote:
> Dear all - I am currently trying to use wget to obtain mp3 files from the
> Google Translate TTS system. In principle this can be done using:
>
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
>
> ...
>
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
>
> This of course produces a string of gibberish in the resulting mp3 file!


That URL is correct; it's what you'll see a browser send across the
wire for the same string.  Google is producing gibberish because of
some User-Agent sniffing that they appear to be doing.

If you change the user agent to something that's more complete, like
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36" instead of just Mozilla, it should
work correctly.
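
The same idea outside of wget, as a minimal sketch using Python's standard library (an illustrative choice, not something used in the thread). It builds a request with the percent-escaped query and the fuller User-Agent string quoted above, but deliberately does not send it, since whether the endpoint still answers is not something this thread can confirm. The helper name `build_tts_request` is hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request

# The fuller browser-style User-Agent string suggested above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")

def build_tts_request(text, lang):
    """Return an unsent request for the TTS URL with a full User-Agent."""
    url = ("http://translate.google.com/translate_tts"
           "?tl=" + quote(lang) + "&q=" + quote(text))
    return Request(url, headers={"User-Agent": USER_AGENT})

req = build_tts_request("мазать", "ru")
print(req.full_url)
# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually send it; with wget, the equivalent is giving the whole string, in quotes, as the argument of `-U`.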