Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Hi Steven,

On Tuesday, 31 March 2015, 18:11:58, Stephen Wells wrote:
> Dear all - I am currently trying to use wget to obtain mp3 files from the
> Google Translate TTS system. In principle this can be done using:
>
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
>
> where TL is a two-letter language code (en, fr, de and so on).
>
> However I am meeting a serious error when I try to send Russian strings
> (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under
> Cygwin) and the file system will display the Cyrillic strings no problem.
> If I provide a command like this:
>
> http://translate.google.com/translate_tts?tl=ru&q=мазать
>
> wget incorrectly processes the Cyrillic characters _before_ sending the
> http request, so what it actually requests is:
>
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This seems to be the correct behavior of a web client. The URL in the GET request is transmitted UTF-8 encoded and percent escaping is performed for chars > 127 (not mentioning control chars here).

> This of course produces a string of gibberish in the resulting mp3 file!

This is something different. If you are talking about the file name, well, there is --restrict-file-names=nocontrol. Did you give it a try?

> Is there any way to make wget actually send the string it is given,
> instead of mangling it on the way out? This is really blocking me.

From what you write, I am unsure if you are talking about the resulting file name or about HTTP URL encoding in a GET request.

Regards, Tim
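As a rough illustration of the escaping Tim describes, the sketch below percent-encodes the UTF-8 bytes of a string in plain shell. The `urlencode` helper is hypothetical (it is not part of wget), and for simplicity it escapes every byte, including ASCII characters that a real client would normally leave alone.

```shell
#!/bin/sh
# Hypothetical helper, not part of wget: percent-encode every byte of "$1".
# Unlike a real web client it also escapes plain ASCII, which is still a
# valid (if verbose) encoding.
urlencode() {
    # Dump the bytes as hex, drop od's spacing, and upper-case the digits.
    hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')
    # Prefix each two-digit byte with '%'.
    printf '%s\n' "$hex" | sed 's/../%&/g'
}

urlencode 'мазать'
```

For the word from the thread this prints the same escaped form that wget put on the wire, which is consistent with the point that the request itself was not being mangled.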
[Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Dear all -

I am currently trying to use wget to obtain mp3 files from the Google Translate TTS system. In principle this can be done using:

wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"

where TL is a two-letter language code (en, fr, de and so on).

However I am meeting a serious error when I try to send Russian strings (tl=ru) in Cyrillic characters. I'm working in a UTF-8 environment (under Cygwin) and the file system will display the Cyrillic strings no problem. If I provide a command like this:

http://translate.google.com/translate_tts?tl=ru&q=мазать

wget incorrectly processes the Cyrillic characters _before_ sending the http request, so what it actually requests is:

http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C

This of course produces a string of gibberish in the resulting mp3 file!

Is there any way to make wget actually send the string it is given, instead of mangling it on the way out? This is really blocking me.

Cheers, Stephen
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
Hi Tim,

Sorry for the ambiguity. To be more specific, the file name is fine: in the shell script the file name $*.mp3 expands correctly to e.g. мазать.mp3. The audio within the file consists of the Google robot voice reading the string of percent-escaped characters literally, not reading the Russian word.

I will try Random Coder's suggestion of a more complete user agent string - apparently http://whatsmyuseragent.com/ is a handy way to find out what your browser claims to be :)

On Tue, Mar 31, 2015 at 9:50 PM, Tim Rühsen tim.rueh...@gmx.de wrote:
> From what you write, I am unsure if you are talking about the resulting
> file name or about HTTP URL encoding in a GET request.
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
THANK YOU Random Coder! That did the trick. Apparently my earlier attempts were unsuccessful because the problem I was trying to solve was not the problem I actually had :) Specifically I went to whatsmyuseragent.com and my browser id'd as Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36 . I put that, in quotes, instead of just Mozilla as the argument of the -U option, and now I get back an mp3 file with proper Russian audio in it. Much victory.
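A minimal sketch of the working invocation described above, assuming the endpoint and parameters quoted earlier in the thread (the service may no longer behave this way). The names `UA`, `tts_url` and `fetch_tts` are hypothetical, introduced only for illustration; `-U` and `-O` are the wget options already used in the thread.

```shell
#!/bin/sh
# Full browser User-Agent string, as reported in the message above.
UA='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'

# Hypothetical helper: build the request URL from a language code and a phrase.
tts_url() {
    printf 'http://translate.google.com/translate_tts?tl=%s&q=%s\n' "$1" "$2"
}

# Hypothetical wrapper: download the spoken phrase to "<phrase>.mp3".
# The quotes matter: they keep the shell from splitting the User-Agent
# string on spaces and from treating '&' in the URL as "run in background".
fetch_tts() {
    wget -U "$UA" -O "$2.mp3" "$(tts_url "$1" "$2")"
}

# Usage (performs a network request, so it is left commented out):
# fetch_tts ru 'мазать'
```

The Cyrillic text is passed through unescaped here because, as noted elsewhere in the thread, wget percent-escapes it when it sends the request.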
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
On 01/04/15 00:16, Stephen Wells wrote:
> Hi Tim,
>
> Sorry for the ambiguity. To be more specific, the file name is fine: in
> the shell script the file name $*.mp3 expands correctly to e.g.
> мазать.mp3. The audio within the file consists of the Google robot voice
> reading the string of percent-escaped characters literally, not reading
> the Russian word.
>
> I will try Random Coder's suggestion of a more complete user agent string
> - apparently http://whatsmyuseragent.com/ is a handy way to find out what
> your browser claims to be :)

I remember Google had a parameter for the encoding. It may be worth stating explicitly in the request that it's UTF-8; Google may be using a fallback based on the User-Agent.
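A sketch of what declaring the encoding in the request might look like. The parameter name `ie` (input encoding) is an assumption: the message above does not name the parameter, and nothing in this thread confirms it. The helper `tts_url_with_encoding` is likewise hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build a TTS URL that states the input encoding explicitly.
# ASSUMPTION: 'ie' is the name of Google's input-encoding parameter; the
# thread only says that such a parameter exists, not what it is called.
tts_url_with_encoding() {
    printf 'http://translate.google.com/translate_tts?ie=UTF-8&tl=%s&q=%s\n' "$1" "$2"
}

tts_url_with_encoding ru 'мазать'
```

If the server really does fall back to a User-Agent-based guess, an explicit declaration like this could make the result independent of the `-U` value, but that is untested here.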
Re: [Bug-wget] Incorrect handling of Cyrillic characters in http request - any workaround?
On Tue, Mar 31, 2015 at 10:11 AM, Stephen Wells sawells.2...@gmail.com wrote:
> Dear all - I am currently trying to use wget to obtain mp3 files from the
> Google Translate TTS system. In principle this can be done using:
>
> wget -U Mozilla -O ${string}.mp3 "http://translate.google.com/translate_tts?tl=TL&q=${string}"
> ...
> http://translate.google.com/translate_tts?tl=ru&q=%D0%BC%D0%B0%D0%B7%D0%B0%D1%82%D1%8C
>
> This of course produces a string of gibberish in the resulting mp3 file!

That URL is correct; it's what you'll see a browser send across the wire for the same string. Google is producing gibberish because of some User-Agent sniffing that they appear to be doing. If you change the user agent to something more complete, like

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

instead of just Mozilla, it should work correctly.