I got it working with a utf-8 query by adding an Accept-Charset header to the request. I used the 'Tamper Data' add-on for Firefox to view all the request headers the browser sends, added the same headers to the Python request, and it worked. Then I removed headers one at a time until I found the one that was needed. Here is a stripped-down version of your code that posts a word encoded in utf-8 and gets the correct response. I also changed the post parameters a little to match what I see in my browser:

import re, urllib, urllib2

__where = [ re.compile(r'name=\"q\">([^<]*)'),
            re.compile(r'td bgcolor=white>([^<]*)'),
            re.compile(r'td bgcolor=white class=s><div style=padding:10px;>([^<]*)'),
            re.compile(r'<\/strong><br>([^<]*)')
          ]

phrase = 'ent\xc3\xa3o'
params = urllib.urlencode( { 'doit' : 'done',
                             'tt' : 'urltext',
                             'trtext' : phrase,
                             'intl' : 1,
                             'lp' : 'pt_en' } )
print "URL encoding ", params

req = urllib2.Request('http://world.altavista.com/babelfish/tr')
req.add_header('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7')
response = urllib2.urlopen(req, params)
html = response.read()

for regex in __where:
    match = regex.search(html)
    if match:
        print match.group(1)
        break
else:
    print "ERROR MATCHING"
    print html

Kent

Kent Johnson wrote:
> OK this is actually starting to make sense :-) Here is what I think is
> happening:
>
> You get different results in the IDE and the console because they are using
> different encodings. The IDE is using utf-8 so the params are encoded in
> utf-8. The console is using latin-1 and you get encoded latin-1 params.
>
> When you use babelfish from the browser it gets a page in utf-8 and sends the
> parameters back the same way, but probably with a header saying it is utf-8.
> When you use urllib you don't tell it the encoding so it is assuming latin-1,
> that's why the interpreter version works.
>
> So in your GUI version if you get utf-8 from the GUI, you can convert it to
> latin-1 by
> phrase.decode('utf-8').encode('latin-1') as long as your text can be
> expressed in latin-1. If you need utf-8 then you have to figure out how to
> tell babelfish that you are sending utf-8.
>
> Kent
>
> PS please reply to the list not to me personally.
>
> Jorge Louis de Castro wrote:
>
>>Thanks again,
>>
>>I'm sorry to be such a PITB but this is driving me insane! the code
>>below easily connects to babelfish and returns a translated string.
>>
>>__where = [ re.compile(r'name=\"q\">([^<]*)'),
>>            re.compile(r'td bgcolor=white>([^<]*)'),
>>            re.compile(r'td bgcolor=white class=s><div style=padding:10px;>([^<]*)'),
>>            re.compile(r'<\/strong><br>([^<]*)') ]
>>
>>def clean(text):
>>    return ' '.join(string.replace(text.strip(), "\n", ' ').split())
>>
>>def translateByCode(phrase, from_code, to_code):
>>    phrase = clean(phrase)
>>    params = urllib.urlencode( { 'BabelFishFrontPage' : 'yes',
>>                                 'doit' : 'done',
>>                                 'urltext' : phrase,
>>                                 'lp' : from_code + '_' + to_code } )
>>    print "URL encoding ", params
>>    try:
>>        response = urllib.urlopen('http://world.altavista.com/babelfish/tr', params)
>>    except IOError, what:
>>        print "ERROR TRANSLATING ", what
>>    except:
>>        print "Unexpected error:", sys.exc_info()[0]
>>
>>    html = response.read()
>>    for regex in __where:
>>        match = regex.search(html)
>>        if match: break
>>    if not match: print "ERROR MATCHING"
>>    return clean(match.group(1))
>>
>>if __name__ == '__main__':
>>    print translateByCode('então', 'pt', 'en')
>>
>>If I run this through the Run option on the IDE I get the following output:
>>
>>URL encoding doit=done&urltext=ent%C3%A3o&BabelFishFrontPage=yes&lp=pt_en
>>então
>>então
>>
>>If I import this module on the interpreter and then call
>>
>>print translateByCode('então', 'pt', 'en')
>>
>>I get:
>>
>>URL encoding doit=done&urltext=ent%E3o&BabelFishFrontPage=yes&lp=pt_en
>>then
>>then
>>
>>Now the urllib encoding of the urltext IS different ("ent%C3%A3o" VS
>>"ent%E3o") even though I'm passing the same stuff!
>>And this works fine except when I use special characters and I don't
>>know how to use the utf-8 encoding to get this working - I know altavista
>>uses utf-8 because they also translate Chinese.
>>
>>Thanks again and sorry for the blurb but I ran out of solutions for this
>>one.
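
The difference between "ent%C3%A3o" and "ent%E3o" can be reproduced without contacting babelfish at all. The following is only a minimal sketch, and it assumes Python 3 (urllib.parse) rather than the Python 2 urllib/urllib2 used in this thread; the variable names are made up for illustration. It shows that the same word becomes different bytes under utf-8 and latin-1, that URL encoding works on those bytes, and that the decode/encode conversion suggested in the quoted reply turns one form into the other.

```python
# Python 3 sketch (assumption: the thread's own code is Python 2).
from urllib.parse import urlencode

phrase = "ent\u00e3o"  # the word 'entao' with a tilde on the 'a'

# The same text is a different byte sequence in each encoding.
utf8_bytes = phrase.encode("utf-8")      # 'a-tilde' takes two bytes: C3 A3
latin1_bytes = phrase.encode("latin-1")  # 'a-tilde' takes one byte: E3

# URL encoding escapes bytes, so the two query strings differ.
utf8_query = urlencode({"urltext": utf8_bytes})
latin1_query = urlencode({"urltext": latin1_bytes})
print(utf8_query)
print(latin1_query)

# Converting utf-8 input to latin-1, as in the quoted reply.
converted = utf8_bytes.decode("utf-8").encode("latin-1")
print(converted == latin1_bytes)

# Text outside latin-1 (here a Chinese character) cannot be converted.
try:
    "\u4e2d".encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

So the IDE and the interpreter are passing the same characters but not the same bytes, which is why the server sees two different requests.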
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor