Claudio Grondi wrote:
> [EMAIL PROTECTED] wrote:
> > Claudio Grondi wrote:
> >
> >>[EMAIL PROTECTED] wrote:
> >>
> >>>Here is my script:
> >>>
> >>>from mechanize import *
> >>>from BeautifulSoup import *
> >>>import StringIO
> >>>b = Browser()
> >>>f = b.open("http://www.translate.ru/text.asp?lang=ru")
> >>>b.select_form(nr=0)
> >>>b["source"] = "hello python"
> >>>html = b.submit().get_data()
> >>>soup = BeautifulSoup(html)
> >>>print soup.find("span", id = "r_text").string
> >>>
> >>>OUTPUT:
> >>>&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
> >>>----------
> >>>In Russian it looks like:
> >>>"привет питон"
> >>>
> >>>How can I translate this using standard Python libraries??
> >>>
> >>>--
> >>>Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >>>
> >>
> >>Translate to what and with what purpose?
> >>
> >>Assuming your intention is to get a Python Unicode string, what about:
> >>
> >>strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> >>strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> >>strUnicode = eval("u'%s'"%strUnicodeHexCode)
> >>
> >>?
> >>
> >>I am sure there is a more elegant and direct solution, but I just wanted
> >>to provide some quick response here.
> >>
> >>Claudio Grondi
> >
> >
> > Thank you, Claudio.
> > Really interesting solution, but it doesn't work...
> >
> > In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> >
> > In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> >
> > In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)
> >
> > In [22]: print strUnicode
> > ---------------------------------------------------------------------------
> > exceptions.UnicodeEncodeError        Traceback (most recent call last)
> >
> > C:\Documents and Settings\dron\<ipython console>
> >
> > C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
> >      16     def encode(self,input,errors='strict'):
> >      17
> > ---> 18         return codecs.charmap_encode(input,errors,encoding_map)
> >      19
> >      20     def decode(self,input,errors='strict'):
> >
> > UnicodeEncodeError: 'charmap' codec can't encode characters in position
> > 0-5: character maps to <undefined>
> >
> > In [23]: print strUnicode.encode("utf-8")
> > сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
> > <-- it's not my string "привет питон"
> >
> > In [24]: strUnicode.encode("utf-8")
> > Out[24]:
> > '\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90 \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
> > <-- and too many chars
> >
> Have you considered that the HTML page specifies charset=windows-1251
> in its
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
> tag?
> You are apparently on Linux or so, so I can't track this problem down
> having only a Windows box here, but in the meantime I found that there is
> another problem with it:
> I erroneously assumed that the numbers in &#1087; are hexadecimal,
> but they are decimal, so it is necessary to do hex(int('1087')) on them
> to get the right code to put into eval().
> Now that you know the idea, I hope you will succeed as I did with:
>
> >>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
> >>> lstIntUnicodeDecimalCode
> ['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
> '1090', '1086', '1085', '']
> >>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
> >>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
> >>> lstHexUnicode
> ['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
> '0x442', '0x43e', '0x43d']
> >>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
> u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
> >>> strUnicode = eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
> >>> print strUnicode
> приветпитон
>
> Sorry for the mess of not taking the space into consideration, but I think
> you can get the idea anyway.
I hope he *doesn't* get that "idea".

#>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
#>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
#>>> strUnicode
[u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442', u'\u043f',
u'\u0438', u'\u0442', u'\u043e', u'\u043d']
#>>>

--
http://mail.python.org/mailman/listinfo/python-list
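
For anyone who wants the decoded text as a single string with the space preserved, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2, where unichr plays the role of chr), and the helper name decode_numeric_refs is made up for illustration. It replaces each decimal (&#1087;) or hexadecimal (&#x43f;) character reference in place with a regular expression and leaves everything else, including whitespace, untouched. The standard library's html.unescape is used only as a cross-check.

```python
import html
import re


def decode_numeric_refs(text):
    """Replace numeric character references such as &#1087; or &#x43f;
    with the characters they name. Hypothetical helper, not from the thread."""
    def _sub(match):
        body = match.group(1)
        if body[0] in "xX":
            # Hexadecimal form: &#x43f;
            return chr(int(body[1:], 16))
        # Decimal form: &#1087; (the form the translated page returned)
        return chr(int(body))

    return re.sub(r"&#([xX][0-9a-fA-F]+|[0-9]+);", _sub, text)


raw = "&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;"
decoded = decode_numeric_refs(raw)
print(decoded)

# The standard library gives the same result for this input.
assert decoded == html.unescape(raw)

# Reading the decimal number as hexadecimal yields a different character,
# which is what went wrong in the first eval() attempt.
assert chr(1087) == "\u043f"
assert chr(0x1087) != chr(1087)
```

Because the substitution happens in place instead of through split(';'), the space between the two words survives and no eval() is involved.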