Re: Html character entity conversion

2006-09-09 Thread yichun
[EMAIL PROTECTED] wrote:
 danielx wrote:
 [EMAIL PROTECTED] wrote:
 Here is my script:

 from mechanize import *
 from BeautifulSoup import *
 import StringIO
 b = Browser()
 f = b.open("http://www.translate.ru/text.asp?lang=ru;")
 b.select_form(nr=0)
 b["source"] = "hello python"
 html = b.submit().get_data()
 soup = BeautifulSoup(html)
 print  soup.find("span", id = "r_text").string

 OUTPUT:
 &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
 &#1087;&#1080;&#1090;&#1086;&#1085;
 --
 In russian it looks like:
 привет питон

 How can I translate this using standard Python libraries??

 --
 
 Thank you for response.
 It doesn't matter what is 'BeautifulSoup'...

However, the best solution is to ask BeautifulSoup to do that for you.
If you do

soup = BeautifulSoup(your_html_page, convertEntities="html")

you should not have to worry about the problem you had. This converts all
the HTML entities (the five you see as soup.entitydefs) and all the
&#xxx; references to Python unicode strings.

yichun
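
For readers on a current Python 3 (an assumption; the thread itself uses Python 2), the standard library alone can probably do the same conversion. A minimal sketch using html.unescape, which exists since Python 3.4; the variable names are only illustrative:

```python
import html

# The OP's string with the numeric character references written out in full.
encoded = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
           "&#1087;&#1080;&#1090;&#1086;&#1085;")

# html.unescape converts numeric (&#1087;) and named (&egrave;) references alike.
decoded = html.unescape(encoded)
print(decoded)
```

The same call also handles named entities, so no separate table is needed for those.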


 General question is:
 
 How can I convert encoded string
 
 sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
 
 into human readable:
 
 sDecodedHtmlText  == 'привет питон'
 


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Html character entity conversion

2006-08-01 Thread Anthra Norell
Pak (or Andrei, whichever is your first name),

  My proposal below:


- Original Message -
From: [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Sunday, July 30, 2006 8:52 PM
Subject: Re: Html character entity conversion


 danielx wrote:
  [EMAIL PROTECTED] wrote:
   How can I translate this using standard Python libraries??
  
   --
   Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
 

I've been proposing solutions of late using a stream editor I recently wrote,
realizing each time how well it works in a variety of different situations.
I can only hope I am not beginning to get on people's nerves (Here he comes
again with his damn thing!).
  I base the following on proposals others have made so far, because I
haven't used unicode and know little about it. If nothing else, I do think
this is a rather elegant way to translate the ampersand escapes to unicode
strings. Having to read them through an 'eval', though, doesn't seem to be
the ultimate solution. I couldn't assign a unicode string to a variable so
that it would print text as Claudio proposed.


Here is my htm example:

htm = StringIO.StringIO ('''
<htm>
  <!-- Examen -->
  <head><title>Deuxi&egrave;me question</title></head>
<body bgcolor=#beb4a0 text=#82 etc. >
  <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
</body>
</htm> ''')

And here is my SE hack:

import SE  # Available at the Cheese Shop
Ampersand_Filter = SE.SE (' <EAT> ~&#[0-9]+;~==(10) ')
for line in htm:
    line = line [:-1]
    ampersand_codes = Ampersand_Filter (line [:-1])
    # A list of the ampersand codes found in the current line
    if ampersand_codes:
        # From it we edit the substitution definitions for the current line
        substitutions = ''
        for code in ampersand_codes.split ('\n')[:-1]:
            substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
        # And make a custom Editor just for the current line
        Line_Unicoder = SE.SE (substitutions)
        unicode_line = Line_Unicoder (line)
        print eval ('u"%s"' % unicode_line)
    else:
        print line

<htm>
  <!-- Examen -->
  <head><title>Deuxi&egrave;me question</title></head>
<body bgcolor=#beb4a0 text=#82 etc. >
  <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;привет питон<br>
</body>
</htm>

This is a textbook example of dynamic substitutions. Typically SE compiles
static substitution lists. But with 2**16 (?) unicode characters, building a
static list would be absurd, if possible at all. So we dynamically make
custom substitutions for each line after extracting the ampersand escapes
that may be there.

Next we would like to fix the regular ascii ampersand escapes and also strip 
the tags. That is a simple question of preprocessing
the file.

Legibilizer = SE.SE ('htm2iso.se ~<(.|\n)*?>~= ~<!--(.|\n)*?-->~= ')

'htm2iso.se' is a substitutions definition file that defines the standard ascii 
ampersands to characters. It is included in the SE
package. You can name as many definition files as you want. In a definition 
string the name of a file is equivalent to its contents.

htm.seek (0)
htm_no_tags = Legibilizer (htm.read ())
for line in htm_no_tags.split ('\n'):
    if line.strip () == '': continue
    ampersand_codes = Ampersand_Filter (line)
    ... (same as above)

  Deuxième question
  L'élève doit lire et traduire: привет питон


Whether this serves your purpose I don't really know. How you can use it
other than by reading it in the IDLE window, I don't know either. I tried to
copy it out, but it doesn't survive the operation, and the paste has question
marks or squares in place of the Russian letters.
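
One way around the copy-and-paste problem is probably to write the decoded text to a file with an explicit encoding instead of relying on the console. A small sketch in Python 3 (an assumption, the code above is Python 2); the file name and the temporary directory are made up for illustration:

```python
import os
import tempfile

text = "L'élève doit lire et traduire: привет питон"

# Write with an explicit encoding so the result does not depend on the console.
path = os.path.join(tempfile.mkdtemp(), "decoded.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Reading it back with the same encoding returns the identical string.
with open(path, encoding="utf-8") as src:
    round_trip = src.read()

print(round_trip == text)
```

Any editor that understands UTF-8 should then show the Russian letters intact.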

Regards

Frederic



Re: Html character entity conversion

2006-08-01 Thread Claudio Grondi
Anthra Norell wrote:

import SE# Available at the Cheese Shop

I mean, the OP asked:
   'How can I translate this using standard Python libraries??'

so this is simply off topic.

Claudio Grondi


Re: Html character entity conversion

2006-08-01 Thread Duncan Booth
[EMAIL PROTECTED] wrote:

 How can I convert encoded string
 
 sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
 
 into human readable:
 
 sDecodedHtmlText  == 'привет питон'

How about:

>>> import re
>>> sEncodedHtmlText = 'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
>>> def unescape(m):
...     return unichr(int(m.group(0)[2:-1]))
...
>>> print re.sub('&#[0-9]+;', unescape, sEncodedHtmlText)
text: ???

I'm afraid my newsreader couldn't cope with either your original text or my 
output, but I think this gives the string you wanted. You probably also 
ought to decode sEncodedHtmlText to unicode first otherwise anything which 
isn't an entity escape will be converted to unicode using the default ascii 
encoding.
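
The same idea carried over to Python 3 might look like the sketch below. This is an assumption about how one would write it today, not what was run in this thread: chr replaces unichr, strings are already unicode, and the helper names (CHARREF, unescape_numeric, decode_charrefs) are invented here. It also accepts hexadecimal references such as &#x43f;, which the snippet above does not.

```python
import re

# Decimal (&#1087;) or hexadecimal (&#x43f;) numeric character references.
CHARREF = re.compile(r"&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));")

def unescape_numeric(match):
    # group(1) holds a decimal number, group(2) a hexadecimal one.
    if match.group(1) is not None:
        return chr(int(match.group(1)))
    return chr(int(match.group(2), 16))

def decode_charrefs(text):
    # Only the references are replaced; all other text is left untouched.
    # chr() raises ValueError above 0x10FFFF; that case is not handled here.
    return CHARREF.sub(unescape_numeric, text)

print(decode_charrefs("text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
                      "&#x43f;&#x438;&#x442;&#x43e;&#x43d;"))
```

Because the substitution only touches what the pattern matches, surrounding text such as the "text: " prefix passes through unchanged.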


Re: Html character entity conversion

2006-08-01 Thread Anthra Norell

- Original Message - 
From: Claudio Grondi [EMAIL PROTECTED]
Newsgroups: comp.lang.python
To: python-list@python.org
Sent: Tuesday, August 01, 2006 2:42 PM
Subject: Re: Html character entity conversion


 Anthra Norell wrote:
 
 import SE# Available at the Cheese Shop
 
 I mean, that OP requested:
'How can I translate this using standard Python libraries??'
 
 so it's just only not on topic.
 
 Claudio Grondi

Claudio,

I was hoping to do the OP a service. Are you also hoping to do him a service?

Frederic




Re: Html character entity conversion

2006-07-31 Thread Claudio Grondi
John Machin wrote:
 Claudio Grondi wrote:
 
Sorry for that mess not taking the space into consideration, but I think
  you can get the idea anyway.
 
 
 I hope he *doesn't* get that idea.
 
 >>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
 >>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
 >>> strUnicode
 [u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
 u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
Knowing about the built-in function unichr() is a good thing, but
there are still drawbacks, because (not tested!) e.g.:
'100x hallo Python' translates to
'100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1055;&#1080;&#1090;&#1086;&#1085;'
and this can't be handled by swapping the eval() stuff for unichr(),
because the underlying approach of using .replace() and .split() works
only on the given example, not in the general case.
I am just too lazy to sit down and write code that extracts the &#...;
sequences from the HTML and converts only them, leaving the rest of the
string unchanged, in order to arrive at a solution that works in the
general case (it should not be hard, and I suppose the OP has it already
:-) if he is at a Python skill level of playing around with the
mechanize module).
I am still convinced that there must be a more elegant and direct
solution, so the subject is still fully open for improvements towards
the actual final goal.
I 

Re: Html character entity conversion

2006-07-30 Thread Claudio Grondi
[EMAIL PROTECTED] wrote:
 How can I translate this using standard Python libraries??
 
 --
 Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
 
Translate to what and with what purpose?

Assuming your intention is to get a Python Unicode string, what about:

strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
strUnicode = eval("u'%s'"%strUnicodeHexCode)

?

I am sure, there is a more elegant and direct solution, but just wanted 
to provide here some quick response.

Claudio Grondi

Re: Html character entity conversion

2006-07-30 Thread danielx
[EMAIL PROTECTED] wrote:
 How can I translate this using standard Python libraries??

 --
 Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

I'm having trouble understanding how your script works (what would a
BeautifulSoup function do?), but assuming your intent is to find
character references in an HTML document, you might try using
the HTMLParser class in the HTMLParser module. This class dispatches to
several handler methods. One of them is handle_charref. It will be called
with one argument, the name of the reference, which includes only the
number part. HTMLParser is a lot more powerful than that, though. There
may be something more lightweight out there that will accomplish what you
want. Then again, you might be able to find a use for all that power :P.
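
A rough sketch of the handle_charref idea, written for Python 3, where the class lives in html.parser (an assumption, since the thread is on Python 2). The class name CharrefCollector and the helper text_of are invented for illustration. Named entities such as &nbsp; would go to handle_entityref, which this sketch leaves at its do-nothing default.

```python
from html.parser import HTMLParser

class CharrefCollector(HTMLParser):
    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags and references.
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "1087" for &#1087; and "x43f" for &#x43f;
        if name[0] in "xX":
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

def text_of(markup):
    parser = CharrefCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

print(text_of('<span id="r_text">&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; '
              '&#1087;&#1080;&#1090;&#1086;&#1085;</span>'))
```

Tags are simply skipped because no tag handlers are defined, so only the text content comes back.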


Re: Html character entity conversion

2006-07-30 Thread [EMAIL PROTECTED]

Claudio Grondi wrote:
 Assuming your intention is to get a Python Unicode string, what about:

 strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
 strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
 strUnicode = eval("u'%s'"%strUnicodeHexCode)

 ?

 I am sure, there is a more elegant and direct solution, but just wanted
 to provide here some quick response.

 Claudio Grondi

Thank you, Claudio.
Really interesting solution, but it doesn't work...

In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')

In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)

In [22]: print strUnicode
---------------------------------------------------------------------------
exceptions.UnicodeEncodeError             Traceback (most recent call last)

C:\Documents and Settings\dron\<ipython console>

C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
     16     def encode(self,input,errors='strict'):
     17
---> 18         return codecs.charmap_encode(input,errors,encoding_map)
     19
     20     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-5: character maps to <undefined>

In [23]: print strUnicode.encode("utf-8")
сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
<-- it's not my string привет питон

In [24]: strUnicode.encode("utf-8")
Out[24]:
'\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90 \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
<-- and too many chars
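
One plausible reading of this traceback, sketched in Python 3 (an assumption, the session above is Python 2): cp866 is a Cyrillic code page, so real Russian text should encode without trouble. The error would then suggest that the string holds code points like U+1087 rather than U+043F, that is, the decimal numbers were taken as hexadecimal. The variable names below are only illustrative.

```python
# Decimal 1087, 1088, ... correctly taken as code points U+043F, U+0440, ...
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"
# The same digits misread as hexadecimal, which is what '\u1087' produces.
misread = "\u1087\u1088\u1080\u1074\u1077\u1090"

# The real Russian word fits into cp866, one byte per letter.
print(len(cyrillic.encode("cp866")))

# The misread code points are not in cp866, so encoding fails.
try:
    misread.encode("cp866")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)

# UTF-8 for U+1087 matches the first three bytes shown in Out[24].
first_bytes = "\u1087".encode("utf-8")
print(first_bytes)
```

Three bytes per character would also explain the "too many chars" remark: Cyrillic letters take two bytes each in UTF-8, not three.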


Re: Html character entity conversion

2006-07-30 Thread [EMAIL PROTECTED]
danielx wrote:
 [EMAIL PROTECTED] wrote:
  How can I translate this using standard Python libraries??
 
  --
  Pak Andrei, http://paxoblog.blogspot.com, icq://97449800

 I'm having trouble understanding how your script works (what would a
 BeautifulSoup function do?), but assuming your intent is to find
 character reference objects in an html document, you might try using
 the HTMLParser class in the HTMLParser module. This class delegates
 several methods. One of them is handle_charref. It will be called with
 one argument, the name of the reference, which includes only the number
 part. HTMLParser is alot more powerful than that though. There may be
 something more light-weight out there that will accomplish what you
 want. Then again, you might be able to find a use for all that power :P.

Thank you for the response.
It doesn't matter what 'BeautifulSoup' is...
The general question is:

How can I convert the encoded string

sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'

into human readable:

sDecodedHtmlText  == 'привет питон'


Re: Html character entity conversion

2006-07-30 Thread Marc 'BlackJack' Rintsch
In [EMAIL PROTECTED],
[EMAIL PROTECTED] wrote:

 How can I translate this using standard Python libraries??

Have you tried a more recent version of BeautifulSoup?  IIRC current
versions always decode text to unicode objects before returning them.

Ciao,
Marc 

Re: Html character entity conversion

2006-07-30 Thread Claudio Grondi
[EMAIL PROTECTED] wrote:
 Thank you, Claudio.
 Really interest solution, but it doesn't work...
 
Have you considered that the HTML page specifies charset=windows-1251
in its
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
tag?
You are apparently on Linux or so, so I can't track this problem down
having only a Windows box here, but in the meantime I have found that
there is another problem with it:
I erroneously assumed that the numbers in &#1087; are hexadecimal,
but they are decimal, so it is necessary to do hex(int('1087')) on them
to get the right code to put into eval().
Now that you know the idea, I hope you will succeed as I did with:

>>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
>>> lstIntUnicodeDecimalCode
['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
'1090', '1086', '1085', '']
>>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
>>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
>>> lstHexUnicode
['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
'0x442', '0x43e', '0x43d']
>>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
>>> strUnicode = eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
>>> print strUnicode
приветпитон

Sorry for the mess of not taking the space into consideration, but I think
you can get the idea anyway.
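
The decimal-versus-hexadecimal point can be checked in a few lines. This is a Python 3 sketch (an assumption, the session above is Python 2) with illustrative variable names. It uses chr() on the integer directly, so no string has to be built for eval().

```python
decimal_codes = ["1087", "1088", "1080", "1074", "1077", "1090"]

# hex(int("1087")) is '0x43f': the reference &#1087; means U+043F, not U+1087.
hex_codes = [hex(int(code)) for code in decimal_codes]

# chr() accepts the integer directly, so the detour through hex() is optional.
word = "".join(chr(int(code)) for code in decimal_codes)

print(hex_codes)
print(word)
```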

Claudio Grondi

Re: Html character entity conversion

2006-07-30 Thread John Machin
Claudio Grondi wrote:
 Sorry for that mess not taking the space into consideration, but I think
   you can get the idea anyway.

I hope he *doesn't* get that idea.

>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
>>> strUnicode
[u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
