Re: Html character entity conversion

Anthra Norell Tue, 01 Aug 2006 01:13:03 -0700

Pak (or Andrei, whichever is your first name),

      My proposal below:



----- Original Message -----
From: <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: <python-list@python.org>
Sent: Sunday, July 30, 2006 8:52 PM
Subject: Re: Html character entity conversion


> danielx wrote:
> > [EMAIL PROTECTED] wrote:
> > > Here is my script:
> > >
> > > from mechanize import *
> > > from BeautifulSoup import *
> > > import StringIO
> > > b = Browser()
> > > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > > b.select_form(nr=0)
> > > b["source"] = "hello python"
> > > html = b.submit().get_data()
> > > soup = BeautifulSoup(html)
> > > print  soup.find("span", id = "r_text").string
> > >
> > > OUTPUT:
> > > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
> > > &#1087;&#1080;&#1090;&#1086;&#1085;
> > > ----------
> > > In russian it looks like:
> > > "привет питон"
> > >
> > > How can I translate this using standard Python libraries??
> > >
> > > --
> > > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >

I've been proposing solutions of late using a stream editor I recently
wrote, realizing each time how well it works in a variety of different
situations. I can only hope I am not beginning to get on people's nerves
(Here he comes again with his damn thing!).
      I base the following on proposals others have made so far, because I
haven't used Unicode and know little about it. If nothing else, I do think
this is a rather elegant way to translate the ampersand escapes to unicode
strings. Having to read them through an 'eval', though, doesn't seem to be
the ultimate solution. I couldn't assign a unicode string to a variable so
that it would print text as Claudio proposed.


Here is my htm example:

>>> htm = StringIO.StringIO ('''
<htm>
  <!-- Examen -->
  <head><title>Deuxi&egrave;me question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
      <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
    </body>
</htm> ''')

And here is my SE hack:

>>> import SE    # Available at the Cheese Shop
>>> Ampersand_Filter = SE.SE (' <EAT> "~&#[0-9]+;~==(10)" ')
>>> for line in htm:
            line = line [:-1]   # Drop the trailing newline
            ampersand_codes = Ampersand_Filter (line)
            # A list of the ampersand codes found in the current line
            if ampersand_codes:
               # From it we edit the substitution definitions for the current line
               substitutions = ''
               for code in ampersand_codes.split ('\n')[:-1]:
                  substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
               # And make a custom Editor just for the current line
               Line_Unicoder = SE.SE (substitutions)
               unicode_line = Line_Unicoder (line)
               print eval ('u"%s"' % unicode_line)
            else:
               print line

<htm>
  <!-- Examen -->
  <head><title>Deuxi&egrave;me question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
      <b>L&acute;&eacute;l&egrave;ve doit lire et traduire:</b>&nbsp;привет питон<br>
    </body>
</htm>

This is a textbook example of dynamic substitutions. Typically SE compiles
static substitution lists. But with 2**16 (?) unicodes, building a static
list would be absurd, if possible at all. So we dynamically make custom
substitutions for each line after extracting the ampersand escapes that may
be there.
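
For comparison, the same per-line idea can be sketched with the standard
library alone. This is only a sketch: it assumes Python 3, where chr()
accepts any code point, and the helper name decode_numeric_entities is made
up for illustration and is not part of SE. Because re.sub calls a function
for every match, no substitution list has to be built and no 'eval' is
involved.

```python
import re

# Matches decimal numeric character references such as &#1087;
NUMERIC_ENTITY = re.compile(r'&#([0-9]+);')

def decode_numeric_entities(text):
    """Replace each decimal reference &#N; with the character chr(N)."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)

line = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
decoded = decode_numeric_entities(line)
print(decoded)
```

Text without any numeric escapes passes through unchanged, so the
"if ampersand_codes" test is not needed in this variant.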

Next we would like to fix the regular ascii ampersand escapes and also strip
the tags. That is a simple question of preprocessing the file.

>>> Legibilizer = SE.SE ('htm2iso.se "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')

'htm2iso.se' is a substitution definition file that maps the standard ascii
ampersand escapes to characters. It is included in the SE package. You can
name as many definition files as you want. In a definition string the name
of a file is equivalent to its contents.

>>> htm.seek (0)
>>> htm_no_tags = Legibilizer (htm.read ())
>>> for line in htm_no_tags.split ('\n'):
         if line.strip () == '': continue
         ampersand_codes = Ampersand_Filter (line)
         ... (same as above)

  Deuxième question
      L'élève doit lire et traduire: привет питон
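
A standard-library approximation of this preprocessing step might look like
the sketch below. It assumes Python 3.4 or later, where html.unescape
exists; the name legibilize is hypothetical and the sample markup is a
shortened version of the example above. It is not a reimplementation of
'htm2iso.se': html.unescape decodes &acute; to the acute accent character
U+00B4 and &nbsp; to a no-break space, which may differ from what the SE
definition file produces.

```python
import html
import re

# Comments are removed first, then any remaining tag.
# DOTALL lets '.' span line breaks inside a comment.
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
TAG = re.compile(r'<[^>]*>')

def legibilize(markup):
    """Strip comments and tags, then decode named and numeric entities."""
    text = TAG.sub('', COMMENT.sub('', markup))
    return html.unescape(text)

sample = ('<head><title>Deuxi&egrave;me question</title></head>\n'
          '<!-- Examen -->\n'
          '<b>L&acute;&eacute;l&egrave;ve doit lire:</b>&nbsp;'
          '&#1087;&#1080;&#1090;&#1086;&#1085;<br>')

# Print only the lines that still carry text after stripping.
for line in legibilize(sample).split('\n'):
    if line.strip():
        print(line)
```

Since html.unescape handles the numeric escapes as well, this variant needs
no separate pass for the Russian text.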


Whether this serves your purpose I don't really know. How you can use it
other than read it in the IDLE window, I don't know either. I tried to copy
it out, but it doesn't survive the operation and the paste has question
marks or squares in the place of the Russian letters.
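
One way to get the text out of the interpreter without going through the
clipboard is to write it to a file with an explicit encoding. The sketch
below assumes Python 3 and uses a made-up file name in a temporary
directory. Whether it cures the question marks depends on the program that
later opens the file: that program has to read it as UTF-8 and use a font
with Cyrillic glyphs.

```python
import os
import tempfile

# "привет питон", written with escapes so the source file stays ASCII
text = '\u043f\u0440\u0438\u0432\u0435\u0442 \u043f\u0438\u0442\u043e\u043d'

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'translation.txt')

# An explicit encoding avoids depending on the platform default.
with open(path, 'w', encoding='utf-8') as out:
    out.write(text)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding='utf-8') as src:
    round_trip = src.read()

print(round_trip == text)
```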

Regards

Frederic


-- 
http://mail.python.org/mailman/listinfo/python-list
