Re: [Tutor] Remove certain tags in html files

Eric Brunson Fri, 27 Jul 2007 12:35:52 -0700


Man, the docs on the HTMLParser module are really sparse.

Attached is some code I just whipped out that will parse an HTML file, suppress the output of the tags you mention and spew the HTML back out. It's just a rough thing, you'll still have to read the docs and make sure to expand on some of the things it's doing, but I think it'll handle 95% of what it comes across. Be sure to override all the "handle_*()" methods I didn't.

My recommendation would be to shove your HTML through BeautifulSoup to ensure it is well formed, then run it through the HTML parser to do whatever you want to change it, then through tidy to make it look nice.

If you wanted to take the time, you could probably write the entire tidy process in the parser. I got a fair ways there, but decided it was too long to be instructional, so I pared it back to what I've included.


Hope this gets you started,
e.

Eric Brunson wrote:

Eric Brunson wrote:
Sebastien Noel wrote:
Hi,
I'm doing a little script with the help of the BeautifulSoup HTML parser and uTidyLib (HTML Tidy wrapper for Python).
Essentially what it does is fetch all the HTML files in a given directory (and its subdirectories), clean the code with Tidy (removes deprecated tags, changes the output to XHTML) and then BeautifulSoup removes a couple of things that I don't want in the files (because I'm stripping the files to the bare bones, just keeping layout information).
Finally, I want to remove all traces of layout tables (because the new layout will use CSS for positioning). Now, there are tables to lay out things on the page and tables to represent tabular data, but I think it would be too hard to make a script that finds out the difference.
My question, since I'm quite new to Python, is about what tool I should use to remove the table, tr and td tags, but not what's enclosed in them. I think BeautifulSoup isn't good for that because it removes what's enclosed as well.
You want to look at htmllib:  http://docs.python.org/lib/module-htmllib.html
I'm sorry, I should have pointed you to HTMLParser: http://docs.python.org/lib/module-HTMLParser.html
It's a bit more straightforward than the HTMLParser defined in htmllib. Everything I was talking about below pertains to the HTMLParser module and not the htmllib module.
If you've used a SAX parser for XML, it's similar. Your parser parses the file and every time it hits a tag, it runs a callback which you've defined. You can assign a default callback that simply prints out the tag as parsed, then a custom callback for each tag you want to clean up.
It took me a little time to wrap my head around it the first time I used it, but once you "get it" it's *really* powerful and really easy to implement.
Read the docs and play around a little bit, then if you have questions, post back and I'll see if I can dig up some examples I've written.
e.
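To make the callback model described above concrete, here is a minimal sketch that records each callback as the parser fires it. It assumes Python 3, where the HTMLParser module is named html.parser; the class name EventLogger and the sample input are made up for illustration and are not from the original message.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventLogger(HTMLParser):
    """Record each callback the parser fires, in order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<td><b>hi</b></td>")
logger.close()
for event in logger.events:
    print(event)
```

The events come out in document order: start td, start b, data hi, end b, end td. Dropping a tag while keeping its contents then amounts to doing nothing in the start and end callbacks for that tag.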
Is re the right module for that? Basically, if I write a loop that scans the text and tries to match every occurrence of a given regular expression, would that be a good idea?
Now, I'm quite new to the concept of regular expressions, but would it resemble something like this: re.compile("<table.*>")?
Thanks for the help.
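As a rough illustration of the regular expression question above: the pattern "<table.*>" is greedy, so ".*" runs to the last ">" on the line and the match swallows the enclosed content too. The sketch below compares it with a narrower pattern that matches only the table, tr and td tags themselves. The sample string and the variable names are assumptions for illustration, and even the narrower pattern can be fooled by a ">" inside an attribute value or by tags inside comments, which is one reason a parser is the safer tool.

```python
import re

html = "<table><tr><td>keep <b>this</b></td></tr></table>"

# Greedy: ".*" extends to the LAST ">" on the line, so a single match
# covers the whole string, contents included.
greedy = re.compile(r"<table.*>")
print(repr(greedy.sub("", html)))

# Narrower: match only an opening or closing table/tr/td tag.
# "[^>]*" stops at the first ">", and "\b" keeps "td" from matching
# longer tag names that merely start with the same letters.
layout_tag = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)
print(layout_tag.sub("", html))
```

The first substitution leaves an empty string; the second leaves "keep <b>this</b>".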
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

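try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2
import sys

class MyParser( HTMLParser ):
    def __init__( self ):
        HTMLParser.__init__( self )
        # Tags to drop; the text and tags nested inside them are kept.
        self.suppress = ( 'table', 'tr', 'td' )

    def handle_starttag( self, tag, attrs ):
        if tag not in self.suppress:
            # Re-emit the tag; attrs is a list of (name, value) pairs.
            sys.stdout.write( "<%s%s%s>" % ( tag,
                                             " " if attrs else "",
                                             " ".join( "%s='%s'" % pair for pair in attrs ) ) )

    def handle_data( self, data ):
        # Text between tags passes through unchanged. write() is used
        # instead of print so no extra spaces creep into the output.
        sys.stdout.write( data )

    def handle_endtag( self, tag ):
        if tag not in self.suppress:
            sys.stdout.write( "</%s>" % ( tag, ) )

parser = MyParser()
parser.feed( open( 'index.html' ).read() )
parser.close()  # flush any text still buffered by the parser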
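The message above says to override the "handle_*()" methods the attached code leaves out. One possible way to do that is sketched below, assuming Python 3's html.parser. The names TagStripper and strip_tags are hypothetical, and collecting the output in a list instead of printing it is a choice made here for the sake of the example, not part of the original script.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Drop selected tags but keep their contents (hypothetical name)."""

    def __init__(self, suppress=("table", "tr", "td")):
        # convert_charrefs=False so entities reach the handlers below
        # instead of being decoded and passed to handle_data().
        super().__init__(convert_charrefs=False)
        self.suppress = suppress
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.suppress:
            # get_starttag_text() returns the tag as written in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if tag not in self.suppress:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.suppress:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)   # named entity, e.g. &amp;

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)  # numeric reference, e.g. &#169;

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)  # e.g. <!DOCTYPE html>


def strip_tags(html):
    """Return html with the suppressed tags removed and everything else kept."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<table><tr><td><a href="x.html">A &amp; B</a><br/></td></tr></table>'))
```

Returning a string makes it easier to hand the result on to tidy afterwards, in line with the BeautifulSoup, parser, tidy sequence recommended above.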

