Man, the docs on the HTMLParser module are really sparse.

Attached is some code I just whipped up that will parse an HTML file, suppress the output of the tags you mention, and spew the HTML back out. It's just a rough thing: you'll still have to read the docs and expand on some of the things it's doing, but I think it'll handle 95% of what it comes across. Be sure to override all the "handle_*()" methods I didn't.

My recommendation would be to shove your HTML through BeautifulSoup to ensure it is well formed, then run it through the html parser to do whatever you want to change it, then through tidy to make it look nice.

If you wanted to take the time, you could probably write the entire tidy process in the parser. I got a fair ways there, but decided it was too long to be instructional, so I pared it back to what I've included.

Hope this gets you started,
e.

Eric Brunson wrote:
Eric Brunson wrote:
Sebastien Noel wrote:
Hi,

I'm doing a little script with the help of the BeautifulSoup HTML parser and uTidyLib (an HTML Tidy wrapper for Python).

Essentially, what it does is fetch all the HTML files in a given directory (and its subdirectories), clean the code with Tidy (removing deprecated tags and changing the output to XHTML), and then use BeautifulSoup to remove a couple of things that I don't want in the files (because I'm stripping the files to the bare bones, just keeping layout information).

Finally, I want to remove all traces of layout tables (because the new layout will use CSS for positioning). Now, there are tables that lay things out on the page and tables that represent tabular data, but I think it would be too hard to make a script that can tell the difference.

My question, since I'm quite new to Python, is about what tool I should use to remove the table, tr and td tags, but not what's enclosed in them. I think BeautifulSoup isn't good for that because it removes what's enclosed as well.
You want to look at htmllib:  http://docs.python.org/lib/module-htmllib.html

I'm sorry, I should have pointed you to HTMLParser: http://docs.python.org/lib/module-HTMLParser.html

It's a bit more straightforward than the HTMLParser defined in htmllib. Everything I was talking about below pertains to the HTMLParser module and not the htmllib module.

If you've used a SAX parser for XML, it's similar. Your parser parses the file and every time it hits a tag, it runs a callback which you've defined. You can assign a default callback that simply prints out the tag as parsed, then a custom callback for each tag you want to clean up.
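
To make the callback idea concrete, here is a minimal sketch. It assumes Python 3, where the HTMLParser module lives at html.parser; the EventLogger class and its events list are made up for illustration and are not part of the library.

```python
from html.parser import HTMLParser  # Python 3 location of the old HTMLParser module


class EventLogger(HTMLParser):
    """Hypothetical helper: record each parser callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser meets.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()
for event in logger.events:
    print(event)
```

Each tag or run of text in the input produces exactly one call, in document order, which is all the "SAX-like" model amounts to.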

It took me a little time to wrap my head around it the first time I used it, but once you "get it" it's *really* powerful and really easy to implement.

Read the docs and play around a little bit, then if you have questions, post back and I'll see if I can dig up some examples I've written.

e.

Is re the right module for that? Basically, if I write a loop that scans the text and tries to match every occurrence of a given regular expression, would that be a good idea?

Now, I'm quite new to the concept of regular expressions, but would it resemble something like this: re.compile("<table.*>")?
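
For what it's worth, a quick experiment shows how the pattern above behaves. This is only a sketch in Python 3: the sample string and the narrower tag_only pattern are assumptions made up for illustration, and a regular expression like this will still trip over a ">" inside an attribute value, which is one reason a real parser is the safer tool.

```python
import re

# Made-up sample input for illustration.
html = "<table border='1'><tr><td>cell</td></tr></table><p>after</p>"

# The pattern from the question: ".*" is greedy, so a single match
# runs from "<table" to the LAST ">" on the line.
greedy = re.compile("<table.*>")
print(greedy.findall(html))

# Hypothetical narrower pattern: match only opening or closing
# table/tr/td tags, and stop at the first ">".
tag_only = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)
stripped = tag_only.sub("", html)
print(stripped)  # cell<p>after</p>
```

The greedy version swallows the whole line, content included, while the narrower one removes just the tags and leaves what they enclosed.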

Thanks for the help.
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

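from HTMLParser import HTMLParser  # Python 2 module; html.parser in Python 3

class MyParser( HTMLParser ):
    def __init__( self ):
        HTMLParser.__init__( self )
        # Tags to drop from the output; their contents are still emitted.
        self.suppress = ( 'table', 'tr', 'td' )

    def handle_starttag( self, tag, attrs ):
        # Re-emit every start tag that is not suppressed, with its attributes.
        if tag not in self.suppress:
            print "<%s%s%s>" % ( tag,
                                 " " if attrs else "",
                                 " ".join( "%s='%s'" % pair for pair in attrs ) ),

    def handle_data( self, data ):
        # Text between tags passes through unchanged.
        print data,

    def handle_endtag( self, tag ):
        # Closing tags are dropped for the same set of suppressed tags.
        if tag not in self.suppress:
            print "</%s>" % ( tag, ),

MyParser().feed( open( 'index.html' ).read() )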
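
For anyone reading this on Python 3, here is a rough port of the same idea. It is a sketch, not a drop-in replacement: the TagStripper class, the strip_tags helper, and the choice to collect output in a list rather than print it are all assumptions of mine, and it fills in a few of the "handle_*()" methods the script above leaves out.

```python
from html import escape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical port: re-emit HTML, dropping listed tags but keeping their contents."""

    def __init__(self, suppress=("table", "tr", "td")):
        # convert_charrefs=False reports entities such as &amp; through
        # handle_entityref/handle_charref so they can be passed through as written.
        super().__init__(convert_charrefs=False)
        self.suppress = frozenset(suppress)
        self.out = []

    def _tag(self, tag, attrs, closer=">"):
        parts = [tag]
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") arrive with value None.
            # Values arrive unescaped, so escape them again on the way out.
            parts.append(name if value is None
                         else '%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        if tag not in self.suppress:
            self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br />.
        if tag not in self.suppress:
            self.out.append(self._tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag not in self.suppress:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_tags(markup):
    """Return markup with the suppressed tags removed and everything else kept."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<table><tr><td><a href="x.html">A &amp; B</a><br /></td></tr></table>'))
```

Joining the pieces at the end avoids the extra spaces that the print statements with trailing commas insert between chunks in the Python 2 version.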
