Sebastien Noel wrote:
> Hi,
>
> I'm writing a little script with the help of the BeautifulSoup HTML
> parser and uTidyLib (an HTML Tidy wrapper for Python).
>
> Essentially, it fetches all the HTML files in a given directory (and
> its subdirectories) and cleans the code with Tidy (removes deprecated
> tags, changes the output to XHTML), and then BeautifulSoup removes a
> couple of things that I don't want in the files (because I'm
> stripping the files to the bare bones, just keeping layout
> information).
>
> Finally, I want to remove all traces of layout tables (because the new
> layout will use CSS for positioning). Now, there are tables that lay
> out things on the page and tables that represent tabular data, but I
> think it would be too hard to write a script that tells the difference.
>
> My question, since I'm quite new to Python, is about what tool I should
> use to remove the table, tr and td tags, but not what's enclosed in
> them. I think BeautifulSoup isn't good for that because it removes
> what's enclosed as well.
>

You want to look at htmllib:

http://docs.python.org/lib/module-htmllib.html

If you've used a SAX parser for XML, it's similar. Your parser parses
the file, and every time it hits a tag, it runs a callback that you've
defined. You can assign a default callback that simply prints out the
tag as parsed, then a custom callback for each tag you want to clean up.

It took me a little time to wrap my head around it the first time I
used it, but once you "get it", it's *really* powerful and really easy
to implement.

Read the docs and play around a little bit; then, if you have
questions, post back and I'll see if I can dig up some examples I've
written.

e.

> Is re the right module for that? Basically, if I write a loop that
> scans the text and tries to match every occurrence of a given regular
> expression, would that be a good idea?
>
> Now, I'm quite new to the concept of regular expressions, but would it
> resemble something like this: re.compile("<table.*>")?
>
> Thanks for the help.
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor
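
For what it's worth, here is a minimal sketch of the callback approach
described in the reply above. It is not one of the examples the reply
offers to dig up. htmllib was removed in Python 3, so this sketch
assumes the standard library's html.parser.HTMLParser instead, which
follows the same "one callback per event" model. The names
TableStripper, LAYOUT_TAGS and strip_layout_tables are made up for
illustration, and the list of tags to drop is an assumption to adjust.

```python
from html.parser import HTMLParser

# Tags to drop while keeping whatever they enclose.
# Hypothetical list for illustration; adjust to taste.
LAYOUT_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th"}


class TableStripper(HTMLParser):
    """Re-emit the input as parsed, skipping only the layout tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; intact:
        # they reach handle_entityref/handle_charref instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.out = []

    # "Default" behaviour: write the tag back out as it was parsed,
    # unless it is one of the tags we want to remove.
    def handle_starttag(self, tag, attrs):
        if tag not in LAYOUT_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> arrive here.
        if tag not in LAYOUT_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in LAYOUT_TAGS:
            self.out.append("</%s>" % tag)

    # Everything that is not a tag is passed through untouched.
    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_layout_tables(html):
    """Return html with the layout tags removed but their contents kept."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_layout_tables() on the text of each file that Tidy has
already cleaned should leave the cell contents in place. On the regex
question: in "<table.*>" the ".*" is greedy, so on a single line it
matches from the first "<table" through the last ">" and would take the
enclosed content with it, which is one reason a parser is the safer
choice here.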