Thanks a lot for this. Someone on the comp.lang.python newsgroup also suggested using BeautifulSoup: hold on to the content of a table, for example, extract the table, then put the content back. That also seems like a good idea.
I will look at both possibilities.

Eric Brunson wrote:
>
> Man, the docs on the HTMLParser module are really sparse.
>
> Attached is some code I just whipped out that will parse an HTML
> file, suppress the output of the tags you mention and spew the HTML
> back out. It's just a rough thing, you'll still have to read the docs
> and make sure to expand on some of the things it's doing, but I think
> it'll handle 95% of what it comes across. Be sure to override all
> the "handle_*()" methods I didn't.
>
> My recommendation would be to shove your HTML through BeautifulSoup to
> ensure it is well formed, then run it through the HTML parser to do
> whatever you want to change it, then through tidy to make it look nice.
>
> If you wanted to take the time, you could probably write the entire
> tidy process in the parser. I got a fair ways there, but decided it
> was too long to be instructional, so I pared it back to what I've
> included.
>
> Hope this gets you started,
> e.
>
> Eric Brunson wrote:
>> Eric Brunson wrote:
>>
>>> Sebastien Noel wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm doing a little script with the help of the BeautifulSoup HTML
>>>> parser and uTidyLib (HTML Tidy wrapper for Python).
>>>>
>>>> Essentially what it does is fetch all the HTML files in a given
>>>> directory (and its subdirectories), clean the code with Tidy
>>>> (removes deprecated tags, changes the output to XHTML) and then
>>>> BeautifulSoup removes a couple of things that I don't want in the
>>>> files (because I'm stripping the files to the bare bones, just
>>>> keeping layout information).
>>>>
>>>> Finally, I want to remove all traces of layout tables (because the
>>>> new layout will use CSS for positioning). Now, there are tables to
>>>> lay out things on the page and tables to represent tabular data, but
>>>> I think it would be too hard to make a script that finds out the
>>>> difference.
>>>>
>>>> My question, since I'm quite new to Python, is about what tool I
>>>> should use to remove the table, tr and td tags, but not what's
>>>> enclosed in them. I think BeautifulSoup isn't good for that because
>>>> it removes what's enclosed as well.
>>>>
>>> You want to look at htmllib:
>>> http://docs.python.org/lib/module-htmllib.html
>>>
>>
>> I'm sorry, I should have pointed you to HTMLParser:
>> http://docs.python.org/lib/module-HTMLParser.html
>>
>> It's a bit more straightforward than the HTMLParser defined in
>> htmllib. Everything I was talking about below pertains to the
>> HTMLParser module and not the htmllib module.
>>
>>
>>> If you've used a SAX parser for XML, it's similar. Your parser
>>> parses the file and every time it hits a tag, it runs a callback
>>> which you've defined. You can assign a default callback that simply
>>> prints out the tag as parsed, then a custom callback for each tag
>>> you want to clean up.
>>>
>>> It took me a little time to wrap my head around it the first time I
>>> used it, but once you "get it" it's *really* powerful and really
>>> easy to implement.
>>>
>>> Read the docs and play around a little bit, then if you have
>>> questions, post back and I'll see if I can dig up some examples I've
>>> written.
>>>
>>> e.
>>>
>>>
>>>> Is re the right module for that? Basically, if I make an iteration
>>>> that scans the text and tries to match every occurrence of a given
>>>> regular expression, would it be a good idea?
>>>>
>>>> Now, I'm quite new to the concept of regular expressions, but would
>>>> it resemble something like this: re.compile("<table.*>")?
>>>>
>>>> Thanks for the help.
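
On the regular expression question quoted above, here is a small sketch of why re.compile("<table.*>") probably would not do what is wanted, together with one narrower pattern. The sample string and the names greedy, LAYOUT_TAG and stripped are made up for illustration and are not from the thread. The narrower pattern is only a rough approach: it assumes no attribute value contains a ">" character, and it knows nothing about nesting, comments or scripts, which is a large part of why a real parser was recommended.

```python
import re

# Hypothetical sample input, invented for this illustration.
html = "<table border='0'><tr><td>Hello</td><td>World</td></tr></table>"

# ".*" is greedy: it runs to the LAST ">" in the string, so a single
# match swallows the cell contents along with the tags.
greedy = re.compile(r"<table.*>")
greedy_match = greedy.search(html).group(0)

# Narrower pattern: one opening or closing table/tr/td tag, ending at
# the first ">" after the tag name. Tags such as th or tbody are not
# covered and would need to be added to the alternation.
LAYOUT_TAG = re.compile(r"</?(?:table|tr|td)\b[^>]*>", re.IGNORECASE)
stripped = LAYOUT_TAG.sub("", html)

print(greedy_match == html)  # the greedy match is the whole string
print(stripped)              # tags removed, contents kept
```
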
>>>> _______________________________________________
>>>> Tutor maillist - Tutor@python.org
>>>> http://mail.python.org/mailman/listinfo/tutor
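
For reference, the callback approach described in the thread can be sketched roughly as follows. This is not the code that was attached to Eric's message (the attachment is not reproduced here); it is a minimal illustration written against Python 3's html.parser module, which is assumed to be the successor of the HTMLParser module linked above. The names LayoutTableStripper, SKIP and strip_layout_tables are hypothetical. As the thread warns, not every handle_*() method is overridden (processing instructions, for example, are dropped).

```python
from html.parser import HTMLParser


class LayoutTableStripper(HTMLParser):
    """Re-emit parsed HTML, leaving out the tags listed in SKIP."""

    # Tag names taken from the question; hypothetical default, extend as needed.
    SKIP = {"table", "tr", "td"}

    def __init__(self):
        # convert_charrefs=False so that entities are passed to
        # handle_entityref/handle_charref and can be written back as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.SKIP:
            # get_starttag_text() returns the start tag as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if tag not in self.SKIP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser hands over the tag name in lower case.
        if tag not in self.SKIP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_layout_tables(html):
    """Return html with the SKIP tags removed but their contents kept."""
    parser = LayoutTableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical sample input, invented for this illustration.
sample = "<div><table><tr><td><p>Fish &amp; chips</p></td></tr></table><br/></div>"
result = strip_layout_tables(sample)
print(result)
```

Like the original question, this sketch removes every table, tr and td tag, so it does not distinguish layout tables from tables that hold real tabular data.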