Re: [Tutor] Remove certain tags in html files

Sebastien Noel Fri, 27 Jul 2007 12:55:58 -0700

Thanks a lot for this.

Someone on the comp.lang.python newsgroup also suggested using 
BeautifulSoup: hold on to the content of a table, for example, 
extract the table, then put the content back. That also seems like a 
good idea.


I will look at both possibilities.

Eric Brunson wrote:
>
> Man, the docs on the HTMLParser module are really sparse.
>
> Attached is some code I just whipped up that will parse an HTML 
> file, suppress the output of the tags you mention and spew the HTML 
> back out.  It's just a rough thing; you'll still have to read the docs 
> and expand on some of the things it's doing, but I think it'll handle 
> 95% of what it comes across.  Be sure to override all the 
> "handle_*()" methods I didn't.
>
> My recommendation would be to shove your HTML through BeautifulSoup to 
> ensure it is well formed, then run it through the HTML parser to make 
> whatever changes you want, then through tidy to make it look nice.
>
> If you wanted to take the time, you could probably write the entire 
> tidy process in the parser.  I got a fair ways there, but decided it 
> was too long to be instructional, so I pared it back to what I've 
> included.
>
> Hope this gets you started,
> e.
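
The attachment mentioned above is not preserved in this archive. As a stand-in, here is a minimal sketch of the same idea: a parser that re-emits everything it sees except the table, tr and td tags. It assumes Python 3, where the HTMLParser module is named html.parser; the names TagStripper and strip_tags, and the default list of tags to drop, are made up for illustration and are not taken from the original code.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the listed tags but keeping their contents."""

    def __init__(self, strip=("table", "tbody", "tr", "td", "th")):
        # convert_charrefs=False makes entities reach the handlers below,
        # so they can be written back out the way they came in.
        super().__init__(convert_charrefs=False)
        self.strip = set(strip)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.strip:
            # get_starttag_text() returns the start tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in self.strip:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so end tags come out lowercase.
        if tag not in self.strip:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def strip_tags(html):
    """Return html with the layout tags removed and everything else kept."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_tags() on a page should leave paragraphs, links and text in place while the table scaffolding around them disappears. Because data tables would be stripped too, the list of tags passed as strip may need adjusting per file.
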
>
> Eric Brunson wrote:
>> Eric Brunson wrote:
>>  
>>> Sebastien Noel wrote:
>>>      
>>>> Hi,
>>>>
>>>> I'm doing a little script with the help of the BeautifulSoup HTML 
>>>> parser and uTidyLib (HTML Tidy wrapper for Python).
>>>>
>>>> Essentially what it does is fetch all the HTML files in a given 
>>>> directory (and its subdirectories), clean the code with Tidy 
>>>> (removes deprecated tags, changes the output to XHTML), and then 
>>>> BeautifulSoup removes a couple of things that I don't want in the 
>>>> files (because I'm stripping the files to the bare bones, just 
>>>> keeping layout information).
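
The first step described above, collecting every HTML file under a directory and its subdirectories, could look roughly like this. It is only a sketch: the function name find_html_files and the choice of extensions are assumptions, not part of the original script.

```python
import os


def find_html_files(root, extensions=(".html", ".htm")):
    """Yield the path of every HTML file under root, subdirectories included."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare case-insensitively so INDEX.HTM is found as well.
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)
```

Each path it yields can then be handed to the Tidy and BeautifulSoup steps in turn.
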
>>>>
>>>> Finally, I want to remove all trace of layout tables (because the 
>>>> new layout will use CSS for positioning). Now, there are tables that 
>>>> lay things out on the page and tables that represent tabular data, 
>>>> but I think it would be too hard to write a script that can tell 
>>>> the difference.
>>>>
>>>> My question, since I'm quite new to Python, is about what tool I 
>>>> should use to remove the table, tr and td tags, but not what's 
>>>> enclosed in them. I think BeautifulSoup isn't good for that because 
>>>> it removes what's enclosed as well.
>>>>             
>>> You want to look at htmllib:  
>>> http://docs.python.org/lib/module-htmllib.html
>>>       
>>
>> I'm sorry, I should have pointed you to HTMLParser:  
>> http://docs.python.org/lib/module-HTMLParser.html
>>
>> It's a bit more straightforward than the HTMLParser defined in 
>> htmllib.  Everything I was talking about below pertains to the 
>> HTMLParser module and not the htmllib module.
>>
>>  
>>> If you've used a SAX parser for XML, it's similar.  Your parser 
>>> parses the file and every time it hits a tag, it runs a callback 
>>> which you've defined.  You can assign a default callback that simply 
>>> prints out the tag as parsed, then a custom callback for each tag 
>>> you want to clean up.
>>>
>>> It took me a little time to wrap my head around it the first time I 
>>> used it, but once you "get it" it's *really* powerful and really 
>>> easy to implement.
>>>
>>> Read the docs and play around a little bit, then if you have 
>>> questions, post back and I'll see if I can dig up some examples I've 
>>> written.
>>>
>>> e.
>>>
>>>      
>>>> Is re the right module for that? Basically, if I write a loop 
>>>> that scans the text and tries to match every occurrence of a given 
>>>> regular expression, would that be a good idea?
>>>>
>>>> Now, I'm quite new to the concept of regular expressions, but would 
>>>> it resemble something like this: re.compile("<table.*>")?
>>>>
>>>> Thanks for the help.
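
For what it's worth, the pattern in the question would match too much: ".*" is greedy, so it runs on to the last ">" on the line and takes the content with it. The sketch below shows the difference with a narrower pattern. The sample string and the name layout_tag are made up for illustration, and a regular expression like this can still be fooled by a ">" inside an attribute value or by tag-like text in comments and scripts, which is one reason a real parser is the safer route.

```python
import re

html = '<table border="0"><tr><td>cell</td></tr></table>'

# Greedy: ".*" extends to the LAST ">" on the line, so the whole
# string matches and the cell content is removed along with the tags.
greedy = re.compile(r"<table.*>")
greedy_result = greedy.sub("", html)

# Narrower: "[^>]*" stops at the first ">", so only a single opening
# or closing table/tbody/tr/td/th tag is matched each time.
layout_tag = re.compile(r"</?(?:table|tbody|tr|td|th)\b[^>]*>", re.IGNORECASE)
stripped = layout_tag.sub("", html)

print(repr(greedy_result))
print(repr(stripped))
```

With the sample string, the greedy pattern leaves nothing behind, while the narrower one leaves just the cell text.
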
>>>> _______________________________________________
>>>> Tutor maillist  -  Tutor@python.org
>>>> http://mail.python.org/mailman/listinfo/tutor
>>>>             
>>
>
