Re: How to find to HTML strings and 'save' them?

John Nagle Sun, 25 Mar 2007 16:31:12 -0800

[EMAIL PROTECTED] wrote:
> Great, thanks so much for posting that. It's worked a treat and I'm
> getting HTML files with the list of h2 tags I was looking for. Here's
> the code just to share, what a relief :)   :
> ...............................
> from BeautifulSoup import BeautifulSoup
> import re
> 
> page = open("soup_test/tomatoandcream.html", 'r')
> soup = BeautifulSoup(page)
> 
> myTagSearch = str(soup.findAll('h2'))
> 
> myFile = open('Soup_Results.html', 'w')
> myFile.write(myTagSearch)
> myFile.close()
> 
> del myTagSearch
> ...............................
> 
> I do have two other small queries that I wonder if anyone can help
> with.
> 
> Firstly, I'm getting the following character: "[" at the start, "]" at
> the end of the code. Along with "," in between each tag line listing.
> This seems like normal behaviour but I can't find the way to strip
> them out.


Ah.  What you want is more like this:

from BeautifulSoup import BeautifulSoup

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags :     # for each H2 tag
     texts = htag.findAll(text=True) # find all text items within this h2
     # combine text items into clean string
     s = ' '.join(texts).strip() + '\n'
     myFile.write(s) # write each text from an H2 element on a line.

myFile.close()
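
For what it's worth, the "[", "]" and "," come from calling str() on the
whole list that findAll() returns: the list gets formatted as a list, with
each tag shown as its markup. Here is a minimal sketch of the effect. FakeTag
is a hypothetical stand-in for a real BeautifulSoup tag, used only so the
example runs without the library; it assumes, as with real tags, that the
repr of a tag is its markup.

```python
# Hypothetical stand-in for a BeautifulSoup tag: its repr() is its own markup.
# This class is for illustration only and is not part of BeautifulSoup.
class FakeTag(object):
    def __init__(self, markup):
        self.markup = markup

    def __repr__(self):
        return self.markup

    __str__ = __repr__


tags = [FakeTag('<h2>Tomato</h2>'), FakeTag('<h2>Cream</h2>')]

# str() on a list formats the list itself: brackets plus comma-separated items.
as_list = str(tags)

# Converting the items one by one gives just the markup, one tag per line.
as_lines = '\n'.join(str(tag) for tag in tags)

print(as_list)
print(as_lines)
```

The loop above sidesteps the problem in the same way: it writes each tag's
text separately instead of converting the whole list in one go.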

                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
