Great, thanks so much for posting that. It's worked a treat and I'm getting HTML files with the list of h2 tags I was looking for. Here's the code just to share, what a relief :) :

...............................
from BeautifulSoup import BeautifulSoup
import re

# Parse the saved page and collect every <h2> tag
page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
myTagSearch = str(soup.findAll('h2'))

# Write the results out as text
myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()
del myTagSearch
...............................

I do have two other small queries that I wonder if anyone can help with.

Firstly, I'm getting the character "[" at the start and "]" at the end of the output, along with "," between each tag in the listing. This seems like normal behaviour but I can't find a way to strip them out. There's an example of stripping comments and I understand the example, but what's the *reference* to the above '[', ']' and ',' elements?

For the comma I tried:

    soup.find(text=",").replaceWith("")

but that throws this error:

    AttributeError: 'NoneType' object has no attribute 'replaceWith'

Again working with the 'Removing Elements' example I tried:

    soup = BeautifulSoup("you are a banana, banana, banana")
    a = str(",")
    comments = soup.findAll(text=",")
    [",".extract() for "," in comments]

But if I'm doing 'import beautifulSoup' this gives me the following error:

    soup = BeautifulSoup("you are a banana, banana, banana")
    TypeError: 'module' object is not callable

and "import beautifulSoup from BeautifulSoup" does nothing.

Secondly, in the above working code that is just pulling the h2 tags - how the blazes do I 'prettify' before writing to the file?

Thanks in advance!

Mark.

..................

On Mar 25, 6:51 pm, Jorge Godoy <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] writes:
> > Hi All,
> >
> > Apologies for the newbie question but I've searched and tried all
> > sorts for a few days and I'm pulling my hair out ;[
> >
> > I have a 'reference' HTML file and a 'test' HTML file from which I
> > need to pull 10 strings, all of which are contained within <h2> tags,
> > e.g.:
> > <h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>
> >
> > Once I've found the 10 I'd like to write them to another 'results'
> > html file.
> > Perhaps a 'reference results' and a 'test results' file.
> > From where I would then like to 'diff' the results to see if they
> > match.
> >
> > Here's the rub: I cannot find a way to pull those 10 strings so I can
> > save them to the results pages.
> > Can anyone please suggest how this can be done?
> >
> > I've tried allsorts but I've been learning Python for 1 week and just
> > don't know enough to mod example scripts it seems. don't even get me
> > started on python docs.. ayaa ;] Please feel free to teach me to suck
> > eggs because it's all new to me :)
> >
> > Thanks in advance,
> >
> > Mark.
>
> Take a look at BeautifulSoup. It is easy to use and works well with some
> malformed HTML that you might find ahead.
>
> --
> Jorge Godoy <[EMAIL PROTECTED]>

--
http://mail.python.org/mailman/listinfo/python-list
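
On the first query in the message above: the "[", "]" and "," are most likely not text inside the parsed document at all, which would explain why soup.find(text=",") comes back as None. They appear when str() is called on the whole list that findAll returns. The following is only a minimal sketch of that behaviour. It uses plain strings as stand-ins for the tags so that it runs without BeautifulSoup installed; the name h2_tags and the second, made-up link are assumptions for illustration and are not from the original code.

```python
# Stand-ins for the Tag objects that soup.findAll('h2') would return.
# Plain strings are an assumption made for illustration only; the second
# entry is a made-up example.
h2_tags = [
    '<h2 class="r"><a href="http://www.someplace.com/">Go Someplace</a></h2>',
    '<h2 class="r"><a href="http://www.example.com/">Go Elsewhere</a></h2>',
]

# str() on a list renders the list itself, so the text gains
# surrounding brackets and a comma between the items.
as_list_text = str(h2_tags)

# Converting each item on its own and joining with newlines
# produces one tag per line with no list punctuation.
as_lines = "\n".join(str(tag) for tag in h2_tags)

print(as_list_text[0], as_list_text[-1])
print(as_lines)
```

With real tags, the same join should work directly on the result of soup.findAll('h2'). Each tag also has a prettify() method, so joining tag.prettify() for each tag, instead of str(tag), may be one way to get indented output before writing to the file.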