Hello,

> I have a large text file (1GB or so) with structure similar to the
> html example below.
>
> I have to extract content (text between div and tr tags) from this
> file and put it into a spreadsheet or a database - given my limited
> python knowledge I was going to try to do this with regex pattern
> matching.
>
> Would someone be able to provide pointers regarding how do I approach
> this? Any code samples would be greatly appreciated.

The ultimate tool for handling HTML is
http://www.crummy.com/software/BeautifulSoup/ where you can do stuff
like:

    # BeautifulSoup 4 import; older releases used
    # "from BeautifulSoup import BeautifulSoup"
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html)
    # Calling the soup is shorthand for soup.find_all(...)
    for div in soup("div", {"class": "special"}):
        ...

Not sure how fast it is, though. There is also the htmllib module that
comes with Python (replaced by html.parser in Python 3); it might do
the job as well, and maybe a bit faster.

If the file is well-formed XML (XHTML, for example) and you need some
speed, have a look at xml.sax.

HTH,
--
Miki <[EMAIL PROTECTED]>
http://pythonwise.blogspot.com
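
P.S. For a file of that size a streaming parser may be the safer bet, since it never holds the whole document in memory. Below is a minimal sketch of that idea, assuming Python 3 and the standard html.parser module rather than htmllib. The names DivTrExtractor and extract_to_csv are made up for illustration, and the rule that text goes to the innermost open div or tr is my assumption about what you want, not something from your example.

```python
import csv
import io
from html.parser import HTMLParser


class DivTrExtractor(HTMLParser):
    """Hypothetical helper: report the text inside each <div> and <tr>."""

    TARGETS = ("div", "tr")

    def __init__(self, on_record):
        super().__init__(convert_charrefs=True)
        self.on_record = on_record  # called as on_record(tag, text)
        self.stack = []             # open targets: (tag, [text pieces])

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGETS:
            self.stack.append((tag, []))
        elif tag in ("td", "th") and self.stack:
            # Keep the text of neighbouring table cells apart.
            self.stack[-1][1].append(" ")

    def handle_data(self, data):
        # Assumption: text belongs to the innermost open <div>/<tr> only.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            tag, pieces = self.stack.pop()
            text = " ".join("".join(pieces).split())  # collapse whitespace
            if text:
                self.on_record(tag, text)


def extract_to_csv(src, dst, chunk_size=1 << 20):
    """Read HTML from text file object src in chunks; write CSV rows to dst."""
    writer = csv.writer(dst)
    writer.writerow(["tag", "text"])
    parser = DivTrExtractor(lambda tag, text: writer.writerow([tag, text]))
    # feed() accepts partial input, so only one chunk is buffered at a time.
    for chunk in iter(lambda: src.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()
```

To try it on a real file, something like the following should work (the file names are placeholders): open the input with open("big.html", encoding="utf-8", errors="replace") and the output with open("out.csv", "w", newline=""), then pass both to extract_to_csv. The resulting CSV opens directly in a spreadsheet; for a database you could swap the callback for one that does inserts.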