BeautifulSoup is the standard response. I think lxml will not work very well unless the HTML is extremely well formed, but I could be wrong.
For what you describe I would suggest developing seat-of-the-pants heuristics -- just get the page using httplib and then use string.find liberally. I've had at least three consulting gigs solving this problem using various techniques, and the general problem is quite difficult. But if you are trying to parse just a few pages in simple ways, developing special-purpose heuristics is pretty easy (until they redesign the pages, which they will do every so often).

Best of luck,
-- Aaron Watters

btw: If you have lots of money to spend on this, my former client connotate.com does this sort of scraping (and I developed some of the code).

--- On Mon, 2/21/11, James Mills <prolo...@shortcircuit.net.au> wrote:

From: James Mills <prolo...@shortcircuit.net.au>
Subject: Re: [Web-SIG] Extracting web data
To: "web-sig" <web-sig@python.org>
Date: Monday, February 21, 2011, 7:07 PM

On Mon, Feb 21, 2011 at 2:21 PM, Deb Midya <debmi...@yahoo.com> wrote:

Hi Python web-sig users,

Thanks in advance and I am new to web-sig.

I am using Python 2.6 on Windows XP.

May I ask you to assist me with the following, please?

I would like to extract web data from a site (http://finance.yahoo.com, for example). The data may include Historical Prices, Key Statistics, News & Info, Headlines, etc. for a list of codes (such as WOW, ....; these are codes for company IDs).

I am trying to automate the extraction of data. Is there any Python module or any assistance please?

Once again, thank you very much for the time you have given.

You might want to look into using either the lxml or BeautifulSoup modules.

cheers
James

--
-- James Mills
--
-- "Problems are solved by method"

-----Inline Attachment Follows-----

_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/arw1961%40yahoo.com
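
For anyone wanting a starting point, the "fetch the page, then use string.find liberally" heuristic described in the reply can be sketched roughly as below. This is only an illustration under stated assumptions, not code from any of the posters: it uses Python 3, where httplib is named http.client (on Python 2.6 the import would be `import httplib`), and the names fetch_page, extract_between, extract_field and the SAMPLE_HTML markup are all made up here. A real page's markup will differ, so the marker strings would have to be chosen by inspecting the actual HTML.

```python
import http.client  # Python 3 name for the httplib module


def fetch_page(host, path):
    """Fetch a page over HTTPS and return its body as text (hypothetical helper)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()


def extract_between(text, start_marker, end_marker, pos=0):
    """Return (value, next_pos) for the text between two markers.

    Returns (None, -1) if either marker is missing, which is the usual
    symptom of the page having been redesigned.
    """
    start = text.find(start_marker, pos)
    if start == -1:
        return None, -1
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None, -1
    return text[start:end].strip(), end + len(end_marker)


# Made-up markup for illustration only; it is not the markup of any real site.
SAMPLE_HTML = """
<table>
<tr><td class="label">Prev Close:</td><td class="value"> 27.10 </td></tr>
<tr><td class="label">Open:</td><td class="value"> 27.25 </td></tr>
</table>
"""


def extract_field(html, label):
    """Find a label, then take the first 'value' cell that follows it."""
    label_pos = html.find(label)
    if label_pos == -1:
        return None
    value, _ = extract_between(html, '<td class="value">', "</td>", label_pos)
    return value


if __name__ == "__main__":
    # Works on the canned sample; fetch_page() would supply real HTML instead.
    print(extract_field(SAMPLE_HTML, "Prev Close:"))
    print(extract_field(SAMPLE_HTML, "Open:"))
```

Returning None when a marker is missing, rather than raising or guessing, makes it easier to notice when a redesign has broken the heuristic.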