Re: extract news article from web

Steve Holden Thu, 23 Dec 2004 07:05:05 -0800

Zhang Le wrote:

Thanks for the hint. The XML-RPC service is great, but I want some
general techniques to parse news information in ordinary HTML pages.

Currently I'm looking at a script-based approach found at:
http://www.namo.com/products/handstory/manual/hsceditor/
Users can write simple templates to extract certain fields from a
web page. Unfortunately, it is not open source, so I cannot look
inside the black box. :-(

Zhang Le

That's a very large topic, and not one that I could claim to be expert on, so let's hope that others will pitch in with their favorite techniques. Otherwise it's down to providing individual parsers for each service you want to scan, and maintaining the parsers as each group of designers modifies their pages.
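To make the "individual parser per service" idea concrete, here is a minimal sketch using only the standard library's html.parser module. The page layout (headline in an h1, story text in a div with class "story") is invented for illustration, and the names ArticleParser, extract_article and SAMPLE are hypothetical; a real site would need its own rules, and those rules would have to be revised whenever the page design changes.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Pull the headline and story paragraphs out of one assumed page layout."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._target = None     # field that incoming text belongs to
        self._in_body = False   # True while inside <div class="story">

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._target = "title"
        elif tag == "div" and dict(attrs).get("class") == "story":
            self._in_body = True
        elif tag == "p" and self._in_body:
            # A new <p> starts a new paragraph even if the previous
            # one was never closed, which is common in real pages.
            self.paragraphs.append("")
            self._target = "p"

    def handle_endtag(self, tag):
        if tag == "div":
            # Assumption: the story div contains no nested divs.
            self._in_body = False
            self._target = None
        elif tag in ("h1", "p"):
            self._target = None

    def handle_data(self, data):
        if self._target == "title":
            self.title += data
        elif self._target == "p":
            self.paragraphs[-1] += data


def extract_article(html):
    """Return the title and a list of whitespace-normalised paragraphs."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "body": [" ".join(p.split()) for p in parser.paragraphs if p.strip()],
    }


# Invented sample page with an unclosed <p> and some navigation clutter.
SAMPLE = """
<html><body>
<h1>Storm closes airport</h1>
<div class="nav"><p>Home | World</p></div>
<div class="story">
<p>Flights were cancelled on Thursday.
<p>Services resume <b>Friday</b>.</p>
</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

Every rule in this sketch is tied to one layout, which is exactly the maintenance burden described above: each service needs its own parser class, and each redesign means revisiting it.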

You might want to look at BeautifulSoup, which is a module for extracting stuff from (possibly) irregularly-formed HTML.

regards
 Steve
--
Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
