In article <[EMAIL PROTECTED]>,
 "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
> 
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
> 
> I can't validate it and xml.dom.minidom.parseString won't work on it.
> 
> If this were just some teenager's web site I'd move on.  Is there any
> hope of avoiding regular expression hacks to extract the data from
> this page?

Valid XHTML is scarcer than hen's teeth. Luckily, someone else has 
already written the ugly regex parsing hacks for you. Try Connelly 
Barnes' HTMLData: 
http://oregonstate.edu/~barnesc/htmldata/ 

Or BeautifulSoup as others have suggested.
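
To show the general idea of a tolerant parser, here is a rough sketch
that uses only the standard library's html.parser rather than HTMLData
or BeautifulSoup (so this is a swapped-in technique, not either
library's API). The SAMPLE markup and its values are made up for
illustration, not taken from the MSN page, and the names CellCollector,
strict_parse_ok and extract_cells are hypothetical helpers. The strict
XML parser rejects the markup, while the lenient one still pulls out
the table cells:

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up, deliberately ill-formed markup (not from the real page):
# unquoted attribute values and missing </td> and </tr> end tags.
SAMPLE = """<table>
<tr><td class=label>Last Price<td>36.41
<tr><td class=label>P/E<td>21.3
</table>"""


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, tolerating missing end tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends the previous cell.
            self.cells.append("")
            self._in_cell = True
        elif tag in ("tr", "table"):
            self._in_cell = False

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_cell:
            self.cells[-1] += data


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def extract_cells(markup):
    """Return the stripped text of every <td> cell, in document order."""
    parser = CellCollector()
    parser.feed(markup)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    print(strict_parse_ok(SAMPLE))
    print(extract_cells(SAMPLE))
```

A real page will need more care than this (nested tables, scripts,
character encodings), which is where a purpose-built library earns its
keep.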

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more