[EMAIL PROTECTED] wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!

> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on.  Is there any
> hope of avoiding regular expression hacks to extract the data from this
> page?

The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import urllib
import libxml2dom

# Fetch the page and let libxml2's HTML parser build the DOM.
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.
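
As an illustration of the "other tools" remark, here is a rough sketch that
uses only the standard library of a current Python 3 (html.parser), not
libxml2dom. The SAMPLE markup, the CellCollector class and the two helper
functions are made up for this example and are not taken from the MSN page.
It only shows the general idea: a strict XML parser such as
xml.dom.minidom rejects ill-formed markup, while a tolerant HTML parser
can still pull the cell text out without regular expressions.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical ill-formed snippet (an assumption, not the real page):
# the <td> and <tr> elements are never closed and <br> is unterminated.
SAMPLE = (
    "<html><body><table><tr><td>Last Price<td>38.12<br>"
    "</table></body></html>"
)


class CellCollector(HTMLParser):
    """Collect the text of each table cell, tolerating missing end tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            # A new cell starts here, implicitly ending any previous one.
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def strict_parse_fails(markup):
    """Return True if xml.dom.minidom refuses to parse the markup."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return True
    return False


def extract_cells(markup):
    """Return the stripped text of every table cell found in the markup."""
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    return [cell.strip() for cell in collector.cells]
```

Calling strict_parse_fails(SAMPLE) shows the minidom failure on this
snippet, and extract_cells(SAMPLE) returns the cell texts as a list. A real
page would most likely need more careful handling (nested tables, scripts,
character encodings), so treat this as a starting point only.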

Paul

-- 
http://mail.python.org/mailman/listinfo/python-list
