Marc Tompkins, 06.06.2012 10:21:
> On Tue, Jun 5, 2012 at 11:22 PM, Stefan Behnel wrote:
>
>> You can do this:
>>
>>     connection = urllib2.urlopen(url)
>>     tree = etree.parse(connection, my_html_parser)
>>
>> Alternatively, use fromstring() to parse from strings:
>>
>>     page = urllib2.urlopen(url)
>>     pagecontents = page.read()
>>     html_root = etree.fromstring(pagecontents, my_html_parser)
>
> Thank you! fromstring() did the trick for me.
>
> Interestingly, your first suggestion - parsing straight from the connection
> without an intermediate read() - appears to create the tree successfully,
> but my first strip_tags() fails, with the error "ValueError: Input object
> has no document: lxml.etree._ElementTree".
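[For reference, a minimal sketch of the two variants being compared. The URL, the HTMLParser options and the tag names passed to strip_tags() are placeholders, not Marc's actual code.]

    import urllib2
    from lxml import etree

    # Placeholders -- the real URL, parser options and tag names are not
    # shown anywhere in the thread.
    url = "http://example.com/page.html"
    my_html_parser = etree.HTMLParser()

    # Variant that worked: read the response body into a string, parse it
    # with fromstring(), which returns the root Element, then strip the
    # unwanted tags in place.
    page = urllib2.urlopen(url)
    pagecontents = page.read()
    html_root = etree.fromstring(pagecontents, my_html_parser)
    etree.strip_tags(html_root, "script", "style")   # example tag names

    # Variant that failed for Marc: parse straight from the connection,
    # which returns an ElementTree rather than an Element.
    connection = urllib2.urlopen(url)
    tree = etree.parse(connection, my_html_parser)
    # etree.strip_tags(tree, "script", "style")
    # ^ per Marc's report, this call raised
    #   "ValueError: Input object has no document: lxml.etree._ElementTree"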
Weird. You may want to check the parser error log to see if it has any hint.

>> See the lxml tutorial.
>
> I did - I've been consulting it religiously - but I missed the fact that I
> was mixing strings with file-like IO, and (as you mentioned) the error
> message really wasn't helping me figure out my problem.

Yes, I think it could do better here. Reporting a parser error with an
"unprintable error message" would at least make it less likely that users
get diverted from the actual cause of the problem.

>> Also note that there's lxml.html, which provides an extended tool set
>> for HTML processing.
>
> I've been using lxml.etree because I'm used to the syntax, and because
> (perhaps mistakenly) I was under the impression that its parser was more
> resilient in the face of broken HTML - this page has unclosed tags all
> over the place.

Both use the same parser and share most of their API. lxml.html is mostly
just an extension to lxml.etree with special HTML tools.

Stefan
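[To make the error-log hint and the lxml.html point concrete, a small sketch; the broken-HTML string is made up for illustration, and the error log may well be empty if the parser recovered silently.]

    from lxml import etree, html

    # Made-up broken HTML with unclosed tags, just for illustration.
    broken = "<html><body><p>first paragraph<p>second<div>no closing tags"

    # lxml.etree and lxml.html drive the same underlying HTML parser; its
    # error_log records whatever the parser had to recover from.
    parser = etree.HTMLParser()
    root = etree.fromstring(broken, parser)
    for entry in parser.error_log:
        print entry.line, entry.message

    # lxml.html is a thin layer on top of lxml.etree: the elements it
    # returns support the same API (iter(), xpath(), strip_tags(), ...).
    doc = html.fromstring(broken)
    print [el.tag for el in doc.iter()]
    etree.strip_tags(doc, "div")   # example: works on lxml.html elements too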