Re: [Tutor] Problem using lxml

Stefan Behnel Sun, 23 Aug 2015 01:12:39 -0700

Anthony Papillion wrote on 23.08.2015 at 01:16:
> from lxml import html
> import requests
> 
> page = requests.get("http://joplin.craigslist.org/search/w4m")
> tree = html.fromstring(page.text)


While requests has its merits, this can be simplified to

    tree = html.parse("http://joplin.craigslist.org/search/w4m")
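
One difference worth keeping in mind: html.fromstring() hands back an element, while html.parse() hands back an ElementTree wrapped around it. XPath works on both, but element-level helpers (cssselect() and text_content(), as far as I know) live on the element, so call getroot() first when you need them. A minimal sketch, using a made-up in-memory page instead of the real URL so that it runs without network access:

```python
import io

from lxml import html

# Hypothetical stand-in for the downloaded page.
page = io.BytesIO(
    b'<html><body><a class="hdrlnk" href="/1">First post</a></body></html>'
)

tree = html.parse(page)   # an ElementTree, not an element
root = tree.getroot()     # the <html> element inside that tree

# XPath can be evaluated on the tree as well as on the root element.
tree_titles = tree.xpath('//a[@class="hdrlnk"]/text()')
root_titles = root.xpath('//a[@class="hdrlnk"]/text()')

print(root.tag, tree_titles, root_titles)
```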


> titles = tree.xpath('//a[@class="hdrlnk"]/text()')
> try:
>     for title in titles:
>         print title

This only works as long as the link tags only contain plain text, no other
tags, because "text()" selects individual text nodes in XPath. Also, using
@class="hdrlnk" will not match link tags that use class="  hdrlnk  " or
class="abc hdrlnk other".
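
To make both points concrete, here is a small sketch on invented markup (the three links and their texts are made up for illustration). It compares the exact attribute match with a whitespace-tolerant XPath test on the class tokens, and "text()" with the element's text_content() method:

```python
from lxml import html

# Invented markup: three links that all carry the "hdrlnk" class,
# written in three different ways.
snippet = (
    '<div>'
    '<a class="hdrlnk" href="/1">Plain title</a>'
    '<a class="  hdrlnk  " href="/2">Padded class</a>'
    '<a class="abc hdrlnk other" href="/3">Title with <b>bold</b> part</a>'
    '</div>'
)
root = html.fromstring(snippet)

# Exact string comparison on @class: only the first link matches.
exact = root.xpath('//a[@class="hdrlnk"]/text()')

# "text()" returns the direct text nodes only, so the text inside <b>
# is missing and the rest arrives in two pieces.
pieces = root.xpath('//a[@href="/3"]/text()')

# Token-based class test in plain XPath, then the full text of each link.
links = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " hdrlnk ")]'
)
titles = [link.text_content().strip() for link in links]

print(exact)
print(pieces)
print(titles)
```

The token-based expression is the usual XPath 1.0 idiom for matching one class name among several; it needs nothing beyond lxml itself.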

If you want to be on the safe side, I'd use cssselect instead and then
serialise the complete text content of the link tag to a string, i.e.

    from lxml.etree import tostring

    for link_element in tree.cssselect("a.hdrlnk"):
        title = tostring(
            link_element,
            method="text", encoding="unicode", with_tail=False)
        print(title.strip())

Note that the "cssselect()" feature requires the external "cssselect"
package to be installed. "pip install cssselect" should handle that.


> except:
>     pass

Oh, and bare "except:" clauses are generally frowned upon because they can
easily hide bugs by also catching unexpected exceptions. Better be explicit
about the exception type(s) you want to catch.
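
A small sketch of how that goes wrong. Both functions below are hypothetical and contain the same deliberate typo ("titels" instead of "titles"); only the explicit version lets you notice it:

```python
def first_title_bare(titles):
    """Return the first title, or None if there is none."""
    try:
        return titels[0]   # deliberate typo: raises NameError
    except:                # catches everything, so the typo is hidden
        return None


def first_title_explicit(titles):
    """Same function, but only the expected error is handled."""
    try:
        return titels[0]   # same typo
    except IndexError:     # an empty list is the only case we expect
        return None


# Silently returns None although the list is not empty.
print(first_title_bare(["Some title"]))

# Here the bug surfaces as a NameError instead of being swallowed.
try:
    first_title_explicit(["Some title"])
except NameError as exc:
    print("bug found:", exc)
```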

Stefan


