On Sun, Apr 14, 2013 at 10:29 AM,  <[email protected]> wrote:
> Hi all,
>
> I am trying to crawl the information from this link
>
> http://muaban.net/mua-ban-nha-quan-thu-duc-l5924-c32/quan-thu-duc-ban-nha1lau-2mt-truoc-sau-dg-ng-cong-tru-p-hiep-phu-q9-dt-4x21-5m--id15946781
>
> and this is the code I use
>
>> link = "http://muaban.net/mua-ban-nha-quan-thu-duc-l5924-c32/quan-thu-duc-ban-nha1lau-2mt-truoc-sau-dg-ng-cong-tru-p-hiep-phu-q9-dt-4x21-5m--id15946781";
>> xPath =  "id('pC_DV_tableHeader')/x:tbody/x:tr[4]/x:td[3]"
>> namespace = {'x': 'http://www.w3.org/1999/xhtml'}
>>
>> tree = lxml.html.parse(link)
>> arrayContent = tree.xpath(xPath + "/text()", namespaces=namespace)
>>
>> if len(arrayContent):
>>      content = cgi.escape(arrayContent[0].encode("utf-8"))
>
>
> I use xPath checker add-on of firefox to read the xPath value and the
> namespace. However, when running the code, I always get the content empty.
> How can I solve this ?
>

Are you sure your xpath is correct? I'm not sure about that "id()" syntax. Try:

//x:table[@id="pC_DV_tableHeader"]//x:tr[4]/x:td[3]

Another thing to note: the DOM presented by Firefox is the result of
Firefox parsing and potentially fixing up the HTML code. For instance,
there is no <tbody> in the actual HTML for that table; Firefox always
inserts a <tbody> if it is missing when parsing a table. Does lxml
also insert a <tbody> if there is not one? If it doesn't, then your
xpath would never work.
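
A quick way to check this locally is to parse a small table with
lxml.html and see which expressions match. This is only a rough sketch:
SAMPLE_HTML and the cell_text() helper are made up for illustration and
stand in for the real page, which may be structured differently. It
assumes lxml is installed. It also tries the "x:" namespace prefix, to
see whether lxml.html puts elements in the XHTML namespace at all.

```python
import lxml.html

# Hypothetical stand-in for the real page: a table written without <tbody>.
SAMPLE_HTML = """
<html><body>
<table id="pC_DV_tableHeader">
  <tr><td>r1c1</td><td>r1c2</td><td>r1c3</td></tr>
  <tr><td>r2c1</td><td>r2c2</td><td>r2c3</td></tr>
  <tr><td>r3c1</td><td>r3c2</td><td>r3c3</td></tr>
  <tr><td>r4c1</td><td>r4c2</td><td>r4c3</td></tr>
</table>
</body></html>
"""

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def cell_text(doc, xpath, namespaces=None):
    """Return the first text node matched by xpath, or None if nothing matches."""
    found = doc.xpath(xpath + "/text()", namespaces=namespaces)
    return found[0].strip() if found else None


doc = lxml.html.fromstring(SAMPLE_HTML)

# Did the parser add a <tbody> the way Firefox does?
has_tbody = bool(doc.xpath("//tbody"))

# Expression that requires <tbody>, as copied from the Firefox DOM.
with_tbody = cell_text(doc, "//table[@id='pC_DV_tableHeader']/tbody/tr[4]/td[3]")

# Expression that does not care whether <tbody> is present.
without_tbody = cell_text(doc, "//table[@id='pC_DV_tableHeader']//tr[4]/td[3]")

# Same expression with the XHTML namespace prefix from the original code.
namespaced = cell_text(
    doc, "//x:table[@id='pC_DV_tableHeader']//x:tr[4]/x:td[3]", XHTML_NS
)

print("parser inserted <tbody>:", has_tbody)
print("with /tbody/:", with_tbody)
print("without tbody:", without_tbody)
print("with x: prefix:", namespaced)
```

If only the un-prefixed expression without tbody returns the cell, then
both the <tbody> step and the namespace prefix are worth dropping from
the xpath used against the live page.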

Cheers

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.

