On Sun, Apr 14, 2013 at 10:29 AM, <[email protected]> wrote:
> Hi all,
>
> I am trying to crawl the information from this link
>
> http://muaban.net/mua-ban-nha-quan-thu-duc-l5924-c32/quan-thu-duc-ban-nha1lau-2mt-truoc-sau-dg-ng-cong-tru-p-hiep-phu-q9-dt-4x21-5m--id15946781
>
> and this is the code I use
>
>> link = "http://muaban.net/mua-ban-nha-quan-thu-duc-l5924-c32/quan-thu-duc-ban-nha1lau-2mt-truoc-sau-dg-ng-cong-tru-p-hiep-phu-q9-dt-4x21-5m--id15946781"
>> xPath = "id('pC_DV_tableHeader')/x:tbody/x:tr[4]/x:td[3]"
>> namespace = {'x': 'http://www.w3.org/1999/xhtml'}
>>
>> tree = lxml.html.parse(link)
>> arrayContent = tree.xpath(xPath + "/text()", namespaces=namespace)
>>
>> if len(arrayContent):
>>     content = cgi.escape(arrayContent[0].encode("utf-8"))
>
>
> I use xPath checker add-on of firefox to read the xPath value and the
> namespace. However, when running the code, I always get the content empty.
> How can I solve this ?
>

Are you sure your xpath is correct? I'm not sure about that "id()" syntax.
Try:

//x:table[@id="pC_DV_tableHeader"]//x:tr[4]/x:td[3]

Another thing to note: the DOM presented by Firefox is the result of
Firefox parsing and potentially fixing up the HTML code. For instance,
there is no <tbody> in the actual HTML for that table; Firefox always
inserts a <tbody> if it is missing when parsing a table.

Does lxml also insert a <tbody> if there is not one? If it doesn't, then
your xpath would never work.

Cheers

Tom
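
To check the <tbody> question directly, and whether the x: namespace prefix matches anything under lxml.html, here is a minimal sketch. The SAMPLE_HTML table is a made-up stand-in for the real page (only the table id is taken from the thread), so treat it as an illustration under that assumption, not a test against the live site.

```python
import lxml.html

# Hypothetical stand-in for the page's markup: a table with no <tbody>,
# matching what the raw HTML is described as containing.
SAMPLE_HTML = """
<html><body>
<table id="pC_DV_tableHeader">
  <tr><td>a1</td><td>a2</td><td>a3</td></tr>
  <tr><td>b1</td><td>b2</td><td>b3</td></tr>
  <tr><td>c1</td><td>c2</td><td>c3</td></tr>
  <tr><td>d1</td><td>d2</td><td>target</td></tr>
</table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# XPath in the style copied from the browser: it expects a <tbody>
# and elements in the XHTML namespace.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
browser_style = root.xpath(
    "//x:table[@id='pC_DV_tableHeader']/x:tbody/x:tr[4]/x:td[3]/text()",
    namespaces=ns,
)

# XPath written against the raw markup: no namespace prefix, no <tbody>.
raw_style = root.xpath("//table[@id='pC_DV_tableHeader']//tr[4]/td[3]/text()")

# Did the parser add a <tbody> on its own?
has_tbody = bool(root.xpath("//tbody"))

print("browser-style result:", browser_style)
print("raw-markup result:", raw_style)
print("tbody inserted by lxml:", has_tbody)
```

If the browser-style query comes back empty while the raw-markup query finds the cell, the expression copied from Firefox needs both the x: prefixes and the tbody step removed before it can match what lxml.html actually builds.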

