Re: Help with libxml2dom
On 19 Aug, 13:55, Nuno Santos nuno.hespan...@gmail.com wrote:
> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer for me.

[...]

> >>> table = body.firstChild
> >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)

You answer this yourself just below.

> >>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
> >>> table.nodeName
> u'table'

Yes, in the DOM, the child nodes of elements include text nodes, and even though one might regard the whitespace before the first child element and that appearing after the last child element as unimportant, the DOM keeps it around in case it really is important.

[...]

> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM that I am simply not familiar with, and I would
> very much appreciate it if anyone had the kindness to explain this to me.

Well, the nodes are actually there: they're whitespace used to provide the indentation in your example. I recommend using XPath to get actual elements:

table = body.xpath("*")[0] # get child elements and then select the first

Although people make a big song and dance about the DOM being a nasty API, it's quite bearable if you use it together with XPath queries.

Paul
--
http://mail.python.org/mailman/listinfo/python-list
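For readers without libxml2dom installed, the same behaviour can be reproduced with the standard library. The following is only a sketch under stated assumptions: it uses xml.dom.minidom instead of libxml2dom, the SOURCE string is a made-up, well-formed stand-in for teste.htm, and child_elements is a hypothetical helper that plays the role of the XPath query for child elements. Note that minidom reports text nodes as '#text' rather than 'text'.

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for the page discussed in the thread.
SOURCE = """<html>
<body>
  <table>
    <tr><td>8/15/2009</td></tr>
  </table>
</body>
</html>"""


def child_elements(node):
    """Return only the element children of node, skipping text nodes.

    Hypothetical helper: minidom has no xpath() method, so this filter
    stands in for selecting the child elements with XPath.
    """
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


doc = minidom.parseString(SOURCE)
body = doc.getElementsByTagName("body")[0]

# The first child is the indentation between <body> and <table>.
first = body.firstChild
print(repr(first.nodeName), repr(first.data))

# Filtering on node type gets the table directly.
table = child_elements(body)[0]
print(table.nodeName)
```

The filter gives the same node as body.firstChild.nextSibling here, but it does not depend on how the source happens to be indented.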
Help with libxml2dom
I have just started using libxml2dom to read html files and I have some questions I hope you guys can answer for me.

The page I am working on (teste.htm):

<html>
<head>
<title> Title </title>
</head>
<body bgcolor = 'F'>
<table>
  <tr bgcolor=#EE>
    <td nowrap=nowrap>
      <font size=2 face=Tahoma, Arial>
        <a name=1375048></a>
      </font>
    </td>
    <td nowrap=nowrap>
      <font size=-2 face=Verdana> 8/15/2009</font>
    </td>
  </tr>
</table>
</body>
</html>

>>> import libxml2dom
>>> foo = open('teste.htm', 'r')
>>> str1 = foo.read()
>>> doc = libxml2dom.parseString(str1, html=1)
>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)
>>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
>>> table.nodeName
u'table'
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font'
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'

It seems like sometimes there are some text elements 'hidden'. This is probably a standard in DOM that I am simply not familiar with, and I would very much appreciate it if anyone had the kindness to explain this to me.

Thanks.
Re: Help with libxml2dom
Nuno Santos wrote:
> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer for me.

[...]

> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM that I am simply not familiar with, and I would
> very much appreciate it if anyone had the kindness to explain this to me.

Without a schema or something similar, a parser can't tell if whitespace is significant or not. So if you have

<root>
  <child/>
</root>

you will have not 2, but 4 nodes: root, a text node containing a newline + 2 spaces, child, and again a text node with a newline.

You have to skip over those that you are not interested in, or use a different XML library such as ElementTree (e.g. in the form of lxml) that has a different approach to text nodes.

Diez
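The node count and the ElementTree alternative mentioned above can be checked with the standard library alone. This is a sketch, not the libxml2dom or lxml API: it assumes xml.dom.minidom as the DOM implementation and xml.etree.ElementTree as the ElementTree implementation, and the variable names are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# The three-line document from the message above.
SOURCE = "<root>\n  <child/>\n</root>"

# DOM view: root has three child nodes, so four nodes counting root itself.
dom_root = minidom.parseString(SOURCE).documentElement
dom_children = [(n.nodeName, getattr(n, "data", None)) for n in dom_root.childNodes]
print(dom_children)

# ElementTree view: iterating over an element yields child elements only.
et_root = ET.fromstring(SOURCE)
et_children = [child.tag for child in et_root]
print(et_children)

# The whitespace is not lost; it is stored as strings on the elements.
print(repr(et_root.text), repr(et_root[0].tail))
```

In the ElementTree model the whitespace ends up in the text and tail attributes instead of in separate nodes, so navigating from element to element never lands on a text node.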