I have just started using libxml2dom to read HTML files, and I have some questions that I hope you can answer.

The page I am working on (teste.htm):
<html>
 <head>
   <title>
     Title
   </title>
 </head>
 <body bgcolor = 'FFFFF'>
   <table>
     <tr bgcolor="#EEEEEE">
       <td nowrap="nowrap">
<font size="2" face="Tahoma, Arial"> <a name="1375048"></a> </font>
       </td>
       <td nowrap="nowrap">
         <font size="-2" face="Verdana"> 8/15/2009</font>
       </td>
     </tr>
   </table>
 </body>
</html>

>>> import libxml2dom
>>> foo = open('teste.htm', 'r')
>>> str1 = foo.read()
>>> doc = libxml2dom.parseString(str1, html=1)
>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text'  # Why? Shouldn't this be the table? (1)
>>> table = body.firstChild.nextSibling  # Why does this work? Is there a hidden text node? (2)
>>> table.nodeName
u'table'
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font'
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'


It seems that there are sometimes 'hidden' text nodes between the elements. This is probably standard DOM behaviour that I am simply not familiar with, and I would very much appreciate it if someone could explain it to me.
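
As a cross-check, here is a minimal sketch that uses the standard library's xml.dom.minidom instead of libxml2dom, on a cut-down, well-formed fragment of the page. The fragment and the helper names describe_children and element_children are made up for illustration, and I am assuming both libraries follow the same DOM convention here. It prints each child's data, which suggests that the 'hidden' nodes hold the newlines and indentation between the tags. Note that minidom names these nodes '#text' where libxml2dom reports 'text'.

```python
from xml.dom import minidom, Node

# Cut-down, well-formed fragment modelled on the table in teste.htm
# (hypothetical test input, not the full page).
SNIPPET = """<body>
  <table>
    <tr><td>8/15/2009</td></tr>
  </table>
</body>"""


def describe_children(node):
    """Return (nodeName, text data or None) for each direct child of node."""
    return [
        (child.nodeName,
         child.data if child.nodeType == Node.TEXT_NODE else None)
        for child in node.childNodes
    ]


def element_children(node):
    """Return only the element children of node, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


body = minidom.parseString(SNIPPET).documentElement
children = describe_children(body)
first_element = element_children(body)[0]

for name, data in children:
    # repr() makes the whitespace inside the text nodes visible.
    print(name, repr(data))
print(first_element.nodeName)
```

Filtering on nodeType, as element_children does, avoids having to guess where a firstChild.nextSibling hop is needed.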

Thanks.