Re: Help with libxml2dom

2009-08-19 Thread Paul Boddie
On 19 Aug, 13:55, Nuno Santos <nuno.hespan...@gmail.com> wrote:
> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer me.

[...]

> >>> table = body.firstChild
> >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)

You answer this yourself just below.

> >>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
> >>> table.nodeName
> u'table'

Yes, in the DOM, the child nodes of elements include text nodes, and
even though one might regard the whitespace before the first child
element and that appearing after the last child element as
unimportant, the DOM keeps it around in case it really is important.
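
A minimal sketch of the same effect, assuming the standard library's
xml.dom.minidom as a stand-in for libxml2dom (minidom names text nodes
'#text' where the session above shows u'text'). The markup and the helper
element_children are made up for illustration:

```python
from xml.dom import Node, minidom

# Well-formed stand-in for the body of teste.htm (illustrative markup only).
SNIPPET = """<body>
  <table>
    <tr><td>8/15/2009</td></tr>
  </table>
</body>"""


def element_children(node):
    """Return only the element children of node, skipping text nodes and the like."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


body = minidom.parseString(SNIPPET).documentElement

first = body.firstChild
print(repr(first.nodeName), repr(first.data))  # the indentation before <table>
print(first.nextSibling.nodeName)              # the table element itself
print(element_children(body)[0].nodeName)      # same element, found without guessing
```

Filtering on nodeType avoids hard-coding how many whitespace nodes sit
in front of the element you want.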

[...]

> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM I simply am not familiar with this and I
> would very much appreciate if anyone had the kindness to explain me this.

Well, the nodes are actually there: they're whitespace used to provide
the indentation in your example. I recommend using XPath to get actual
elements:

table = body.xpath("*")[0] # get child elements and then select the first

Although people make a big song and dance about the DOM being a
nasty API, it's quite bearable if you use it together with XPath
queries.
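
For readers without libxml2dom to hand, here is a rough sketch of the same
idea using the limited XPath subset in the standard library's
xml.etree.ElementTree. That is a swapped-in library, not libxml2dom's own
xpath method, and the markup below is a tidied, illustrative version of
the page:

```python
import xml.etree.ElementTree as ET

# Tidied, well-formed stand-in for the body of teste.htm (illustrative only).
SNIPPET = """<body>
  <table>
    <tr>
      <td><font size="2"><a name="1375048"></a></font></td>
      <td><font size="-2">8/15/2009</font></td>
    </tr>
  </table>
</body>"""

body = ET.fromstring(SNIPPET)

# "*" selects child elements only, so no whitespace gets in the way.
table = body.findall("*")[0]

# A path expression walks several levels in one step.
fonts = body.findall("table/tr/td/font")

# ".//a" searches all descendants for <a> elements.
anchor = body.find(".//a")

print(table.tag)
print([font.get("size") for font in fonts])
print(anchor.get("name"))
```

Each query names the elements wanted, so the indentation in the file no
longer affects the result.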

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Help with libxml2dom

2009-08-19 Thread Nuno Santos
I have just started using libxml2dom to read HTML files, and I have some
questions I hope you can answer.


The page I am working on (teste.htm):
<html>
 <head>
   <title>
     Title
   </title>
 </head>
 <body bgcolor = 'F'>
   <table>
     <tr bgcolor="#EE">
       <td nowrap="nowrap">
         <font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
         </font>

       </td>
       <td nowrap="nowrap">
         <font size="-2" face="Verdana"> 8/15/2009</font>
       </td>
     </tr>
   </table>
 </body>
</html>

>>> import libxml2dom
>>> foo = open('teste.htm', 'r')
>>> str1 = foo.read()
>>> doc = libxml2dom.parseString(str1, html=1)
>>> html = doc.firstChild
>>> html.nodeName
u'html'
>>> head = html.firstChild
>>> head.nodeName
u'head'
>>> title = head.firstChild
>>> title.nodeName
u'title'
>>> body = head.nextSibling
>>> body.nodeName
u'body'
>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)
>>> table = body.firstChild.nextSibling #why this works? is there a text element hidden? (2)
>>> table.nodeName
u'table'
>>> tr = table.firstChild
>>> tr.nodeName
u'tr'
>>> td = tr.firstChild
>>> td.nodeName
u'td'
>>> font = td.firstChild
>>> font.nodeName
u'text' # (1)
>>> font = td.firstChild.nextSibling # (2)
>>> font.nodeName
u'font'
>>> a = font.firstChild
>>> a.nodeName
u'text' #(1)
>>> a = font.firstChild.nextSibling #(2)
>>> a.nodeName
u'a'


It seems that there are sometimes 'hidden' text elements. This is
probably a standard part of the DOM that I am simply not familiar with, and
I would very much appreciate it if someone could explain it to me.


Thanks.


Re: Help with libxml2dom

2009-08-19 Thread Diez B. Roggisch
Nuno Santos wrote:

> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer me.
 
[...]

> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM I simply am not familiar with this and I
> would very much appreciate if anyone had the kindness to explain me this.

Without a schema or something similar, a parser can't tell whether whitespace
is significant or not. So if you have

<root>
  <child/>
</root>

you will have not 2, but 4 nodes: root, a text node containing a newline + 2
spaces, child, and another text node containing a newline.

You have to skip over the nodes you are not interested in, or use a
different XML library such as ElementTree (e.g. in the form of lxml) that
takes a different approach to text nodes.
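
To make the count concrete, here is a small sketch using only the standard
library: xml.dom.minidom stands in for a DOM implementation such as
libxml2dom, and xml.etree.ElementTree for the ElementTree approach (lxml
is not needed for this). It parses the three-line example above both ways:

```python
from xml.dom import Node, minidom
import xml.etree.ElementTree as ET

XML = "<root>\n  <child/>\n</root>"

# DOM view: the whitespace shows up as sibling text nodes around <child/>.
root = minidom.parseString(XML).documentElement
dom_children = [
    (node.nodeName, node.data if node.nodeType == Node.TEXT_NODE else None)
    for node in root.childNodes
]
print(dom_children)

# ElementTree view: <child/> is the only child; the whitespace is stored as
# the parent's .text and the child's .tail instead of as separate nodes.
et_root = ET.fromstring(XML)
et_child = et_root[0]
print(len(et_root), repr(et_root.text), repr(et_child.tail))
```

With ElementTree, indexing or iterating over an element only ever yields
elements, which is why the "hidden" text nodes never get in the way there.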

Diez