bruce wrote:
> I'm using quick test with libxml2dom
>
> ===============
> import libxml2dom
>
> aa=libxml2dom.parseString(foo)
> ff=libxml2dom.toString(aa)
>
> print ff
> ===============
>
> ----------------------------------
> when i start, foo is:
> <html>
> <body>
> </body>
> </html>
>
> <html>
> <body>
> .
> .
> .
> </body>
> </html>
> -------------------------------
> when i print ff it's:
> <html>
> <body>
> </body>
> </html>
> -------------------------------
>
> so it's as if the parseString only reads the initial "html" tree. i've
> reviewed as much as i can find regarding libxml2dom to try to figure out how
> i can get it to read/parse/handle both html trees/nodes.
>
> i know, the html is maligned/screwed-up, but i can't seem to find any app
> (tidy/beautifulsoup) that can "know" which one of the html trees to throw
> out/remove!!
>
> technically, both html trees are valid, it's just that they both shouldn't
> be in the file!!!

What about splitting the string on "<html" and then parsing each part on its own?

Stefan
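
A minimal sketch of the split-on-"<html" idea, using only the standard library. The helper name split_html_documents and the sample string are made up for illustration, and the sketch assumes that "<html" only ever appears as an opening tag (not inside comments, scripts or attribute values). Anything before the first "<html", such as a doctype, is dropped.

```python
import re


def split_html_documents(text):
    """Split text into chunks, each starting at an opening <html tag.

    Hypothetical helper: assumes "<html" never occurs inside comments,
    scripts or attribute values. Text before the first <html is dropped.
    """
    # Find the offset of every opening <html tag, ignoring case.
    starts = [m.start() for m in re.finditer(r"<html\b", text, flags=re.IGNORECASE)]
    if not starts:
        # No <html tag at all: hand back the input unchanged.
        return [text]
    # Each chunk runs from one <html offset to the next (or to the end).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Sample input modelled on the original post: two html trees in one string.
foo = """<html>
<body>
</body>
</html>

<html>
<body>
<p>second</p>
</body>
</html>
"""

parts = split_html_documents(foo)
for part in parts:
    print(part)
    print("-----")
```

Each element of parts could then be handed to libxml2dom.parseString on its own, as in the original snippet, and whichever tree turns out to be the wanted one kept.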