[xml] Libxml2 HTML parsing

Stan Santiago Thu, 11 Nov 2010 13:49:29 -0800

Greetings.

I just started using the Libxml2 library for HTML parsing. One of the
requirements is to parse multiple HTML fragments separately and
combine the fragments into a single HTML document at the end. However,
<html> and <body> tags get added to each fragment that is processed.


I was looking at the thread at
http://mail.gnome.org/archives/xml/2010-January/msg00112.html and it seems like
this is exactly the same issue I have. I thought adding the
HTML_PARSE_NOIMPLIED option would resolve the issue, but that doesn't seem to
work. In fact, the htmlCtxtUseOptions(...) function doesn't
recognize the HTML_PARSE_NOIMPLIED option.

Here is part of the source code I've written. I'm using Libxml2 2.7.8,
the latest version. The following code is executed for
each HTML fragment that is processed.

...
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
int i = htmlCtxtUseOptions(parser, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                   HTML_PARSE_NOWARNING | HTML_PARSE_NOIMPLIED);
printf("HTML CTXT %d\n", i); // prints 8192, which corresponds to HTML_PARSE_NOIMPLIED
htmlParseChunk(parser, htmlFragment, strlen(htmlFragment), 0);
...
htmlNodeDump(buffer, doc, xmlDocGetRootElement(doc)); // adds <html> and <body> tags for each fragment!

Any pointers or suggestions on how to work around this issue?

Thanks!
Stan
