Dear libXML folks,

I'm using libXML to parse HTML. I was very happy with it until I found a web page that makes my parser crash, so I am writing to ask for your help.
This is the web page I am unable to parse successfully:

    http://www.dlc.fi/~hurmari/index96.html

Here's my parser code:

    int reentrantHTMLSAXParseMemory(const char *buffer, int size,
                                    xmlSAXHandlerPtr sax, void *user_data,
                                    char *debugURL)
    {
        int ret = 0;
        htmlParserCtxtPtr ctxt;

        ctxt = htmlCreateMemoryParserCtxt(buffer, size);
        if (ctxt == NULL)
            return -1;

        /* Install the caller's SAX handler and user data. */
        ctxt->validate = 0;
        ctxt->sax = sax;
        ctxt->userData = user_data;

        htmlParseDocument(ctxt);

        if (ctxt->wellFormed)
            ret = 0;
        else
            ret = -1;

        /* Detach the caller's handler so htmlFreeParserCtxt() does not free it. */
        if (sax != NULL)
            ctxt->sax = NULL;
        htmlFreeParserCtxt(ctxt);

        return ret;
    }

Under OS X, the crash trace looks like this:

    (gdb) bt
    #0 0x00007fff82aaed4d in szone_malloc_should_clear ()
    #1 0x00007fff82aaecea in malloc_zone_malloc ()
    ...
    #7 0x000000010000663f in _startElement (my callback)
    #8 0x00007fff828409b4 in htmlParseCharRef ()
    #9 0x00007fff82842270 in htmlParseElement ()
    (htmlParseElement repeated 2000+ times)
    #2038 0x00007fff82842af8 in htmlParseElement ()
    #2039 0x00007fff828430c8 in htmlParseDocument ()
    ...

The page in question contains a lot of <DD> tags, and <DD> is the last tag processed before the crash. The fact that htmlParseElement() is repeated 2000+ times is very suspicious: it looks like a stack overflow caused by runaway recursion.

What can I do to keep the parser from recursing like this on this page? I tried setting various flags on ctxt, with no luck.

Thanks for your help,
Giovanni
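One way the recursion hypothesis above could be checked is to count element nesting depth inside the SAX callbacks and flag the document once it passes some threshold. The following is only a minimal sketch under stated assumptions: it does not include the libXML headers (the xmlChar typedef is a stand-in), DepthTracker and the depth_* functions are hypothetical names that are not part of libXML, and it assumes the parser reports every nested element through matching startElement/endElement calls.

```c
#include <stddef.h>

/* Stand-in for libXML's xmlChar so the sketch compiles without the library. */
typedef unsigned char xmlChar;

/* Hypothetical user_data structure: tracks element nesting depth. */
typedef struct {
    int depth;      /* number of currently open elements */
    int max_depth;  /* deepest nesting seen so far */
    int limit;      /* depth above which the document is flagged */
    int exceeded;   /* set to 1 once depth goes past limit */
} DepthTracker;

static void depth_init(DepthTracker *t, int limit)
{
    t->depth = 0;
    t->max_depth = 0;
    t->limit = limit;
    t->exceeded = 0;
}

/* Same shape as a SAX startElement callback: (user data, name, attributes). */
static void depth_start_element(void *ctx, const xmlChar *name,
                                const xmlChar **atts)
{
    DepthTracker *t = (DepthTracker *)ctx;
    (void)name;
    (void)atts;
    t->depth++;
    if (t->depth > t->max_depth)
        t->max_depth = t->depth;
    if (t->depth > t->limit)
        t->exceeded = 1;
}

/* Same shape as a SAX endElement callback: (user data, name). */
static void depth_end_element(void *ctx, const xmlChar *name)
{
    DepthTracker *t = (DepthTracker *)ctx;
    (void)name;
    if (t->depth > 0)   /* never go negative on unbalanced end tags */
        t->depth--;
}
```

In a real handler these two functions would be called from (or installed as) the startElement and endElement callbacks, with a DepthTracker passed as user_data. If max_depth climbs into the thousands on the page above, that would support the idea that each unclosed <DD> is being treated as nested inside the previous one.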
