[xml] Crash while SAX parsing HTML

Giovanni Donelli Wed, 08 Jul 2009 12:16:16 -0700

Dear libXML folks,

I'm using libXML to parse HTML, and I was very happy with it until I found
a web page that makes my parser crash. I am writing to ask for your help.


This is the webpage I am unable to parse successfully:
http://www.dlc.fi/~hurmari/index96.html

Here's my parser code:

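#include <libxml/HTMLparser.h>

int reentrantHTMLSAXParseMemory(const char *buffer, int size,
                                xmlSAXHandlerPtr sax, void *user_data,
                                char *debugURL)
{
    int ret = 0;
    htmlParserCtxtPtr ctxt;

    /* debugURL is not used in this excerpt. */
    ctxt = htmlCreateMemoryParserCtxt(buffer, size);
    if (ctxt == NULL)
        return -1;

    /* Route parser events to the caller's SAX handler and user data. */
    ctxt->validate = 0;
    ctxt->sax = sax;
    ctxt->userData = user_data;
    htmlParseDocument(ctxt);
    if (ctxt->wellFormed)
        ret = 0;
    else
        ret = -1;

    /* Detach the caller-owned handler so htmlFreeParserCtxt() does not free it. */
    if (sax != NULL)
        ctxt->sax = NULL;

    htmlFreeParserCtxt(ctxt);

    return ret;
}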

Under OS X, the crash trace looks like this:

(gdb) bt
#0  0x00007fff82aaed4d in szone_malloc_should_clear ()
#1  0x00007fff82aaecea in malloc_zone_malloc ()
...
#7  0x000000010000663f in _startElement (my callback)
#8  0x00007fff828409b4 in htmlParseCharRef ()
#9  0x00007fff82842270 in htmlParseElement ()

(htmlParseElement repeated 2000+ times)

#2038 0x00007fff82842af8 in htmlParseElement ()
#2039 0x00007fff828430c8 in htmlParseDocument ()
...

The page in question contains a lot of <DD> tags, and <DD> is the last tag
processed before the crash.

The fact that htmlParseElement() is repeated 2000+ times is very suspicious;
it looks like a stack overflow caused by runaway recursion.
What can I do to keep the parser from recursing like this on such a page? I
tried setting different flags on ctxt, with no luck.
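
One way to test the recursion theory would be to count element nesting in the
SAX callbacks. The following is only a minimal sketch under assumptions:
DepthGuard, depthGuardInit, guardStartElement, guardEndElement and the limit
value are hypothetical names made up for illustration, not libXML API. The two
callbacks merely have the same shape as startElementSAXFunc and
endElementSAXFunc (xmlChar is an unsigned char), so the sketch compiles
without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml2: tracks how deeply elements nest. */
typedef struct {
    int depth;      /* elements currently open */
    int max_depth;  /* deepest nesting seen so far */
    int limit;      /* assumed threshold; tune it to the available stack */
    int exceeded;   /* set to 1 once depth has gone past limit */
} DepthGuard;

void depthGuardInit(DepthGuard *g, int limit)
{
    assert(g != NULL);
    g->depth = 0;
    g->max_depth = 0;
    g->limit = limit;
    g->exceeded = 0;
}

/* Same shape as startElementSAXFunc: (void *ctx, const xmlChar *name,
   const xmlChar **atts). ctx is expected to be the DepthGuard. */
void guardStartElement(void *ctx, const unsigned char *name,
                       const unsigned char **atts)
{
    DepthGuard *g = (DepthGuard *)ctx;

    (void)name;
    (void)atts;
    assert(g != NULL);

    g->depth++;
    if (g->depth > g->max_depth)
        g->max_depth = g->depth;
    if (g->depth > g->limit)
        g->exceeded = 1;
}

/* Same shape as endElementSAXFunc: (void *ctx, const xmlChar *name). */
void guardEndElement(void *ctx, const unsigned char *name)
{
    DepthGuard *g = (DepthGuard *)ctx;

    (void)name;
    assert(g != NULL);

    /* Never go negative, even if end events outnumber start events. */
    if (g->depth > 0)
        g->depth--;
}
```

To try it, a DepthGuard would be passed as user_data and the handler's
startElement and endElement fields pointed at these functions (with casts to
the libXML function types). If max_depth climbs into the thousands on this
page, that would support the stack overflow explanation. What to do once
exceeded is set is left open here; whether a call such as xmlStopParser()
unwinds the HTML parser cleanly at that point is an assumption that would need
testing.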


Thanks for your help,
Giovanni

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml