Hi,

We are using libxml2 with good success, but we have hit a scalability problem with the reader API when large attributes are present. When an element carries a very large attribute, the time it takes to parse that element is essentially O(n^2) in the attribute size. The typical problematic file is an SVG document with inlined image data (e.g., a base64-encoded JPEG in a data URL within an href), where a single attribute can easily reach 600 K.
The problem appears to come from the fact that xmlTextReaderPushData() only feeds 512 bytes (CHUNK_SIZE) to xmlParseChunk() at a time. On each call to xmlParseChunk(), xmlParseGetLasts() is called to find the start and end of the element, which of course cannot be found until the whole element is loaded into the buffer (e.g., 600 K). So the loop is repeated, growing the buffer by just 512 bytes per iteration, and each time the entire buffer is rescanned looking for the '<' and '>'. This is already slow on a fast PC, but it becomes painfully slow on an embedded platform.

Do you think doubling the chunk size fed to xmlParseChunk() on each iteration of the while loop in xmlTextReaderPushData() would be a sane approach to lowering the complexity of parsing such documents?

Thanks,
Diego

--
Diego Santa Cruz, PhD
Technology Architect
_________________________________
SpinetiX S.A.
Rue des Terreaux 17
1003, Lausanne, Switzerland
T +41 21 341 15 50
F +41 21 311 19 56
[email protected]
http://www.spinetix.com
http://www.youtube.com/SpinetiXTeam
_________________________________

_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
