On Fri, Aug 29, 2008 at 11:51:54AM +0200, Daniel Veillard wrote: > On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote: > > Hi, > > > > we got a report on the lxml list where someone tried to parse and > > serialise a file that contains 8,000,000 non-ASCII character references > > (‡), as in > > > > "<text>" + "‡" * 8000000 + "</text>" > > > > Parsing this is pretty fast, so that's not the problem, but serialising > > this document back to a "US-ASCII" encoding, i.e. re-encoding the > > non-ASCII characters as character references, is slow as hell. The user > > stopped the run after 12 hours at 100% CPU load. I tried this with xmllint > > and you can literally wait for each byte that arrives in the target file. > > > > Is there any reason why this is so, or does anyone have any insights what > > the problem may be here? This definitely sounds like a bug to me. > > Well that's an horribly crappy XML document. > I assume the output buffer grows lineary, so you end up realloc'ing all > the time and hit a quadratic behaviour as a result, somehow the > reallocation of the buffer size should probably use a doubling at each > step algorithm. Plus the escaping is done while the ASCII encoder stops.
I looked at it, and sorry the test is so broken i can't fix the behaviour at least not in a simple and fair way. I will explain: You have 1 document. It contains 16 megabytes of UTF characters. You ask libxml2 to convert this on output to US-ASCII. The way libxml2 does when asked to convert data to a given encoding is to use iconv or the set of internal conversion routines. It will pass the data by big chunks (as large as reasonable possible to speed up the convertion process) to the encoder. Libxml2 has no apriori knowledge of what an encoding may support in term of range of character, and people can plug their own encoding handlers which may be different from what libxml2 would have done using the internal handlers. So basically libxml2 has absolutely no way when passing a character to the output converter to know if the caracter will be accepted ... or not. So libxml2 relies on the encoder to tell him if the character could not be converted, that's what it does in xmlCharEncOutFunc() in encoding.c around line 1950, it feeds the converter, which may accept to convert only a subset of the string. In that case the set written is pushed to the output buffer, and the return value -2 allow to find that there is a character from the UTF-8 input buffer which could not be converted. Libxml2 then removes it from the input buffer which was shrinked to that location, and repushes it at the beginning, and retry the conversion. In your pathological case *none* of the character actually pass the converter. So the exception mechanism has to be used for each of the 8 million characters, each time shrinking and growing an input buffer and retrying again, to fail converting etc... The CPU is loosing all its time doing memory copies on the shrinking and growing of the buffers and of course the loop repeating tries to convert characters and failing, then the sprintf. The conversion to character references is a fallback mechanism designed for exceptional cases, this exemple pushes it to the limit with a single huge input text block and none of the characters in it passing though the encoder. I could try to fix your specific case by making the internal ASCII converter of libxml2 automatically accept out of ASCII range characters and convert them to character references in the output. That would be absolute cheating, if for some reason iconv ascii converter was used instead of the internal one this would fail in the same way. Anyway to make a long story short, I spent a few hours today fixing the problem by adding support for a new kind of buffers avoiding most of the memmoves needed when handling an encoding conversion exception like those. Bill reported that your test now parses and save in more or less the same time i.e. 7secs on one of his boxes. Grab 2.7.0 ! Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ [EMAIL PROTECTED] | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/ _______________________________________________ xml mailing list, project page http://xmlsoft.org/ [email protected] http://mail.gnome.org/mailman/listinfo/xml
