On Thu, May 10, 2012 at 08:17:25PM -0700, Conrad Irwin wrote:
> Hi Veillard and all,
>
> Firstly, thanks for libxml: it's awesome!
>
> I noticed recently that libxml was taking a surprisingly long time to perform
> some operations (many minutes instead of milliseconds), and so I did some
> digging. It turns out that the problem was caused by the realloc()ing done in
> xmlNodeAddContentLen(), which can be called many (many) times when assigning
> some content into a node.
>
> For background, I'm dealing with XML that contains emails; these can have
> large attachments (~6MB) which are base-64 encoded, line-wrapped at 78 chars,
> and each line ends with . This means that xmlNodeAddContentLen() is being
> called about 200,000 times, and so there are 200,000 reallocs of a 6MB string,
> which takes a while... (I put a synthetic example of this at
> https://gist.github.com/2656940)
>
> The attached patch works around that problem by using the existing buffer API
> to merge the strings together before even creating the text node; this keeps
> the number of realloc()s at a manageable level.
>
> I'd love feedback on the patch, and am happy to fix problems with it, or
> explore other solutions if you think that this is barking up the wrong tree :).
Hi Conrad,

that's interesting! I was initially afraid of a sudden explosion of
memory allocations for building a tree, since by default buffers tend to
"waste" memory by using doubling allocations, but that's not the case.
Running

  xmllint --noout doc/libxml2-api.xml

when compiled with memory debug produces

  paphio:~/XML -> cat .memdump
  MEMORY ALLOCATED : 0, MAX was 12756699

and without your patch the maximum was 12755657, i.e. the increase is minimal.
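
To make the trade-off concrete, here is a minimal sketch that counts realloc() calls for the two growth strategies under discussion: growing to the exact size on every append versus doubling the capacity. It does not use libxml2 at all; the type sketch_str and the functions sketch_append() and sketch_count_reallocs() are hypothetical names invented for this illustration, and the initial capacity of 64 bytes is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string, for illustration only; not a libxml2 type. */
typedef struct {
    char  *data;
    size_t len;      /* bytes in use, excluding the terminating NUL */
    size_t cap;      /* bytes allocated */
    size_t reallocs; /* number of realloc() calls made so far */
} sketch_str;

/* Append n bytes. If doubling is zero, grow to the exact size needed
 * (one realloc per append); otherwise double the capacity until it fits.
 * Overflow checks are omitted to keep the sketch short. */
static int sketch_append(sketch_str *s, const char *chunk, size_t n, int doubling)
{
    size_t need = s->len + n + 1;

    if (need > s->cap) {
        size_t cap = need;
        char *p;

        if (doubling) {
            cap = s->cap ? s->cap : 64; /* assumed starting capacity */
            while (cap < need)
                cap *= 2;
        }
        p = realloc(s->data, cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = cap;
        s->reallocs++;
    }
    memcpy(s->data + s->len, chunk, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Append `count` lines of 78 bytes each and return how many realloc()
 * calls were needed, or (size_t)-1 if an allocation failed. */
static size_t sketch_count_reallocs(size_t count, int doubling)
{
    char line[78];
    sketch_str s = {0};
    size_t i, result;

    memset(line, 'A', sizeof line);
    for (i = 0; i < count; i++) {
        if (sketch_append(&s, line, sizeof line, doubling) != 0) {
            free(s.data);
            return (size_t)-1;
        }
    }
    result = s.reallocs;
    free(s.data);
    return result;
}
```

With exact-fit growth the number of realloc() calls equals the number of appends, whereas with doubling it grows only logarithmically with the final size, at the price of up to half the capacity being unused.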
There is also the cost of creating the buffers all the time.
I need to read the code and check, but I may be interested in a hybrid
approach where we switch to a buffer only when the text node starts to
become too big (a 4k threshold would exclude nearly all usual types of
"document" usage, i.e. everything that is not blocks of data).
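
One possible shape for such a hybrid, sketched under stated assumptions: content is grown to the exact size while it stays small, and growth switches to doubling once the text would exceed the threshold. This is not the libxml2 implementation and does not use its buffer API; hybrid_text, hybrid_append(), hybrid_free() and SKETCH_THRESHOLD are hypothetical names, and 4096 simply mirrors the "4k" figure mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_THRESHOLD 4096 /* assumed cut-over point, from the "4k" idea */

/* Hypothetical text accumulator, for illustration only. */
typedef struct {
    char  *data;
    size_t len;      /* bytes in use, excluding the terminating NUL */
    size_t cap;      /* bytes allocated */
    int    buffered; /* 0: exact-fit growth, 1: doubling growth */
} hybrid_text;

/* Append n bytes. Small texts are grown to the exact size needed, so
 * ordinary documents waste no memory; once the text would exceed the
 * threshold, capacity is doubled instead to keep realloc() calls rare.
 * Overflow checks are omitted to keep the sketch short. */
static int hybrid_append(hybrid_text *t, const char *chunk, size_t n)
{
    size_t need = t->len + n + 1;

    if (need > SKETCH_THRESHOLD)
        t->buffered = 1; /* switch strategy; never switches back */

    if (need > t->cap) {
        size_t cap = need;
        char *p;

        if (t->buffered) {
            cap = t->cap ? t->cap : SKETCH_THRESHOLD;
            while (cap < need)
                cap *= 2;
        }
        p = realloc(t->data, cap);
        if (p == NULL)
            return -1;
        t->data = p;
        t->cap = cap;
    }
    memcpy(t->data + t->len, chunk, n);
    t->len += n;
    t->data[t->len] = '\0';
    return 0;
}

/* Release the storage and reset the accumulator to its empty state. */
static void hybrid_free(hybrid_text *t)
{
    free(t->data);
    t->data = NULL;
    t->len = 0;
    t->cap = 0;
    t->buffered = 0;
}
```

A caller would start from a zero-initialised hybrid_text, call hybrid_append() for each chunk of character data, and call hybrid_free() when done. Whether the per-node bookkeeping is cheaper than always creating a buffer is exactly the point that would need measuring.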
> P.S. Should I create a bug for this too?
Hum, yes for tracking, though I prefer to interact through the list
:-)
thanks!
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
[email protected] | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/