[PATCH] More slow performance in CDATA sections

Scott Sanders Wed, 12 Feb 2003 09:59:44 -0800

Hi all,

This patch request is a little long-winded, but it needs to be to
explain the situation.  In short, the existing bug (13776) covers large
CDATA sections but does not cover this particular edge case.  The
problem shows up in a 570KB file that takes 30 seconds to parse into a
DOM!  You heard right, 30 seconds.  By merely changing the content of
the CDATA section in that 570KB file, the parse time drops to a mere
700ms, a difference of more than 40x.


Just in case you think I'm insane, I have attached the test files that I
have used, and also some statistics in profiling, using OptimizeIt.

What it boils down to is how often XMLStringBuffer.append() is called.
When the CDATA section contains XML-esque characters ('<', '\n', etc.,
but mostly newlines), append() is called with only x (usually 1-3)
characters at a time.  Since append() grows the buffer by just x+32
characters each time it fills up, and every growth copies the whole
existing buffer, the number of array copies becomes astoundingly large
and the file takes forever to parse.  When the content of the CDATA
section is changed to not look like XML, append() is called much less
often, causing fewer allocations (less thrashing).
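
To make the difference between the two growth rules concrete, here is a
rough simulation.  It is only a sketch: GrowthDemo and its methods are
hypothetical names (not Xerces code), and it assumes a single buffer
that starts at 32 chars and receives the whole content in 2-char
appends.  It tracks sizes only and counts how many reallocations and
copied chars each rule would cause.

```java
// Hypothetical simulation; GrowthDemo is not part of Xerces.
final class GrowthDemo {
    // Matches the "x+32" growth described above.
    static final int DEFAULT_SIZE = 32;

    /** Counters for one simulated run. */
    static final class Stats {
        long reallocations; // how many times a new array would be allocated
        long charsCopied;   // total chars System.arraycopy would move
    }

    /**
     * Simulates appending totalChars characters in chunks of chunkSize.
     * Only sizes are tracked; no real arrays are allocated.
     */
    static Stats simulate(int totalChars, int chunkSize, boolean doubling) {
        Stats stats = new Stats();
        long capacity = DEFAULT_SIZE; // assumed initial capacity
        long length = 0;
        while (length < totalChars) {
            long chunk = Math.min(chunkSize, totalChars - length);
            if (length + chunk > capacity) {
                capacity = doubling
                        ? (capacity + chunk) * 2            // patched rule
                        : capacity + chunk + DEFAULT_SIZE;  // original rule
                stats.reallocations++;
                stats.charsCopied += length; // existing content is copied over
            }
            length += chunk;
        }
        return stats;
    }

    public static void main(String[] args) {
        int total = 570 * 1024; // roughly the size of the file in this report
        for (boolean doubling : new boolean[] {false, true}) {
            Stats s = simulate(total, 2, doubling);
            System.out.printf("%s: %d reallocations, %d chars copied%n",
                    doubling ? "doubling" : "additive",
                    s.reallocations, s.charsCopied);
        }
    }
}
```

Under these assumptions the additive rule reallocates once every few
appends and copies an amount of data that grows quadratically with the
content size, while the doubling rule needs only a handful of
reallocations.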

I have attached a zip file which contains variations of the same file
with full formatting, no lines, no extra space, and no formatting, so
that the receiver can compare the variations of content. Note that these
times are with profiling turned on.  Just running the test standalone,
the high time is 30 seconds and the low time is 700ms.

Here are the profiling results for the attached content:
  Without allocation patch -
    Formatted - 37524ms @ 2989 invocations
    NoSpace - 36204ms @ 2908 invocations
    NoLines - 7160ms @ 249 invocations
    NoFormat - 6836ms @ 224 invocations

  With allocation patch -
    Formatted - 2440ms @ 1554 invocations
    NoSpace - 2360ms @ 1515 invocations
    NoLines - 2472ms @ 199 invocations
    NoFormat - 2404ms @ 124 invocations

Note: The above times specify the time spent in
org.apache.xerces.jaxp.DocumentBuilderImpl.parse() and the invocations
are of org.apache.xerces.util.XMLStringBuffer.append().

The following patch to XMLStringBuffer.append() changes the allocation
method to a simple doubling algorithm.  Note that the increase in
performance is roughly 3x to 15x on the profiled times above!
Admittedly, this is a sort of edge case in Xerces, but parsing a 570KB
file should NEVER take 30 seconds :)
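
For readers who want to try the doubling rule outside of Xerces, here is
a minimal standalone sketch.  DoublingCharBuffer is a hypothetical class
with an assumed initial capacity of 32; it only mirrors the growth
expression used in the patch and is not the actual XMLStringBuffer.

```java
// Hypothetical standalone class; it mirrors the patched growth rule
// but is not Xerces code.
final class DoublingCharBuffer {
    private char[] ch = new char[32]; // assumed initial capacity
    private int length;

    /** Grows to (capacity + needed) * 2 when the new chars do not fit. */
    private void ensureRoom(int count) {
        if (length + count > ch.length) {
            char[] newch = new char[(ch.length + count) * 2];
            System.arraycopy(ch, 0, newch, 0, length);
            ch = newch;
        }
    }

    /** Appends count chars of src starting at offset. */
    void append(char[] src, int offset, int count) {
        ensureRoom(count);
        System.arraycopy(src, offset, ch, length, count);
        length += count;
    }

    /** Appends all chars of s. */
    void append(String s) {
        int count = s.length();
        ensureRoom(count);
        s.getChars(0, count, ch, length);
        length += count;
    }

    int length() { return length; }

    int capacity() { return ch.length; }

    @Override
    public String toString() { return new String(ch, 0, length); }
}
```

The trade-off is memory: after a growth step the array can be about
twice as large as the content it holds, and for extremely large buffers
the (capacity + needed) * 2 expression could overflow an int.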

Also note that this may cause one to think about refactoring
XMLEntityScanner to not call append() so many times, and I am also
willing to look into that if necessary.

The patch:

Index: java/src/org/apache/xerces/util/XMLStringBuffer.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/util/XMLStringBuffer.java,v
retrieving revision 1.4
diff -u -r1.4 XMLStringBuffer.java
--- java/src/org/apache/xerces/util/XMLStringBuffer.java        29 Jan 2002 01:15:18 -0000      1.4
+++ java/src/org/apache/xerces/util/XMLStringBuffer.java        12 Feb 2003 17:57:58 -0000
@@ -171,7 +171,7 @@
     public void append(String s) {
         int length = s.length();
         if (this.length + length > this.ch.length) {
-            char[] newch = new char[this.ch.length + length + DEFAULT_SIZE];
+            char[] newch = new char[(this.ch.length + length) * 2];
             System.arraycopy(this.ch, 0, newch, 0, this.length);
             this.ch = newch;
         }
@@ -188,7 +188,7 @@
      */
     public void append(char[] ch, int offset, int length) {
         if (this.length + length > this.ch.length) {
-            char[] newch = new char[this.ch.length + length + DEFAULT_SIZE];
+            char[] newch = new char[(this.ch.length + length) * 2];
             System.arraycopy(this.ch, 0, newch, 0, this.length);
             this.ch = newch;
         }


Thanks,
Scott Sanders 

Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away. - Antoine de Saint-Exupery

BigCDataSection.zip
