Re: [PATCH] More slow performance in CDATA sections

neilg Wed, 12 Feb 2003 10:56:22 -0800

Hi Scott,

A strange thing, open source:  I created this bug over 6 months ago in an
attempt to fix a similar bug related to long comments.  There wasn't all
that much said until a little while back when someone bothered Andy about
it, so he put in a fix shortly after 2.3.0 came out.  Then just yesterday
someone else asked on xerces-j-user.  You'd think CDATA use was taking off
or something.  :)


Anyway, please do a build from CVS (or grab a nightly build from gump) and
let us know if the problem isn't fixed.

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




From:     "Scott Sanders" <ssanders@nextance.com>
Date:     02/12/2003 01:00 PM
Please respond to xerces-j-dev

To:       <[EMAIL PROTECTED]>
cc:       "Scott Sanders" <[EMAIL PROTECTED]>
Subject:  [PATCH] More slow performance in CDATA sections




Hi all,

This patch request is a little long-winded, but it needs to be to
explain the situation.  The short and sweet of it is that the existing
bug (13776) covers large CDATA sections, but does not cover this
particular 'edge case'.  This problem occurs in a 570KB file and rears
its ugly head by taking 30 seconds to parse into a DOM!  You heard right,
30 seconds.  By merely changing the content of the 570KB file (adjusting
what is actually in the CDATA section), the file parses in a mere 700ms.
That is a difference of more than 40x.

Just in case you think I'm insane, I have attached the test files that I
have used, and also some statistics in profiling, using OptimizeIt.

What it boils down to is how often XMLStringBuffer.append() is called.
When the CDATA section contains XML-esque characters ('<', '\n', etc.,
but mostly newlines), the append() method is called with x (usually 1-3)
characters at a time.  Since the algorithm in append() grows the buffer
by only x+32 each time it fills up, the number of array copies becomes
astoundingly large, and the file takes forever to parse.  When the
content in the CDATA section is changed to not look like XML, the
append() method is called much less often, causing fewer allocations
(thrashing).
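
To make the difference between the two growth policies concrete, here is a
rough stand-alone simulation.  It is only a sketch: the class GrowthSim, the
initial capacity of 32, and the workload of 285,000 two-character appends
(about 570K characters) are assumptions for illustration, not measurements
from Xerces.  It counts how many reallocations happen and how many existing
characters get copied under each policy.

```java
// Hypothetical stand-alone simulation; not part of Xerces.
class GrowthSim {
    // Assumed to match the "+32" growth increment described above.
    static final int DEFAULT_SIZE = 32;

    /**
     * Simulates `appends` calls of `chunk` characters each.
     * Returns { number of reallocations, total characters copied }.
     */
    static long[] simulate(int appends, int chunk, boolean doubling) {
        int capacity = DEFAULT_SIZE; // assumed initial capacity
        int length = 0;
        long reallocs = 0;
        long copied = 0;
        for (int i = 0; i < appends; i++) {
            if (length + chunk > capacity) {
                capacity = doubling
                        ? (capacity + chunk) * 2            // patched policy
                        : capacity + chunk + DEFAULT_SIZE;  // original policy
                copied += length; // existing content is copied to the new array
                reallocs++;
            }
            length += chunk;
        }
        return new long[] { reallocs, copied };
    }

    public static void main(String[] args) {
        int appends = 285_000; // illustrative workload, not a measured value
        int chunk = 2;
        long[] additive = simulate(appends, chunk, false);
        long[] doubling = simulate(appends, chunk, true);
        System.out.println("additive: reallocs=" + additive[0] + " copied=" + additive[1]);
        System.out.println("doubling: reallocs=" + doubling[0] + " copied=" + doubling[1]);
    }
}
```

Under these assumptions the additive policy reallocates thousands of times,
copying the whole buffer each time, while the doubling policy needs fewer
than twenty reallocations for the same input.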

I have attached a zip file which contains variations of the same file
with full formatting, no lines, no extra space, and no formatting, so
that the receiver can compare the variations of content. Note that these
times are with profiling turned on.  Just running the test standalone,
the high time is 30 seconds and the low time is 700ms.

Here are the profiling results for the attached content:
  Without allocation patch -
    Formatted - 37524ms @ 2989 invocations
    NoSpace - 36204ms @ 2908 invocations
    NoLines - 7160ms @ 249 invocations
    NoFormat - 6836ms @ 224 invocations

  With allocation patch -
    Formatted - 2440ms @ 1554 invocations
    NoSpace - 2360ms @ 1515 invocations
    NoLines - 2472ms @ 199 invocations
    NoFormat - 2404ms @ 124 invocations

Note: The above times specify the time spent in
org.apache.xerces.jaxp.DocumentBuilderImpl.parse() and the invocations
are of org.apache.xerces.util.XMLStringBuffer.append().

The following patch to XMLStringBuffer.append() changes the allocation
method to a simple doubling algorithm.  Note that the increase in
performance is anywhere from 3-16x!  Admittedly, this is a sort of
edge-case in Xerces, but parsing a 570KB file should NEVER take 30
seconds :)
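
The 3-16x figure can be checked against the profiling numbers above.  The
snippet below is just that arithmetic; the class and method names are made up
for illustration, and the timings are copied from the results listed earlier.

```java
// Hypothetical helper; the timings are the profiled parse() times quoted above.
class SpeedupTable {
    /** Ratio of time before the patch to time after the patch. */
    static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    public static void main(String[] args) {
        String[] names  = { "Formatted", "NoSpace", "NoLines", "NoFormat" };
        long[]   before = { 37524, 36204, 7160, 6836 };
        long[]   after  = { 2440, 2360, 2472, 2404 };
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": "
                    + String.format("%.1f", speedup(before[i], after[i])) + "x");
        }
    }
}
```

The formatted variants come out at roughly 15x and the unformatted ones at
just under 3x, which is where the range comes from.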

Also note that this may cause one to think about refactoring
XMLEntityScanner to not call append() so many times, and I am also
willing to look into that if necessary.

The patch:

Index: java/src/org/apache/xerces/util/XMLStringBuffer.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/util/XMLStringBuffer.java,v
retrieving revision 1.4
diff -u -r1.4 XMLStringBuffer.java
--- java/src/org/apache/xerces/util/XMLStringBuffer.java        29 Jan 2002 01:15:18 -0000      1.4
+++ java/src/org/apache/xerces/util/XMLStringBuffer.java        12 Feb 2003 17:57:58 -0000
@@ -171,7 +171,7 @@
     public void append(String s) {
         int length = s.length();
         if (this.length + length > this.ch.length) {
-            char[] newch = new char[this.ch.length + length + DEFAULT_SIZE];
+            char[] newch = new char[(this.ch.length + length) * 2];
             System.arraycopy(this.ch, 0, newch, 0, this.length);
             this.ch = newch;
         }
@@ -188,7 +188,7 @@
      */
     public void append(char[] ch, int offset, int length) {
         if (this.length + length > this.ch.length) {
-            char[] newch = new char[this.ch.length + length + DEFAULT_SIZE];
+            char[] newch = new char[(this.ch.length + length) * 2];
             System.arraycopy(this.ch, 0, newch, 0, this.length);
             this.ch = newch;
         }
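
For anyone who wants to try the patched growth rule outside of Xerces, here
is a minimal stand-alone sketch.  MiniStringBuffer and ensureRoom() are
hypothetical names, and the initial capacity of 32 is an assumption; only the
(capacity + needed) * 2 rule is taken from the patch.  Both append() variants
share one growth helper so the policy lives in a single place.

```java
// Hypothetical stand-alone class for illustration; not the Xerces XMLStringBuffer.
class MiniStringBuffer {
    private static final int DEFAULT_SIZE = 32; // assumed initial capacity
    private char[] ch = new char[DEFAULT_SIZE];
    private int length = 0;

    // Grows the backing array using the patch's (capacity + needed) * 2 rule.
    private void ensureRoom(int extra) {
        if (length + extra > ch.length) {
            char[] newch = new char[(ch.length + extra) * 2];
            System.arraycopy(ch, 0, newch, 0, length);
            ch = newch;
        }
    }

    public void append(String s) {
        int n = s.length();
        ensureRoom(n);
        s.getChars(0, n, ch, length); // copy the string's chars into the buffer
        length += n;
    }

    public void append(char[] src, int offset, int n) {
        ensureRoom(n);
        System.arraycopy(src, offset, ch, length, n);
        length += n;
    }

    public int capacity() {
        return ch.length;
    }

    @Override
    public String toString() {
        return new String(ch, 0, length);
    }
}
```

Appending one or two characters at a time, as the scanner does for
newline-heavy CDATA content, then only reallocates when the buffer has
roughly doubled since the previous reallocation.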


Thanks,
Scott Sanders

Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away. - Antoine de Saint-Exupery


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

#### BigCDataSection.zip has been removed from this note on February 12
2003 by Neil Graham


