DERBY-302 - Insert on Clob of 500k of data using streams takes long time. It takes 3 mins on a sun jvm1.4.2 and 20 seconds with ibm jvm 1.4.2. 
The following fix improves the performance of an insert into a 500k clob from 20 seconds to around 1 second.  Note that changing just the test program had already reduced the Sun JVM time from 3 minutes to around 20 seconds on average.

Currently in Derby, an insert on a clob using setCharacterStream materializes the entire stream into a char array, which is then sent to store for the insert.  (We should be able to stream here instead; I will file another JIRA issue for this and include all the information I learnt.)

Given this is how inserts for large clobs currently happen, the performance analysis is as follows:
-- A profiler run shows that most of the time is spent in SQLChar.readExternal, which is where the user's input stream is materialized into a char array.  The array grows gradually until the entire stream has been materialized.  The code snippet below shows by how much the array is grown each time readExternal realizes it has to read more bytes from the stream.
 
The dvd hierarchy for clob is: SQLClob (dvd) extends SQLVarchar extends SQLChar.

So in SQLChar.readExternal:

    int growby = in.available();
    if (growby < 64)
        growby = 64;

followed by an allocation and an arraycopy into the newly allocated array.

--  In the code snippet, 'in' is ReaderToUTF8Stream, the wrapper around the user's stream.  ReaderToUTF8Stream extends InputStream and does not override the available() method; per the spec, the default InputStream.available() returns 0.
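The effect of not overriding available() can be seen with a minimal sketch (the class name and the 1-byte char mapping are hypothetical simplifications; the real ReaderToUTF8Stream does a full UTF-8 conversion):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

// Minimal sketch of an InputStream wrapping a Reader that, like the
// pre-patch ReaderToUTF8Stream, does not override available().
class ReaderBackedStream extends InputStream {
    private final Reader reader;

    ReaderBackedStream(Reader reader) {
        this.reader = reader;
    }

    public int read() throws IOException {
        int c = reader.read();
        // Naive 1-byte mapping, just for the sketch.
        return (c == -1) ? -1 : (c & 0xFF);
    }
    // available() is inherited from InputStream and always returns 0.
}

public class AvailableDemo {
    public static void main(String[] args) throws IOException {
        InputStream in =
            new ReaderBackedStream(new StringReader("plenty of data here"));
        // Inherited InputStream.available() returns 0 even though
        // plenty of data remains on the stream.
        System.out.println(in.available()); // prints 0
    }
}
```

So any caller that sizes its buffers from available() falls back to its minimum growth every single time.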

-- Thus the array grows by only 64 bytes each time, which is obviously not performant.  For a 500k clob insert this means the allocate-and-arraycopy step happens ~8000 times.
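A back-of-the-envelope sketch (a hypothetical helper, not Derby code) confirms the count: with available() returning 0, every growth falls back to the minimum, giving exactly 8000 allocate-and-copy cycles for a 500k stream at 64 bytes, versus 125 at 4k:

```java
// Sketch: count how many grow-and-copy steps a fixed growby causes
// while draining a stream of totalBytes, assuming available() always
// returns 0 so every growth falls back to the minimum (as in DERBY-302).
public class GrowthCount {
    static int growSteps(int totalBytes, int growby) {
        int capacity = 0;
        int steps = 0;
        while (capacity < totalBytes) {
            capacity += growby; // allocate a new array and arraycopy the old one
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // 500k stream, 64-byte minimum growth: 8000 allocation+copy cycles.
        System.out.println(growSteps(500 * 1024, 64));   // prints 8000
        // Same stream with a 4k minimum growby: 125 cycles.
        System.out.println(growSteps(500 * 1024, 4096)); // prints 125
    }
}
```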

-- ReaderToUTF8Stream reads from the user's stream, does the UTF-8 conversion, and puts the result in a 4k byte array.  I think it is reasonable to use a 32k buffer for clobs.
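For sizing that buffer, note that Java's modified UTF-8 (the format DataOutputStream.writeUTF produces) encodes each char in 1 to 3 bytes, so the number of chars a fixed byte buffer holds varies with the data.  A rough sketch (hypothetical helper, not Derby code):

```java
// Sketch: per-char byte cost in Java's modified UTF-8, and how many
// chars a 32k byte buffer can hold in the best and worst cases.
public class Utf8Budget {
    static int utf8Bytes(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1; // ASCII (NUL takes 2 bytes in modified UTF-8)
        if (c <= 0x07FF) return 2;                // e.g. Latin-1 supplement, Greek, Cyrillic
        return 3;                                 // everything else in the BMP
    }

    public static void main(String[] args) {
        int buffer = 32 * 1024;
        // Worst case (all 3-byte chars) vs best case (all ASCII).
        System.out.println(buffer / 3); // prints 10922
        System.out.println(buffer / 1); // prints 32768
    }
}
```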

Although there seem to be more possible optimizations in this area, I prefer the incremental approach too :)  so this patch is a first step towards fixing clob insert performance in the current system.

Fix includes:
-- Enhanced the way the array is grown: keep the original 64-byte minimum for char (which seems reasonable given the type's upper limit), but override it to 4k for varchar and clob.
-- Override available() in ReaderToUTF8Stream to return a better estimate of how many bytes can be read.

svn stat
M      java\engine\org\apache\derby\impl\jdbc\ReaderToUTF8Stream.java
M      java\engine\org\apache\derby\iapi\services\io\LimitReader.java
M      java\engine\org\apache\derby\iapi\types\SQLChar.java
M      java\engine\org\apache\derby\iapi\types\SQLVarchar.java

-- ran derbyall ok with sun jvm.

--  I can add a test that compares times, but that is probably not the best approach here; it should ideally be part of a performance regression suite.

With the fix, times in seconds for one clob insert on my laptop (Windows, 1G RAM, 1.6GHz Intel Pentium M), as per the JIRA issue:

FileSize    ibm jvm 1.4.2    sun jvm 1.4.2    sun jvm 1.5
500k        0.9s             1.6s             1.7s
1M          2.1s             4s               5s
2M          3s               9s               11s
4M          7s               18s              22s


Without the fix, a 500k insert takes 3 mins with the Sun JVM and 20 seconds with the IBM JVM.
I will attach the test program along with the input files to the JIRA issue.
_________________________
Without this fix:  As I already mentioned in the JIRA comment for DERBY-302, changing the program attached to the JIRA entry to use a BufferedReader with a buffer size larger than the default (64k) brought the Sun JVM times down close to the IBM JVM times.  I also noticed that if I ran the test multiple times with multiple inserts, both the Sun and IBM JVMs took around 20 seconds for a 500k clob - presumably the JIT kicks in, and the OS cache may also be a factor.
________________________
Please review it and if there are no comments, can a committer please commit it.

Thanks,
Sunitha.
Index: java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java
===================================================================
--- java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java      
(revision 189773)
+++ java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java      
(working copy)
@@ -47,7 +47,7 @@
        ReaderToUTF8Stream(LimitReader reader)
        {
                this.reader = reader;
-               buffer = new byte[4096];
+               buffer = new byte[32768];// seems reasonable to do 32k reads 
                blen = -1;
        }
 
@@ -196,5 +196,19 @@
                reader.close();
        }
 
+    /**
+     * Return a better estimate of the number of bytes that can
+     * be read from this stream.
+     */
+    public int available()
+    {
+       int remainingBytes = reader.getLimit();
+       // this object buffers (32k - 6) bytes that can be read 
+       // and when that is finished it reads the next available bytes
+       // from the reader object 
+       // reader.getLimit() returns the remaining bytes available
+       // on this stream
+       return (32768 > remainingBytes ? remainingBytes : 32768);
+    }
 }
 
Index: java/engine/org/apache/derby/iapi/services/io/LimitReader.java
===================================================================
--- java/engine/org/apache/derby/iapi/services/io/LimitReader.java      
(revision 189773)
+++ java/engine/org/apache/derby/iapi/services/io/LimitReader.java      
(working copy)
@@ -119,6 +119,16 @@
                limitInPlace = true;
                return;
        }
+    
+    /**
+     * Return the number of bytes that can still be read from
+     * the stream without throwing an EOFException.
+     * @return the remaining bytes left to be read from the stream
+     */
+    public int getLimit()
+    {
+        return remainingBytes;
+    }
 
        /**
                Clear any limit set by setLimit. After this call no limit 
checking
Index: java/engine/org/apache/derby/iapi/types/SQLChar.java
===================================================================
--- java/engine/org/apache/derby/iapi/types/SQLChar.java        (revision 
189773)
+++ java/engine/org/apache/derby/iapi/types/SQLChar.java        (working copy)
@@ -641,6 +641,11 @@
         int utflen = in.readUnsignedShort();
 
         int requiredLength;
+        // minimum amount that is reasonable to grow the array
+        // when we know the array needs to grow by at least one
+        // byte, but we don't want to grow by one byte as that
+        // is not performant
+        int minGrowBy = growBy();
         if (utflen != 0)
         {
             // the object was not stored as a streaming column 
@@ -654,8 +659,8 @@
             // OR
             // The original string was a 0 length string.
             requiredLength = in.available();
-            if (requiredLength < 64)
-                requiredLength = 64;
+            if (requiredLength < minGrowBy)
+                requiredLength = minGrowBy;
         }
 
         char str[];
@@ -707,12 +712,26 @@
             if (strlen >= arrayLength) // the char array needs to be grown 
             {
                 int growby = in.available();
-
-                // We know at the array needs to be grown by at least one.
+                // We know that the array needs to be grown by at least one.
                 // However, even if the input stream wants to block on every
                 // byte, we don't want to grow by a byte at a time.
-                if (growby < 64)
-                    growby = 64;
+                // Note, for large data (clob > 32k), it is performant
+                // to grow the array by at least 4k rather than a small
+                // amount. Even better may be to grow by 32k, but that
+                // may be a little excessive for small data.
+                // hopefully in.available() will give a fair
+                // estimate of how much data can be read to grow the 
+                // array by larger and necessary chunks.
+                // This performance issue due to 
+                // the slow growth of this array was noticed since inserts
+                // on clobs were taking a really long time, as
+                // the array here previously grew by 64 bytes each time 
+                // till the stream was drained.  (DERBY-302)
+                // for char, growby 64 seems reasonable, but for varchar
+                // and clob 4k or 32k performs better, hence
+                // growBy() is overridden accordingly to ensure this
+                if (growby < minGrowBy)
+                    growby = minGrowBy;
 
                 int newstrlength = arrayLength + growby;
                 char oldstr[] = str;
@@ -804,6 +823,18 @@
         cKey = null;
     }
 
+    /**
+     * Returns the reasonable minimum amount by which the array
+     * can grow. See readExternal. When we know that the array
+     * needs to grow by at least one byte, it is not performant
+     * to grow by just one byte; instead this amount is used to
+     * provide a reasonable growby size.
+     * @return minimum reasonable growby size
+     */
+    protected int growBy()
+    {
+        return 64;  //seems reasonable for a char
+    }
        /**
         * @see Storable#restoreToNull
         *
Index: java/engine/org/apache/derby/iapi/types/SQLVarchar.java
===================================================================
--- java/engine/org/apache/derby/iapi/types/SQLVarchar.java     (revision 
189773)
+++ java/engine/org/apache/derby/iapi/types/SQLVarchar.java     (working copy)
@@ -182,4 +182,17 @@
        {
                return TypeId.VARCHAR_PRECEDENCE;
        }
+    
+    /**
+     * Returns the reasonable minimum amount by which the array
+     * can grow. See readExternal. When we know that the array
+     * needs to grow by at least one byte, it is not performant
+     * to grow by just one byte; instead this amount is used to
+     * provide a reasonable growby size.
+     * @return minimum reasonable growby size
+     */
+    protected int growBy()
+    {
+        return 4096;  //seems reasonable for a varchar or clob 
+    }
 }
