Re: Improving data insert performance

2010-01-14 Thread Nathan Boy
Hi Bryan,
Thanks for replying.  I'll try to answer some of your questions point by point:

- have you tried using in-memory databases rather than test durability?
I have done some testing with in-memory databases, but I get virtually
identical results.  This leads me to believe that my bottleneck is
probably not disk i/o.
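
(For clarity, "in-memory" here means Derby's memory: storage back end;
the database name in this sketch is just a placeholder.)

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class InMemoryTest {
        public static void main(String[] args) throws Exception {
            // Derby's in-memory back end: nothing touches the disk at all,
            // so if the numbers don't change, disk i/o isn't the bottleneck.
            Connection conn = DriverManager.getConnection(
                    "jdbc:derby:memory:scratchdb;create=true");
            // ... create the tables and run the same insert benchmark ...
            conn.close();
        }
    }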

- have you experimented with using a larger page size to minimize
  per-page overhead?
Yes.  Right now I am using a page size of 32768, which appears to
offer a modest improvement over 4096.  Sizes larger than that don't
seem to have much of an effect.
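
(For reference, I set the page size with the standard database property
call before creating the tables; the database name is a placeholder.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PageSizeSetup {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:derby:scratchdb;create=true");
            Statement s = conn.createStatement();
            // Database-wide property; it only affects tables and indexes
            // created afterwards, so it has to run before the CREATE TABLEs.
            s.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                    + "'derby.storage.pageSize', '32768')");
            s.close();
            conn.close();
        }
    }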

- have you experimented with giving a larger-than-default page cache?
Yes.  Right now I am using a page cache size of 2000.  Again, sizes
larger than that don't seem to have much of an effect.
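
(The page cache size is a system-wide property read when the engine
boots, so I set it before opening the first connection, roughly:)

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class PageCacheSetup {
        public static void main(String[] args) throws Exception {
            // The value is a page count, so 2000 pages of 32 KB is roughly
            // 64 MB of cache.  This could also go in derby.properties.
            System.setProperty("derby.storage.pageCacheSize", "2000");
            Connection conn = DriverManager.getConnection(
                    "jdbc:derby:scratchdb;create=true");
            // ... run the inserts ...
            conn.close();
        }
    }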

- is your system CPU-bound? is there any disk io occurring at all?
Yes, it appears to be CPU-bound.  I'm not sure how much disk I/O is
really occurring, but we are definitely using all available CPU time.
Unfortunately, all of that CPU time is spent on a single core.

- how much garbage are you generating, what is the GC profile like?
Not too much.  Garbage collections take up about 8 seconds over the
course of 5 minutes of work.  The heap size is less than the maximum
allocated, so I am not thrashing. (I have had issues where index
generation uses an amazing amount of memory, but that is a story for
another day.)
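
(The 8-second figure comes from the JVM's own GC counters, read roughly
like this at the end of a run:)

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcTimeCheck {
        public static void main(String[] args) {
            long totalMs = 0;
            // Sum the accumulated collection time across all collectors.
            for (GarbageCollectorMXBean gc :
                     ManagementFactory.getGarbageCollectorMXBeans()) {
                totalMs += gc.getCollectionTime();
            }
            System.out.println("Total GC time so far: " + totalMs + " ms");
        }
    }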

- what rev of the JDK are you running, and on what operating system?
Sun JDK 6u16, on Ubuntu 9.10.  I am using Derby 10.5.3.0.

- is your system multi-core? Can you throw threads/cores at this problem?
Yes.  I have been doing my work on a 1.86 GHz Core 2 and, more
recently, on a 2.8 GHz i7.  I haven't had much luck throwing more
threads/cores at the problem, however.  I have tried running tests
where, instead of prepping a file into a single table, I run several
threads that each prep the file into a different table.
Unfortunately, I don't get any more throughput this way, and while my
threads are clearly running on different cores, I still only end up
using a single core's worth of CPU.  I assume this means that
somewhere under the hood everything is being serialized.
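
(Roughly, the parallel version I tried looks like the sketch below; the
table names, column names, and row counts are just placeholders.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelInsertTest {

        // Each thread gets its own connection and batches rows into its
        // own table, so nothing is shared above the Derby engine itself.
        static void insertInto(String table, int rows) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:derby:scratchdb");
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO " + table + " (ID, VAL) VALUES (?, ?)");
            for (int i = 0; i < rows; i++) {
                ps.setInt(1, i);
                ps.setString(2, "row " + i);
                ps.addBatch();
                if (i % 1000 == 999) {
                    ps.executeBatch();
                }
            }
            ps.executeBatch();
            conn.commit();
            ps.close();
            conn.close();
        }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                final String table = "MYTABLE_" + t;  // one table per thread
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            insertInto(table, 250000);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }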

- can you get any sort of profiling data about where the hotspots
  are during your run?
Well, YourKit has helped me a bit in this regard.  When I am executing
prepared statements in batches, around 90% of the program's time is
spent inside the EmbedStatement.executeBatch method.  Unfortunately I
don't know exactly what it is doing in there.  Similarly, I can see
that when I am inserting everything using a table function (INSERT
INTO MYTABLE (...) SELECT S.* FROM TABLE (MYFUNC ()) S), about 90% of
the program's time is spent inside EmbedStatement.execute, of which
only about 5% is spent getting information from the parser.  Again, I
don't know where the rest goes.

Anyway, if you or anyone else has advice on what else I can do to
improve my performance, I'd love to hear it.  Thanks again for your
time.

Cheers,
Nathan

On Wed, Jan 13, 2010 at 12:30 PM, Bryan Pendleton
bpendle...@amberpoint.com wrote:
 using prepared statements executed in batches, and this gives me
 between 10 and 20 row inserts per millisecond on average.  I have

 Wow! You're already going very fast, and you've already put a lot
 of good effort into this, so I congratulate you on how far you've
 gone already.

 A couple of thoughts occurred to me as I read your post:

 - have you tried using in-memory databases rather than test durability?
  - have you experimented with using a larger page size to minimize
   per-page overhead?
  - have you experimented with giving a larger-than-default page cache?

 Also, have you done any system-level analysis of your application
 during the insert:
  - is your system CPU-bound? is there any disk io occurring at all?
  - how much garbage are you generating, what is the GC profile like?
  - what rev of the JDK are you running, and on what operating system?
  - is your system multi-core? Can you throw threads/cores at this problem?
  - can you get any sort of profiling data about where the hotspots
   are during your run?

 Good luck with your performance work, it sounds extremely interesting!

 thanks,

 bryan




Improving data insert performance

2010-01-13 Thread Nathan Boy
Hello,
   I have an embedded database application that generally involves
inserting somewhere between 50k and 1000k rows of data into a
database, and then analyzing and querying that data afterwards.  The
data goes into about ten tables, but the bulk of the data is in just a
few of them.  I run my database with test durability, and I add all
of the primary key and foreign key constraints after the data is
inserted into the database.  Currently all of the data is inserted
using prepared statements executed in batches, and this gives me
between 10 and 20 row inserts per millisecond on average.  I have
spent quite a bit of time optimizing the insert step, and while I was
doing research I came across this discussion from last year:

http://www.mail-archive.com/derby-user@db.apache.org/msg10194.html

The discussion suggests using bulk import as a way to speed up this
initial insert step.  Unfortunately, I cannot use the built in import
functions, as my data includes timestamps with nanosecond granularity.
As far as I can tell, there is no way to convince Derby to parse a
timestamp specified down to the nanosecond.  In one of the emails, someone
suggested that you can get bulk import performance by using a table
function, and then running a query like INSERT INTO MYTABLE (...)
SELECT S.* FROM TABLE (MYFUNC ()) S.  In my tests, however, this
doesn't seem to perform the insert any faster than simply inserting
the rows one at a time with a prepared statement.  I think this may be
because I don't have a way to set the 'insertMode=bulkImport'
property, which the bulk import system procedure is allowed to do.
Does anyone know of a way to work around this, or of a better way to
get my data into the database as quickly as possible?  Thanks in
advance for your time.

Cheers,
Nathan Boy
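
P.S. For concreteness, the table function approach I tested is set up
roughly as below; the column names and the MyRowReader class are
placeholders for my actual file reader, which exposes the parsed rows
as a java.sql.ResultSet.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TableFunctionLoad {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:derby:scratchdb");
            Statement s = conn.createStatement();
            // Register the table function.  MyRowReader.read() is a public
            // static method returning a java.sql.ResultSet over the file.
            s.execute("CREATE FUNCTION MYFUNC() "
                    + "RETURNS TABLE (ID INT, TS TIMESTAMP, VAL VARCHAR(100)) "
                    + "LANGUAGE JAVA PARAMETER STYLE DERBY_JDBC_RESULT_SET "
                    + "READS SQL DATA "
                    + "EXTERNAL NAME 'com.example.MyRowReader.read'");
            // Stream the rows straight from the reader into the target table.
            s.execute("INSERT INTO MYTABLE (ID, TS, VAL) "
                    + "SELECT S.ID, S.TS, S.VAL FROM TABLE (MYFUNC()) S");
            s.close();
            conn.close();
        }
    }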


Re: Improving data insert performance

2010-01-13 Thread Bryan Pendleton

using prepared statements executed in batches, and this gives me
between 10 and 20 row inserts per millisecond on average.  I have


Wow! You're already going very fast, and you've already put a lot
of good effort into this, so I congratulate you on how far you've
gone already.

A couple of thoughts occurred to me as I read your post:

 - have you tried using in-memory databases rather than test durability?
 - have you experimented with using a larger page size to minimize
   per-page overhead?
 - have you experimented with giving a larger-than-default page cache?

Also, have you done any system-level analysis of your application
during the insert:
 - is your system CPU-bound? is there any disk io occurring at all?
 - how much garbage are you generating, what is the GC profile like?
 - what rev of the JDK are you running, and on what operating system?
 - is your system multi-core? Can you throw threads/cores at this problem?
 - can you get any sort of profiling data about where the hotspots
   are during your run?

Good luck with your performance work, it sounds extremely interesting!

thanks,

bryan



Re: Improving data insert performance

2010-01-13 Thread Peter Ondruška
Also try using larger log files (10+ MB), and if you rarely change the
data you load (which I guess is your case), you may want to use
pageReservedSpace=0 and a large pageSize (32 KB?).
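
Something along these lines (property names are from the Derby tuning
guide, values are just examples):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadTuning {
        public static void main(String[] args) throws Exception {
            // Log file size is a system-wide property read at boot:
            // roughly 10 MB per log file instead of the 1 MB default.
            System.setProperty("derby.storage.logSwitchInterval", "10485760");
            Connection conn = DriverManager.getConnection(
                    "jdbc:derby:scratchdb;create=true");
            Statement s = conn.createStatement();
            // Database-wide properties; set them before creating the tables.
            // pageReservedSpace=0 keeps no per-page free space for future
            // updates, which is fine if the loaded data is rarely changed.
            s.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                    + "'derby.storage.pageReservedSpace', '0')");
            s.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                    + "'derby.storage.pageSize', '32768')");
            s.close();
            conn.close();
        }
    }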

On Wed, Jan 13, 2010 at 5:01 PM, Nathan Boy nathan@gmail.com wrote:
 Hello,
   I have an embedded database application that generally involves
 inserting somewhere between 50k and 1000k rows of data into a
 database, and then analyzing and querying that data afterwards.  The
 data goes into about ten tables, but the bulk of the data is in just a
 few of them.  I run my database with test durability, and I add all
 of the primary key and foreign key constraints after the data is
 inserted into the database.  Currently all of the data is inserted
 using prepared statements executed in batches, and this gives me
 between 10 and 20 row inserts per millisecond on average.  I have
 spent quite a bit of time optimizing the insert step, and while I was
 doing research I came across this discussion from last year:

 http://www.mail-archive.com/derby-user@db.apache.org/msg10194.html

 The discussion suggests using bulk import as a way to speed up this
 initial insert step.  Unfortunately, I cannot use the built in import
 functions, as my data includes Timestamps with nanosecond granularity.
  As far as I can tell, there is no way to convince derby to parse a
 time specified down to the nanosecond.  In one of the emails, someone
 suggested that you can get bulk import performance by using a table
 function, and then running a query like INSERT INTO MYTABLE (...)
 SELECT S.* FROM TABLE (MYFUNC ()) S.  In my tests, however, this
 doesn't seem to perform the insert any faster than simply inserting
 the rows one at a time with a prepared statement.  I think this may be
 because I don't have a way to set the 'insertMode=bulkImport'
 property, which the bulk import system procedure is allowed to do.
 Does anyone know of a way to work around this, or of a better way to
 get my data into the database as quickly as possible?  Thanks in
 advance for your time.

 Cheers,
 Nathan Boy