Re: Improving data insert performance
Hi Bryan,

Thanks for replying. I'll try to answer your questions point by point:

- have you tried using in-memory databases rather than test durability?

I have done some testing with in-memory databases, but I get virtually identical results. This leads me to believe that my bottleneck is probably not disk I/O.

- have you experimented with using a larger page size to minimize per-page overhead?

Yes. Right now I am using a page size of 32768, which appears to offer a modest improvement over 4096. Sizes larger than that don't seem to have much of an effect.

- have you experimented with giving a larger-than-default page cache?

Yes. Right now I am using a page cache size of 2000. Again, sizes larger than that don't seem to have much of an effect.

- is your system CPU-bound? is there any disk io occurring at all?

Yes, it appears to be CPU-bound. I'm not sure how much disk I/O is really occurring, but we are definitely using up all available CPU time. Unfortunately, we only use all the CPU time on a single core.

- how much garbage are you generating, what is the GC profile like?

Not too much. Garbage collections take up about 8 seconds over the course of 5 minutes of work. The heap size is less than the maximum allocated, so I am not thrashing. (I have had issues where index generation uses an amazing amount of memory, but that is a story for another day.)

- what rev of the JDK are you running, and on what operating system?

Sun JDK 6u16, on Ubuntu 9.10. I am using Derby 10.5.3.0.

- is your system multi-core? Can you throw threads/cores at this problem?

Yes. I have been doing my work on a 1.86 GHz Core 2 and, more recently, on a 2.8 GHz i7. I haven't had much luck throwing more threads/cores at the problem, however. I have tried to run some tests where, instead of just prepping a file into a single table, I run several threads that each prep the file into a different table.
Unfortunately, I don't actually get more throughput this way, and while my threads are clearly running on different cores, I still only end up using a single core's worth of CPU. I assume this means that somewhere under the hood everything is being serialized.

- can you get any sort of profiling data about where the hotspots are during your run?

Well, YourKit has helped me a bit in this regard. When I am executing prepared statements in batches, around 90% of the program time is spent inside the EmbedStatement.executeBatch method. Unfortunately, I don't know exactly what it is doing in there. Similarly, I can see that when I am inserting everything using a table function (INSERT INTO MYTABLE (...) SELECT S.* FROM TABLE (MYFUNC()) S), about 90% of the program time is spent inside the method EmbedStatement.execute, of which only 5% is spent getting information from the parser. Again, I don't know where the rest goes.

Anyway, if you or anyone else has advice on what else I can do to improve my performance, I'd love to hear it. Thanks again for your time.

Cheers,
Nathan

On Wed, Jan 13, 2010 at 12:30 PM, Bryan Pendleton <bpendle...@amberpoint.com> wrote:
> > using prepared statements executed in batches, and this gives me
> > between 10 and 20 row inserts per millisecond on average.
>
> Wow! You're already going very fast, and you've already put a lot of good
> effort into this, so I congratulate you on how far you've gone already.
> A couple of thoughts occurred to me as I read your post:
>
> - have you tried using in-memory databases rather than test durability?
> - have you experimented with using a larger page size to minimize per-page overhead?
> - have you experimented with giving a larger-than-default page cache?
>
> Also, have you done any system-level analysis of your application during the insert:
>
> - is your system CPU-bound? is there any disk io occurring at all?
> - how much garbage are you generating, what is the GC profile like?
> - what rev of the JDK are you running, and on what operating system?
> - is your system multi-core? Can you throw threads/cores at this problem?
> - can you get any sort of profiling data about where the hotspots are during your run?
>
> Good luck with your performance work, it sounds extremely interesting!
>
> thanks,
> bryan
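For anyone trying to reproduce this setup: the page size and page cache values discussed above are controlled through Derby system properties, and the in-memory back end (available since 10.5) is selected through the JDBC URL. A sketch of the configuration being described — property names are from the Derby tuning guide, and `mydb` is a placeholder database name:

```properties
# derby.properties (or equivalent -D system properties), read at engine boot
derby.storage.pageSize=32768        # per-table page size; 32768 is the largest supported
derby.storage.pageCacheSize=2000    # number of pages held in the cache (default 1000)
derby.system.durability=test        # relaxed durability, as used in these tests

# To rule out disk I/O entirely, connect to an in-memory database instead:
#   jdbc:derby:memory:mydb;create=true
```

Note that `derby.storage.pageSize` only affects tables created after the property is set; existing conglomerates keep the page size they were created with.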
Improving data insert performance
Hello,

I have an embedded database application that generally involves inserting somewhere between 50k and 1000k rows of data into a database, and then analyzing and querying that data afterwards. The data goes into about ten tables, but the bulk of the data is in just a few of them. I run my database with test durability, and I add all of the primary key and foreign key constraints after the data is inserted into the database. Currently all of the data is inserted using prepared statements executed in batches, and this gives me between 10 and 20 row inserts per millisecond on average.

I have spent quite a bit of time optimizing the insert step, and while I was doing research I came across this discussion from last year:

http://www.mail-archive.com/derby-user@db.apache.org/msg10194.html

The discussion suggests using bulk import as a way to speed up this initial insert step. Unfortunately, I cannot use the built-in import functions, as my data includes timestamps with nanosecond granularity. As far as I can tell, there is no way to convince Derby to parse a time specified down to the nanosecond.

In one of the emails, someone suggested that you can get bulk import performance by using a table function, and then running a query like INSERT INTO MYTABLE (...) SELECT S.* FROM TABLE (MYFUNC()) S. In my tests, however, this doesn't seem to perform the insert any faster than simply inserting the rows one at a time with a prepared statement. I think this may be because I don't have a way to set the 'insertMode=bulkImport' property, which the bulk import system procedure is allowed to set. Does anyone know of a way to work around this, or of a better way to get my data into the database as quickly as possible?

Thanks in advance for your time.

Cheers,
Nathan Boy
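One detail worth noting about the nanosecond problem: while Derby's import procedures parse timestamps from text, java.sql.Timestamp itself carries full nanosecond precision, so values built in Java code — inside a table function's ResultSet, or bound to a prepared statement with setTimestamp — keep their granularity. A minimal stdlib-only illustration (the timestamp literal here is just an example value):

```java
import java.sql.Timestamp;

public class NanoTimestampDemo {
    public static void main(String[] args) {
        // Timestamp.valueOf accepts up to nine fractional-second digits,
        // and getNanos() returns the fractional part in nanoseconds.
        Timestamp ts = Timestamp.valueOf("2010-01-13 12:30:00.123456789");
        System.out.println(ts.getNanos()); // prints 123456789
    }
}
```

So the nanosecond limitation applies to the text-parsing path of the built-in import, not to data handed to Derby through the JDBC type system — which is part of why the table-function approach is attractive here.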
Re: Improving data insert performance
> using prepared statements executed in batches, and this gives me
> between 10 and 20 row inserts per millisecond on average.

Wow! You're already going very fast, and you've already put a lot of good effort into this, so I congratulate you on how far you've gone already. A couple of thoughts occurred to me as I read your post:

- have you tried using in-memory databases rather than test durability?
- have you experimented with using a larger page size to minimize per-page overhead?
- have you experimented with giving a larger-than-default page cache?

Also, have you done any system-level analysis of your application during the insert:

- is your system CPU-bound? is there any disk io occurring at all?
- how much garbage are you generating, what is the GC profile like?
- what rev of the JDK are you running, and on what operating system?
- is your system multi-core? Can you throw threads/cores at this problem?
- can you get any sort of profiling data about where the hotspots are during your run?

Good luck with your performance work, it sounds extremely interesting!

thanks,
bryan
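For the GC question in the checklist above, one lightweight way to get the numbers without attaching an external profiler is the JDK's built-in management beans (available since Java 5, so usable on the 6u16 JVM mentioned in this thread). A sketch that prints cumulative collection counts and time:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProfile {
    public static void main(String[] args) {
        long totalCount = 0, totalMillis = 0;
        // One MXBean per collector (e.g. young-gen and old-gen); counts and
        // times are cumulative since JVM start. -1 means "undefined".
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalCount += Math.max(0, gc.getCollectionCount());
            totalMillis += Math.max(0, gc.getCollectionTime());
        }
        System.out.println("total: " + totalMillis + " ms in " + totalCount + " collections");
    }
}
```

Calling this at the start and end of the insert run and diffing the totals gives roughly the "8 seconds over 5 minutes" figure reported above, without any profiler overhead.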
Re: Improving data insert performance
Also try using larger log files (10+ MB), and if you rarely change the data you load (which I guess is your case) you may want to use pageReservedSpace=0 and a large pageSize (32kb?).

On Wed, Jan 13, 2010 at 5:01 PM, Nathan Boy <nathan@gmail.com> wrote:
> Hello, I have an embedded database application that generally involves
> inserting somewhere between 50k and 1000k rows of data into a database,
> and then analyzing and querying that data afterwards.
> [...]
> Thanks in advance for your time. Cheers, Nathan Boy
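For reference, the knobs suggested in this reply map onto Derby storage properties roughly as follows — a sketch only, with property names as given in the Derby tuning guide (defaults worth double-checking against the docs for your release):

```properties
# derby.properties - settings suggested in this thread
derby.storage.logFileSize=10485760   # bytes per transaction log file (default ~1 MB);
                                     # larger files mean fewer log-switch pauses
derby.storage.pageReservedSpace=0    # percent of each page kept free for row expansion
                                     # on update; 0 suits insert-once, read-mostly data
derby.storage.pageSize=32768         # largest supported page size
```

As with the page-size property, pageReservedSpace only affects tables created after it is set, so these belong in derby.properties before the schema is created.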