Hi all, I've noticed a few performance problems while using the Virtuoso JDBC driver (virtjdbc3.jar, from the Open Source release version 5.0.6) to upload triples to a Virtuoso server.
First of all, I've been trying to do batch queries using Statement.addBatch() and Statement.executeBatch(). While this executes OK, it appears that the driver isn't actually batching up the queries to the database: they are executed one by one, with a round trip for each (the code in VirtuosoStatement.executeBatch() and VirtuosoResultSet.process_result() seems to confirm this). This makes batch execution no different from executing individual queries, which isn't very efficient for large transactions.

Another issue is that the driver is very slow while adding statements to a batch using Statement.addBatch(). Profiling our test application, we've seen >90% of the time spent inside a single method: openlink.util.Vector.ensureCapacityHelper(int). Looking at the code for this, it's a bit strange: it appears to be a copy of the standard java.util.Vector class with some changes. In particular, ensureCapacityHelper(int) now reallocates and copies the Vector contents every time it's called, and this method is called every time something is added to a Vector. The end result, especially with large batches, is very, very slow: O(N^2) where N is the number of items added to the array.

Also, VirtuosoStatement.addBatch() creates the Vector for the batch with an increment size of 10. This means that even if the above problem in the Vector class is fixed, the Vector will still be reallocated every 10 additions. Just using the default value of 0 here is much better, as the Vector then doubles its allocated size on each reallocation, so there are only log2(N) reallocations of the Vector contents.

We've patched our driver to work around these Vector problems, but it would be nice to get a proper fix for this into the distribution. It would also be interesting to know why this Vector class is used in the driver instead of the standard java.util.Vector anyway; the original does seem a lot better...
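To illustrate the difference between the two growth policies, here is a small stand-alone simulation (the class and method names are ours, not the driver's) that counts array reallocations for a fixed increment of 10 versus capacity doubling:

```java
public class GrowthPolicy {
    // Fixed increment: grow the capacity by `inc` whenever the array is full.
    static int reallocsFixed(int n, int initialCap, int inc) {
        int cap = initialCap, reallocs = 0;
        for (int size = 0; size < n; size++) {
            if (size == cap) { cap += inc; reallocs++; }
        }
        return reallocs;
    }

    // Doubling: grow the capacity by 2x whenever the array is full.
    static int reallocsDoubling(int n, int initialCap) {
        int cap = initialCap, reallocs = 0;
        for (int size = 0; size < n; size++) {
            if (size == cap) { cap *= 2; reallocs++; }
        }
        return reallocs;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println(reallocsFixed(n, 10, 10)); // prints 9999
        System.out.println(reallocsDoubling(n, 10));  // prints 14
    }
}
```

For a 100,000-element batch that's 9,999 reallocate-and-copy passes with the fixed increment versus 14 with doubling, and that's before accounting for the ensureCapacityHelper() bug, which reallocates on every single add.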
When not doing batch updates but just individual SPARQL insert queries, we hit another problem: now the bottleneck appears to be the server.

First of all, doing a single SPARQL insert query with a very large number of triples in it causes the server to segfault. We realise it's not necessarily reasonable to do such huge queries (~100,000 triples in a single query :-), but you probably want to return an error instead of crashing!

Doing more reasonably sized queries, we find that it takes ~70 seconds to insert 100,000 triples via SPARQL insert statements. This was done using 100 queries containing 1,000 triples each. This rate is quite a lot lower than the bulk load rate we see when using the ttlp() stored procedure, which gives us about 14 seconds per 100,000 triples. Is there any way we can get approximately the same performance while doing bulk inserts, for example by disabling indexes while we're doing the upload?

Regards,
Jan
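P.S. For reference, here is a sketch of how we split the triples into 1,000-triple insert queries. The graph IRI, class name, and exact query shape are illustrative, not our production code; each resulting string is then sent with a plain java.sql.Statement.execute():

```java
import java.util.ArrayList;
import java.util.List;

public class InsertChunker {
    // Split pre-serialised triples (e.g. "<s> <p> <o>" strings) into
    // SPARQL INSERT queries of at most chunkSize triples each.
    static List<String> buildInsertQueries(List<String> triples,
                                           String graph, int chunkSize) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < triples.size(); i += chunkSize) {
            StringBuilder q = new StringBuilder(
                "SPARQL INSERT INTO GRAPH <" + graph + "> { ");
            int end = Math.min(i + chunkSize, triples.size());
            for (String t : triples.subList(i, end)) {
                q.append(t).append(" . ");
            }
            q.append("}");
            queries.add(q.toString());
        }
        return queries;
    }
}
```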