Hi all,

I've noticed a few performance problems while using the Virtuoso JDBC driver
(virtjdbc3.jar, from the Open Source release version 5.0.6) to upload
triples to a Virtuoso server.

First of all, I've been trying to do batch queries using
Statement.addBatch() and Statement.executeBatch().  While this executes OK,
it appears that the driver isn't actually batching up the queries to the
database.  It looks as if these queries are executed one by one, with a
round trip for each (the code in VirtuosoStatement.executeBatch() and
VirtuosoResultSet.process_result() seems to confirm this).  This makes batch
transactions no different from executing the queries one at a time, which isn't
very efficient for large transactions.
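
For reference, here's a rough sketch of the kind of batching we're doing
(connection details and graph IRI are just placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BatchInsertTest {
        public static void main(String[] args) throws Exception {
            Class.forName("virtuoso.jdbc3.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:virtuoso://localhost:1111", "dba", "dba");
            conn.setAutoCommit(false);
            Statement stmt = conn.createStatement();
            for (int i = 0; i < 1000; i++) {
                // addBatch() should just queue the query locally; the
                // round trip(s) should only happen in executeBatch().
                stmt.addBatch("SPARQL INSERT INTO GRAPH <http://example.org/g> { "
                        + "<http://example.org/s" + i + "> "
                        + "<http://example.org/p> \"o\" }");
            }
            stmt.executeBatch();
            conn.commit();
            stmt.close();
            conn.close();
        }
    }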

Another issue is that the driver is very slow while adding statements to a
batch using Statement.addBatch().  Profiling our test application, we've
seen >90% of the time spent inside a single method:
openlink.util.Vector.ensureCapacityHelper(int).  Looking at the code for
this, it's a bit strange: this appears to be a copy of the standard
java.util.Vector class with some changes.  In particular,
ensureCapacityHelper(int) now reallocates and copies the Vector content
every time it's called.  And this method is called every time something is
added to a Vector!  The end result, especially when dealing with large
batches, is very, very slow: the total copying work is O(N^2), where N is the
number of items added to the Vector.

Also, in VirtuosoStatement.addBatch(), a Vector is created for the batch with
an increment size of 10.  This means that even if the above problem in the
Vector class is fixed, the Vector will still be reallocated every 10
additions.  Just using the default value of 0 here would be much better: the
Vector then doubles its allocated size when full, so there are only about
log2(N) reallocations of the vector content.
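
A sketch of the kind of fix we have in mind, with growth semantics like
java.util.Vector's (the field names below follow java.util.Vector; the actual
openlink.util.Vector internals may differ):

    // Only reallocate when the new element doesn't fit, and double the
    // capacity by default instead of adding a fixed increment.
    protected Object[] elementData;
    protected int elementCount;
    protected int capacityIncrement;

    private void ensureCapacityHelper(int minCapacity) {
        int oldCapacity = elementData.length;
        if (minCapacity <= oldCapacity)
            return; // common case: capacity already sufficient, no copy
        int newCapacity = (capacityIncrement > 0)
                ? oldCapacity + capacityIncrement
                : oldCapacity * 2; // doubling => ~log2(N) reallocations
        if (newCapacity < minCapacity)
            newCapacity = minCapacity;
        Object[] newData = new Object[newCapacity];
        System.arraycopy(elementData, 0, newData, 0, elementCount);
        elementData = newData;
    }
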
We've patched our driver to work around these Vector problems, but it would
be nice to get a proper fix for this into the distribution.  It would be
interesting to know why this Vector class is used in the driver instead of
the standard java.util.Vector one anyway; the original does seem a lot
better...

When not doing batch updates but just individual SPARQL insert queries, we
hit another problem: now the bottleneck appears to be the server.
First of all, doing a single SPARQL insert query with a very large number of
triples in it causes the server to segfault.  We realise it's not
necessarily reasonable to do such huge queries (~ 100,000 triples in a
single query :-), but you probably want to return an error instead of
crashing!

Doing more reasonably sized queries, we find that it takes ~ 70 seconds to
insert 100,000 triples via SPARQL insert statements.  This was done using
100 queries containing 1000 triples each.  This rate is quite a lot lower
than the bulk load rate seen when using the ttlp() stored procedure, which
gives us about 14 seconds per 100,000 triples (roughly a 5x difference).  Is
there any
way we can get approximately the same performance while doing bulk inserts,
for example by disabling indexes while we're doing the upload?
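
For reference, the ttlp() timings above came from loading the same data as
Turtle text through the stored procedure, roughly as in this sketch
(connection details, graph IRI, and the Turtle content are placeholders):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class TtlpLoadTest {
        public static void main(String[] args) throws Exception {
            Class.forName("virtuoso.jdbc3.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:virtuoso://localhost:1111", "dba", "dba");
            // ttl holds one chunk of serialized Turtle; how we produce
            // it is outside the scope of this sketch.
            String ttl = "<http://example.org/s> <http://example.org/p> \"o\" .";
            CallableStatement cs =
                    conn.prepareCall("{call DB.DBA.TTLP(?, ?, ?)}");
            cs.setString(1, ttl);                    // Turtle text
            cs.setString(2, "");                     // base IRI
            cs.setString(3, "http://example.org/g"); // target graph
            cs.execute();
            cs.close();
            conn.close();
        }
    }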

Regards,
Jan
