Hi Jan,
We are looking into the performance issues you have reported and will
respond to you on them shortly.
With regards to the server crash you reported when performing a SPARQL
insert query: was anything written to the Virtuoso server log
(virtuoso.log) at the time of the crash, and was a core file created
as a result? If you can provide a test case for reproducing this
problem, we would be keen to reproduce it in-house.
Best Regards
Hugh Williams
Professional Services
OpenLink Software
On 20 Jun 2008, at 12:12, Jan Stette wrote:
Hi all,
I've noticed a few performance problems while using the Virtuoso
JDBC driver (virtjdbc3.jar, from the Open Source release version
5.0.6) to upload triples to a Virtuoso server.
First of all, I've been trying to do batch queries using
Statement.addBatch() and Statement.executeBatch(). While this
executes OK, it appears that the driver isn't actually batching up
the queries to the database. It looks as if these queries are
executed one by one, with a round trip for each (the code in
VirtuosoStatement.executeBatch() and
VirtuosoResultSet.process_result() seems to confirm this). This
makes batch transactions no different from executing individual
queries, which isn't very efficient for large transactions.
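For reference, the pattern we're using looks roughly like the following
(a schematic sketch only; the connection URL, credentials and graph IRI
are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BatchSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details for a local Virtuoso server.
            Connection conn = DriverManager.getConnection(
                    "jdbc:virtuoso://localhost:1111", "dba", "dba");
            Statement stmt = conn.createStatement();
            for (int i = 0; i < 1000; i++) {
                // Each batch entry is a small SPARQL insert; the graph IRI
                // and triple values are illustrative.
                stmt.addBatch("SPARQL INSERT INTO GRAPH <http://example.org/g> { "
                        + "<http://example.org/s/" + i + "> "
                        + "<http://example.org/p> \"" + i + "\" . }");
            }
            // We expected this to go to the server in one round trip, but
            // the driver appears to execute each entry separately.
            stmt.executeBatch();
            stmt.close();
            conn.close();
        }
    }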
Another issue is that the driver is very slow while adding
statements to a batch using Statement.addBatch(). Profiling our
test application, we've seen >90% of the time spent inside a single
method: openlink.util.Vector.ensureCapacityHelper(int). Looking at
the code for this, it's a bit strange: this appears to be a copy of
the standard java.util.Vector class with some changes. In
particular, ensureCapacityHelper(int) now reallocates and copies
the Vector content every time it's called. And this method is
called every time something is added to a Vector! The end result,
especially dealing with large batches, is very, very slow,
basically O(N^2) where N is the number of items added to the array.
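To illustrate what we believe is going on, here is a sketch of the two
growth strategies (not the driver's actual code):

    // Broken pattern: reallocate and copy on every call, even when there
    // is already enough room -- each add copies O(N) elements, so N adds
    // cost O(N^2) in total.
    static Object[] ensureCapacityAlwaysCopies(Object[] data, int minCapacity) {
        Object[] copy = new Object[Math.max(minCapacity, data.length)];
        System.arraycopy(data, 0, copy, 0, data.length);
        return copy;
    }

    // What java.util.Vector effectively does: grow only when needed, and
    // grow geometrically, so each element is copied a constant number of
    // times on average.
    static Object[] ensureCapacityDoubling(Object[] data, int minCapacity) {
        if (minCapacity <= data.length)
            return data;                 // enough room already, no copy
        int newLength = Math.max(minCapacity, data.length * 2);
        Object[] copy = new Object[newLength];
        System.arraycopy(data, 0, copy, 0, data.length);
        return copy;
    }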
Also, VirtuosoStatement.addBatch() creates a Vector for the batch,
passing in an increment size of 10. This means that even if
the above problem in the Vector class is fixed, it will re-allocate
the Vector every 10 additions. Just using the default value of 0
for this is much better, as the Vector will then double its
allocated size, hence there are only log2(N) reallocations of the
vector content.
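In java.util.Vector terms (which the driver's class appears to mirror),
the difference is just the second constructor argument:

    // capacityIncrement = 10: the capacity grows by a fixed 10 slots each
    // time it fills up, so N additions still trigger roughly N/10 reallocations.
    java.util.Vector fixedStep = new java.util.Vector(10, 10);

    // capacityIncrement = 0 (the default behaviour): the capacity doubles
    // when full, giving only about log2(N) reallocations for N additions.
    java.util.Vector doubling = new java.util.Vector(10, 0);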
We've patched our driver to work around these Vector problems, but
it would be nice to get a proper fix for this into the
distribution. It would be interesting to know why this Vector
class is used in the driver instead of the standard
java.util.Vector anyway; the original does seem a lot better...
When not doing batch updates but just individual SPARQL insert
queries, we hit another problem: now the bottleneck appears to be
the server.
First of all, doing a single SPARQL insert query with a very large
number of triples in it causes the server to segfault. We realise
it's not necessarily reasonable to do such huge queries (~ 100,000
triples in a single query :-), but you probably want to return an
error instead of crashing!
Doing more reasonably sized queries, we find that it takes ~ 70
seconds to insert 100,000 triples via SPARQL insert statements.
This was done using 100 queries containing 1000 triples each. This
rate is quite a lot lower than the bulk load rate we see when
using the ttlp() stored procedure, which gives us rates of about 14
seconds per 100,000 triples. Is there any way we can get
approximately the same performance while doing bulk inserts, for
example by disabling indexes while we're doing the upload?
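For reference, the ttlp() figure above was measured through JDBC roughly
as follows (a sketch; the graph IRI is a placeholder and the Turtle
content is passed in as a string):

    // Assumes 'conn' is an open Virtuoso JDBC connection and 'turtleText'
    // holds the serialised triples in Turtle form.
    java.sql.PreparedStatement ps =
            conn.prepareStatement("DB.DBA.TTLP(?, '', ?)");
    ps.setString(1, turtleText);                 // Turtle document as a string
    ps.setString(2, "http://example.org/g");     // target graph IRI (placeholder)
    ps.execute();
    ps.close();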
Regards,
Jan