Hi Hugh,

I haven't got the logs or core dump at hand at the moment, but it should be
very easy to reproduce: just send the server a single query like "sparql
insert into graph <http://test> { <triple 1> . <triple 2> . <triple 3> .  ...
}", with 100,000 triples in the query.  As I said, I realise this is an
unreasonable thing to do, but it makes for an interesting test case anyway!
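
In case it helps, here is a minimal sketch of the kind of code we use to
trigger it.  The driver class, connection URL and credentials below are just
the defaults we happen to use, so adjust as needed:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CrashRepro {
        public static void main(String[] args) throws Exception {
            Class.forName("virtuoso.jdbc3.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:virtuoso://localhost:1111", "dba", "dba");
            Statement stmt = conn.createStatement();

            // Build one huge SPARQL insert containing 100,000 placeholder triples.
            StringBuilder q = new StringBuilder(
                    "sparql insert into graph <http://test> { ");
            for (int i = 0; i < 100000; i++) {
                q.append("<http://test/s").append(i)
                 .append("> <http://test/p> <http://test/o> . ");
            }
            q.append("}");

            stmt.execute(q.toString());  // the server segfaults on this single query
            stmt.close();
            conn.close();
        }
    }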

Let me know if you need more details to reproduce it.

Regards,
Jan


2008/6/20 Hugh Williams <hwilli...@openlinksw.com>:

> Hi Jan,
>
> We are looking into the performance issues you report and shall respond to
> you on them shortly.
>
> With regard to the server crash you report when performing a SPARQL insert
> query, was anything written to the Virtuoso Server log (virtuoso.log) at the
> time of the crash, and was a core file created as a result? If you can
> provide a test case for reproducing this problem we would be keen to
> reproduce it in-house.
>
> Best Regards
> Hugh Williams
> Professional Services
> OpenLink Software
>
>
> On 20 Jun 2008, at 12:12, Jan Stette wrote:
>
>> Hi all,
>>
>> I've noticed a few performance problems while using the Virtuoso JDBC
>> driver (virtjdbc3.jar, from the Open Source release version 5.0.6) to upload
>> triples to a Virtuoso server.
>>
>> First of all, I've been trying to do batch queries using
>> Statement.addBatch() and Statement.executeBatch().  While this executes OK,
>> it appears that the driver isn't actually batching up the queries to the
>> database.  It looks as if these queries are executed one by one, with a
>> round trip for each (the code in VirtuosoStatement.executeBatch() and
>> VirtuosoResultSet.process_result() seems to confirm this).  This makes batch
>> execution no different from executing individual queries, which isn't
>> very efficient for large transactions.
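>>
>> For reference, this is roughly how we drive the batch API (a minimal
>> sketch; the driver class, connection URL and credentials are placeholders
>> for our actual setup):
>>
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>     import java.sql.Statement;
>>
>>     public class BatchTest {
>>         public static void main(String[] args) throws Exception {
>>             Class.forName("virtuoso.jdbc3.Driver");
>>             Connection conn = DriverManager.getConnection(
>>                     "jdbc:virtuoso://localhost:1111", "dba", "dba");
>>             Statement stmt = conn.createStatement();
>>             for (int i = 0; i < 1000; i++) {
>>                 stmt.addBatch("sparql insert into graph <http://test> "
>>                         + "{ <http://test/s" + i
>>                         + "> <http://test/p> <http://test/o> . }");
>>             }
>>             // We expected this to be sent as one batched round trip, but
>>             // the driver appears to execute each statement individually.
>>             stmt.executeBatch();
>>             stmt.close();
>>             conn.close();
>>         }
>>     }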
>>
>> Another issue is that the driver is very slow while adding statements to a
>> batch using Statement.addBatch().  Profiling our test application, we've
>> seen >90% of the time spent inside a single method:
>> openlink.util.Vector.ensureCapacityHelper(int).  Looking at the code for
>> this, it's a bit strange: it appears to be a copy of the standard
>> java.util.Vector class with some changes.  In particular,
>> ensureCapacityHelper(int) now reallocates and copies the Vector content
>> every time it's called.  And this method is called every time something is
>> added to the Vector!  The end result, especially when dealing with large
>> batches, is very, very slow: basically O(N^2), where N is the number of
>> items added to the array.
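>>
>> To illustrate (this is a paraphrase of the pattern, not the actual
>> openlink.util.Vector source): if every add copies the whole backing array,
>> appending N elements copies O(N^2) elements in total:
>>
>>     // Paraphrased illustration of a copy-on-every-add growth policy.
>>     public class CopyEveryAdd {
>>         private Object[] data = new Object[0];
>>         private int size = 0;
>>
>>         public void add(Object o) {
>>             // Reallocate and copy the full contents on every single add:
>>             // N adds copy 0 + 1 + ... + (N-1) elements, i.e. O(N^2) work.
>>             Object[] bigger = new Object[size + 1];
>>             System.arraycopy(data, 0, bigger, 0, size);
>>             data = bigger;
>>             data[size++] = o;
>>         }
>>
>>         public static void main(String[] args) {
>>             CopyEveryAdd v = new CopyEveryAdd();
>>             long t0 = System.nanoTime();
>>             for (int i = 0; i < 100000; i++) v.add(Integer.valueOf(i));
>>             System.out.println("100,000 adds: "
>>                     + (System.nanoTime() - t0) / 1000000 + " ms");
>>         }
>>     }
>>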
>> Also, VirtuosoStatement.addBatch() creates the Vector for the batch
>> passing in an increment size of 10.  This means that even if the above
>> problem in the Vector class is fixed, it will re-allocate the Vector every
>> 10 additions.  Just using the default value of 0 for this is much better, as
>> the Vector will then double its allocated size, hence there are only log2(N)
>> reallocations of the vector content.
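>>
>> A sketch of the doubling policy we mean (again illustrative, not the
>> driver's actual code):
>>
>>     // Illustrative doubling growth policy: N adds cause only ~log2(N)
>>     // reallocations, so the total copying work is O(N) amortised.
>>     public class DoublingVector {
>>         private Object[] data = new Object[16];
>>         private int size = 0;
>>
>>         public void add(Object o) {
>>             if (size == data.length) {
>>                 // Double the capacity instead of growing by a fixed increment.
>>                 data = java.util.Arrays.copyOf(data, data.length * 2);
>>             }
>>             data[size++] = o;
>>         }
>>
>>         public static void main(String[] args) {
>>             DoublingVector v = new DoublingVector();
>>             for (int i = 0; i < 100000; i++) v.add(Integer.valueOf(i));
>>             System.out.println("done, size = " + v.size);
>>         }
>>     }
>>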
>> We've patched our driver to work around these Vector problems, but it
>> would be nice to get a proper fix for this into the distribution.  It would
>> also be interesting to know why this Vector class is used in the driver
>> instead of the standard java.util.Vector, as the original does seem a lot
>> better...
>>
>> When not doing batch updates but just individual SPARQL insert queries, we
>> hit another problem: now the bottleneck appears to be the server.
>>
>> First of all, doing a single SPARQL insert query with a very large number
>> of triples in it causes the server to segfault.  We realise it's not
>> necessarily reasonable to do such huge queries (~ 100,000 triples in a
>> single query :-), but you probably want to return an error instead of
>> crashing!
>>
>> Doing more reasonably sized queries, we find that it takes ~ 70 seconds to
>> insert 100,000 triples via SPARQL insert statements.  This was done using
>> 100 queries containing 1000 triples each.  This rate is quite a lot lower
>> than the bulk load rate seen when using the ttlp() stored procedure,
>> which gives us rates of about 14 seconds per 100,000 triples.  Is there any
>> way we can get approximately the same performance while doing bulk inserts,
>> for example by disabling indexes while we're doing the upload?
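>>
>> For comparison, this is roughly how we call the bulk loader from JDBC (a
>> sketch; the base URI, target graph and the flags value of 0 are just what
>> we happen to use):
>>
>>     import java.sql.CallableStatement;
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>
>>     public class TtlpLoad {
>>         public static void main(String[] args) throws Exception {
>>             Class.forName("virtuoso.jdbc3.Driver");
>>             Connection conn = DriverManager.getConnection(
>>                     "jdbc:virtuoso://localhost:1111", "dba", "dba");
>>
>>             // In our tests this string holds 100,000 triples in Turtle syntax.
>>             String turtleText =
>>                     "<http://test/s> <http://test/p> <http://test/o> .";
>>
>>             // DB.DBA.TTLP(text, base, graph, flags)
>>             CallableStatement cs =
>>                     conn.prepareCall("{ call DB.DBA.TTLP(?, ?, ?, ?) }");
>>             cs.setString(1, turtleText);
>>             cs.setString(2, "");             // base URI
>>             cs.setString(3, "http://test");  // target graph
>>             cs.setInt(4, 0);                 // flags
>>             cs.execute();
>>             cs.close();
>>             conn.close();
>>         }
>>     }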
>>
>> Regards,
>> Jan
>>
>
>
