Hi Marc-Alexandre,
I have a huge number of triples in N3 format, and some triples have
very large literals (the complete GenBank in N3; some literals are
complete sequences). The total number of triples is above 6 billion,
and the uncompressed data is around 700 gigabytes. I just can't find
a way to load it completely into Virtuoso.
You did not mention the version of Virtuoso you are running, but if
you are using the open source release, I suggest you first upgrade to
version 6.1.0, which we just released on SourceForge:
http://sourceforge.net/projects/virtuoso
This version supports the much larger NumberOfBuffers settings you
need to load such large data sets. There are also a number of bug
fixes, including some memory-leak fixes for problems you might have
encountered.
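As a rough starting point (these numbers are an assumption based on
the usual sizing rule, not something taken from your setup): each
buffer is an 8 KB page, and a common rule of thumb is to give
Virtuoso around two thirds of free RAM. On a 128 GB machine the
[Parameters] section of virtuoso.ini might look something like:

    [Parameters]
    NumberOfBuffers = 10900000   ; ~83 GB of 8 KB pages
    MaxDirtyBuffers = 8000000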
I've tried the provided script for loading Bio2RDF data, located here:
http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloading
but even though it is visibly working, the server always eventually
dies with a segmentation fault. And since this script doesn't report
which triple it was trying to load before the crash happened, I can't
restart it from the point where it stopped.
We made our own version of this script using Perl and the TTLP_MT
procedure. With our version, we know where the load was before a
crash, so we can start it again from that point. The important point
to notice here is that the server also crashes repeatedly after I
restart the load, and the time between two crashes grows shorter each
time. Eventually, it becomes impossible to continue the load.
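(As a rough sketch of the approach, one chunk can be loaded from isql
like this; the file path and graph IRI are placeholders, not our
actual values:)

    -- load one pre-split N3 chunk; flag 255 enables relaxed parsing
    DB.DBA.TTLP_MT (file_to_string_output ('/data/genbank/chunk_0001.n3'),
                    '', 'http://example.org/genbank', 255);
    -- force the committed state to disk so a later crash
    -- cannot roll this chunk back
    checkpoint;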
If you have not already, I suggest you first read this article:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning
We normally split large data files into smaller portions and use a
bulk-loader script:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoader
This script uses multiple threads to read the data into the quad
store. It keeps track of all the fragments it has processed
correctly, and records error information and line numbers for the
failed fragments so they can be fixed and retried.
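A typical bulk-load session looks roughly like the following from
isql (the directory, file mask, and graph IRI are placeholders you
would replace with your own):

    -- register all split files in a directory for loading into one graph
    ld_dir ('/data/genbank/split', '*.n3', 'http://example.org/genbank');
    -- run the loader; starting several of these on separate
    -- connections uses multiple threads/cores
    rdf_loader_run ();
    -- make the loaded data durable
    checkpoint;
    -- list any fragments that failed, with error details
    SELECT ll_file, ll_error FROM DB.DBA.load_list
      WHERE ll_error IS NOT NULL;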
I've tried to do some tuning in virtuoso.ini myself, but I've only
managed to get a minor speed boost.
Should I remove all indexes for the load, or add more?
The server used for the loading has 32 cores, 128 GB of RAM, and a
1.5 TB RAID array.
That is a nice powerful box you have there.
Can you email me privately ([email protected]) your operating
system and version, a current copy of your virtuoso.ini, and the
partition layout of your RAID array, so we can assist you in tuning
your virtuoso.ini to make better use of your resources?
I can provide you with the GenBank dump if you want to play with it.
If so, just tell me and I will give you the link and the corrections
that need to be made before loading it (there were some errors in the
RDFizer that generated the GenBank dump, but they are easily
corrected with a regexp before the load).
I will keep that in mind.
So, do you have an optimized virtuoso.ini you can suggest to help?
I am sure we can assist you. We operate databases like
http://lod.openlinksw.com/ that contain close to 9 billion quads and
counting.
P.S.: We have other large N3 dumps (some more than twice the size of
this one) that we have not managed to load. But I think that if we
can solve the problem with this dump, the solution will also work for
the other dumps.
Patrick