Hi Marc-Alexandre,
I have a huge number of triples in N3 format, and some triples have
very large literals (the complete GenBank in N3; some literals are
complete sequences). The total number of triples is above 6 billion,
and the uncompressed data is around 700 gigabytes. I just can't find
a way to load it completely into Virtuoso.
You did not mention the version of Virtuoso you are running, but if
you are using the open source release, I suggest you first upgrade to
version 6.1.0, which we just released on SourceForge:
http://sourceforge.net/projects/virtuoso
This version supports the much larger NumberOfBuffers settings you
need to load such large data sets. There are also a number of bug
fixes, including some memory-leak fixes for problems you might have
encountered.
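As a rough starting point (these numbers are an assumption based on
the usual sizing rule, not something taken from your setup): each
buffer is an 8 KB page, and a common rule of thumb is to give
Virtuoso around two thirds of free RAM. On a 128 GB machine the
[Parameters] section of virtuoso.ini might look something like:

    [Parameters]
    NumberOfBuffers = 10900000   ; ~83 GB of 8 KB pages
    MaxDirtyBuffers = 8000000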
I've tried the provided script for loading Bio2RDF data, located here:
http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloading
but even though it is visibly working, the server always eventually
dies with a segmentation fault. And since this script doesn't report
which triple it was trying to load before the crash happened, I can't
restart it from the point where it stopped.
We made our own version of this script using Perl and the TTLP_MT
procedure. With our version, we know where the load was before a
crash, so we can start it again from that point. The important point
to notice here is that the server also crashes repeatedly after I
restart the load, and the time between two crashes grows shorter each
time. Eventually, it becomes impossible to continue the load.
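(As a rough sketch of the approach, one chunk can be loaded from isql
like this; the file path and graph IRI are placeholders, not our
actual values:)

    -- load one pre-split N3 chunk; flag 255 enables relaxed parsing
    DB.DBA.TTLP_MT (file_to_string_output ('/data/genbank/chunk_0001.n3'),
                    '', 'http://example.org/genbank', 255);
    -- force the committed state to disk so a later crash
    -- cannot roll this chunk back
    checkpoint;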
If you have not already, I suggest you first read this article:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning
We normally split large data files into smaller portions and use a
bulk-loader script:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoader
This script uses multiple threads to read the data into the quad
store. It keeps track of all the fragments it has processed
correctly, and records error information and line numbers for the
failed fragments so they can be fixed and retried.
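A typical bulk-load session looks roughly like the following from
isql (the directory, file mask, and graph IRI are placeholders you
would replace with your own):

    -- register all split files in a directory for loading into one graph
    ld_dir ('/data/genbank/split', '*.n3', 'http://example.org/genbank');
    -- run the loader; starting several of these on separate
    -- connections uses multiple threads/cores
    rdf_loader_run ();
    -- make the loaded data durable
    checkpoint;
    -- list any fragments that failed, with error details
    SELECT ll_file, ll_error FROM DB.DBA.load_list
      WHERE ll_error IS NOT NULL;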
I've tried to do some tuning in virtuoso.ini myself, but I've only
managed to get a minor speed boost.
Should I remove all indexes for the load, or add more?
The server used for the loading has 32 cores, 128 GB of RAM, and a
1.5 TB RAID array.
That is a nice powerful box you have there.
Can you email me privately ([email protected]) your operating
system and version, a current copy of your virtuoso.ini, and the
partition layout of your RAID array, so we can assist you in tuning
your virtuoso.ini to make better use of your resources?
I can provide you with the GenBank dump if you want to play with it.
If so, just tell me and I will give you the link and the corrections
that need to be made before loading it (there were some errors in the
RDFizer that generated the GenBank dump, but they are easily
corrected with a regexp before the load).
I will keep that in mind.
So, do you have an optimized virtuoso.ini you can suggest to help?
I am sure we can assist you. We operate databases like
http://lod.openlinksw.com/ that contain close to 9 billion quads and
counting.
P.S.: We have other large N3 dumps (some more than twice the size of
this one) that we have not managed to load. But I think that if we
can solve the problem with this dump, the solution will also work for
the other dumps.
Patrick