On 2/15/11 12:19 PM, Marc-Alexandre Nolin wrote:
Hi,

To push the loading capacity of the open-source Virtuoso, I use 2
things at Bio2RDF.

1) A good-sized server (24 cores, 128 GB RAM). I can't do much for you
here: the more RAM the better, and the more cores, the more parallel
the loading... "up to a certain point with the free version"

2) Exploit the multithreading capability of Virtuoso. It's a combination
of scripts and of how you prepare your data to be loaded.

I can't push it as far as the full commercial version of Virtuoso;
I've seen the full version and it has a bigger capacity. However, if
you are in a situation where the full commercial version is not an
option, here is how I load very large datasets on a single Virtuoso
instance.

- Here are the 2 scripts you will need.
-->  http://quebec.download.bio2rdf.org/script/parargs.py A parallel
version of xargs on Linux
-->  http://quebec.download.bio2rdf.org/script/n32virtuoso-9.pl A simple
Perl script that calls ttlp_mt through an ISQL command
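
For the curious, the heart of such a script is just one ISQL call per
file. Something along these lines (a sketch only: host, port, credentials
and graph URI are placeholders, not the real values baked into
n32virtuoso-9.pl, and the file's directory must be listed under
DirsAllowed in virtuoso.ini):

   isql localhost:1111 dba dba exec="DB.DBA.TTLP_MT (file_to_string_output ('/data/splited_pubmed_aaa.n3'), '', 'http://bio2rdf.org/pubmed');"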

- Either find a way to split your huge RDF/XML file into a lot of
self-contained files, OR change the file format to RDF/N-Triples ... not
RDF/N3. The difference between N-Triples and N3 is that you can split an
N-Triples file wherever you want and it won't matter, since there are no
@prefix lines.
-->  If your files are in RDF/N-Triples and you are on Linux, you can
split the huge compressed file (change zcat to cat if it is
uncompressed), rename the split files and recompress everything like
this:

[nolmar01@ls28 pubmed]$ zcat pubmed-00-18.n3.gz | split -l 1000000 -a 3 - splited_pubmed
[nolmar01@ls28 pubmed]$ ls splited_pubmed* | xargs -n1 -I {} mv {} {}.n3
[nolmar01@ls28 pubmed]$ ls *.n3 | /opt/Bio2rdf/perl/parargs.py -n24 gzip
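
If you want to sanity-check the split before loading, comparing line
counts is cheap (same file names as above; N-Triples puts one triple per
line, so the two totals should match):

[nolmar01@ls28 pubmed]$ zcat pubmed-00-18.n3.gz | wc -l
[nolmar01@ls28 pubmed]$ zcat splited_pubmed*.n3.gz | wc -l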

- Your Virtuoso server needs as much RAM as possible. I increase these
parameters (the lines starting with ';' are the previous values,
commented out):
NumberOfBuffers                 = 5000000
;NumberOfBuffers                = 2000
MaxDirtyBuffers                 = 120000
;MaxDirtyBuffers                = 1200
MaxCheckpointRemap              = 200000
;MaxCheckpointRemap             = 2000

I'm not a Virtuoso specialist, but I've seen increased performance with
bigger numbers there. Trial and error with your own server :)
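
A quick back-of-the-envelope check when picking NumberOfBuffers: each
buffer holds one 8 KB database page, so 5000000 buffers pin roughly
5000000 x 8 KB = ~40 GB of RAM, which sits comfortably in a 128 GB
machine. The performance tuning page Hugh links below lists recommended
values per RAM size.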

- Then, depending on the number of cores you have available, start X
copies of n32virtuoso-9.pl with the list of split files created above
as input. Use the Python script parargs.py to start them like this:

ls /bio2rdf2/data/entrezgene/All_Protozoa/split.All_Protozoa.* | \
  /opt/Bio2rdf/perl/parargs.py -n20 perl /opt/Bio2rdf/perl/n32virtuoso-9.pl \
  geneid 2222 dbabio2rdf 689 > All_Protozoa.log 2>&1 &

With that command, I've seen the virtuoso process keep as many as 18
cores completely busy on the server.

In this example, I create the list of my split files in the first part
of the command. Then I pipe the list into parargs and tell it to start
20 instances of the process that follows, which is n32virtuoso-9.pl.
The parameters are explained in the n32virtuoso-9.pl file itself. Then
I send the output to a log file and redirect stderr into stdout.
parargs manages which file from the list is sent to n32virtuoso-9.pl
and will never send the same file twice. As the processes run, you can
see how many of the files have been completed by cat-ing the done_files
file in the directory from which you started n32virtuoso-9.pl.
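
If you don't have parargs.py handy, GNU xargs can do the same parallel
dispatch with its -P flag (this assumes, as above, that n32virtuoso-9.pl
takes the file name as its last argument; you do lose the done_files
bookkeeping):

ls /bio2rdf2/data/entrezgene/All_Protozoa/split.All_Protozoa.* | \
  xargs -P20 -n1 perl /opt/Bio2rdf/perl/n32virtuoso-9.pl \
  geneid 2222 dbabio2rdf 689 > All_Protozoa.log 2>&1 &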

With all that said, know that unless you have a big server and you keep
the large amount of memory you used to load your files, your Virtuoso
server will not be very responsive. For commercial use, think about
getting the commercial version. For example, look at
http://refseq.bio2rdf.org/sparql which contains 3.5 billion triples.
It works, but don't ask it queries much bigger than a DESCRIBE :)

By commercial, Marc means: Virtuoso Cluster Edition :-)


Kingsley
Bye !!

Marc-Alexandre

2011/2/15 Hugh Williams <hwilli...@openlinksw.com>:
Pierre,

As it is a very large dataset, have you tuned your Virtuoso server for 
running on the target OS as detailed at:

        http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html

You can also use the bulk loader scripts we use for loading large datasets, 
if you are not already doing so:

        http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoader
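
In outline, those loader scripts register the files once with ld_dir()
and then run one rdf_loader_run() per core you want to dedicate, from
separate ISQL sessions, followed by a checkpoint. A minimal sketch
(paths, graph URI and credentials are placeholders, and the data
directory must appear under DirsAllowed in virtuoso.ini):

        isql localhost:1111 dba dba exec="ld_dir ('/data/split', '*.n3.gz', 'http://example.org/uniprot');"
        isql localhost:1111 dba dba exec="rdf_loader_run ();" &
        isql localhost:1111 dba dba exec="rdf_loader_run ();" &
        wait
        isql localhost:1111 dba dba exec="checkpoint;"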

Note that quite some time ago we loaded other related Bio2RDF and 
Neurocommons datasets into an Amazon Virtuoso EC2 AMI, enabling the 
instantiation and loading of these datasets in a fraction of the time it 
would take to load them from scratch:

        
        http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtInstallationEC2

Best Regards
Hugh Williams
Professional Services
OpenLink Software
Web: http://www.openlinksw.com
Support: http://support.openlinksw.com
Forums: http://boards.openlinksw.com/support
Twitter: http://twitter.com/OpenLink

On 15 Feb 2011, at 15:54, Pierre-Yves Chibon wrote:

Hi,

I have been trying to load the UniProt RDF file into Virtuoso. UniProt
provides a rather big file [1], which uncompressed is ~133 GB.

I have been trying to load it into Virtuoso (6.1.2), but it seems that
Virtuoso's performance drops after a while and it eventually hangs.
We tried to load it using the method in:
http://docs.openlinksw.com/virtuoso/fn_rdf_load_rdfxml_mt.html

I am now wondering whether this is the best approach. I think Bio2RDF
uses a Perl script to load the data; would that be the preferred
approach? (Does anyone have such a script for RDF files?)

Thanks in advance for your help,
Pierre


[1]
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprot.rdf.gz




--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen





