Re: [Virtuoso-devel] loading uniprot in Virtuoso

Kunal Patel Tue, 11 Mar 2008 16:35:22 -0700

Hi Ivan,
 
   I was able to load the RDF data for UniProt (~600 million triples) into 
Virtuoso.  The load speed I got was consistently around 4500 - 5000 triples per 
second.  I used the function that you suggested below to do the loading.  Could 
you tell me why I am getting such a slow speed compared to the numbers posted at 
http://virtuoso.openlinksw.com/wiki/main/Main/VOSBitmapIndexing ?


  Again, the machine that I used for this test has 4 AMD Opteron processors, 
each at 2 GHz, with 16 GB RAM.  The OS is SUSE Linux Enterprise Server.

Kunal
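
For scale, here is a rough sketch in Python of what the reported rate implies for the whole load. It uses only the figures stated above (~600 million triples, 4500 - 5000 triples per second) and assumes, as a simplification, that the rate stays constant from start to finish.

```python
# Rough estimate only: assumes the reported rate holds for the entire load.
TRIPLES = 600_000_000  # "~600 million triples" from the message


def load_hours(triples: int, triples_per_second: float) -> float:
    """Return the wall-clock hours needed to load `triples` at a constant rate."""
    return triples / triples_per_second / 3600


for rate in (4500, 5000):  # the reported range, in triples per second
    print(f"{rate} triples/s -> {load_hours(TRIPLES, rate):.1f} hours")
```

At these rates the full load takes well over a day, which is why the gap to the published numbers matters here.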

Ivan Mikhailov wrote:

Kunal,

No, LUBM_LOAD_LOG2 uses single-threaded parsers in parallel. That's fine for
a large number of files and a large number of CPU cores, because it can load
all cores without much lock contention. For the UniProt case, it's probably
enough to

create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
{
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename1'),
    'http://base_uri_1', 'destination_graph_1', log_mode, 3);
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename2'),
    'http://base_uri_2', 'destination_graph_2', log_mode, 3);
...
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename9'),
    'http://base_uri_9', 'destination_graph_9', log_mode, 3);
}
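
With nine files, the repeated calls in a function like the one above can be generated rather than typed by hand. The following is a minimal sketch in Python, not part of Virtuoso: the helper names `sql_quote` and `build_uniprot_load` are hypothetical, and the file names, base URIs and graph names are the same placeholders used above. It emits only the two calls that appear in the message, DB.DBA.RDF_LOAD_RDFXML_MT and file_to_string_output.

```python
def sql_quote(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quote."""
    return "'" + text.replace("'", "''") + "'"


def build_uniprot_load(files, threads: int = 3) -> str:
    """Return the source text of a UNIPROT_LOAD function for `files`.

    `files` is a list of (path, base_uri, graph) tuples (placeholders here).
    """
    calls = [
        f"  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output({sql_quote(path)}),\n"
        f"    {sql_quote(base_uri)}, {sql_quote(graph)}, log_mode, {threads});"
        for path, base_uri, graph in files
    ]
    header = "create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)"
    return "\n".join([header, "{", *calls, "}"])


# Placeholder entries; substitute the real file paths, base URIs and graphs.
files = [
    ("filename1", "http://base_uri_1", "destination_graph_1"),
    ("filename2", "http://base_uri_2", "destination_graph_2"),
]
print(build_uniprot_load(files))
```

The printed text can then be pasted into the SQL client to define the function.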

If you're starting from a blank database and you can drop and re-create it
if an error is signalled, use it this way:

checkpoint;
checkpoint_interval(6000);
DB.DBA.UNIPROT_LOAD (0);
checkpoint;
checkpoint_interval(60);

If the database already contains important data and there's no way to
stop it and back it up before the load, then use

checkpoint;
checkpoint_interval(6000);
DB.DBA.UNIPROT_LOAD ();
checkpoint;
checkpoint_interval(60);


Best Regards,
Ivan Mikhailov,
OpenLink Software.

On Wed, 2008-02-13 at 15:19 -0800, Kunal Patel wrote:
> Hi Ivan,
> 
>   Thanks for the detailed response.  I downloaded the Uniprot KB from
> ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/ (I am
> using all the files except uniparc.rdf.gz and uniref.rdf.gz) 
>   The relation between the various files is documented at
> http://dev.isb-sib.ch/projects/uniprot-rdf/intro.html
> 
>   Again to make sure that I understood you correctly, the best way to
> load the uniprot data for me would be to create a procedure similar to
> LUBM_LOAD_LOG2 (say UNIPROT_LOAD_LOG2) and call this procedure as
> follows,
> 
>       UNIPROT_LOAD_LOG2 (vector ('data-dir'), 3);
> 
>   This will use 3 processing threads per parsing thread.
> 
> Regards,
> Kunal
> 
> Ivan Mikhailov  wrote:
>         Kunal,
>         
>         I've downloaded uniprot_sprot.xml.gz (442729K) and
>         uniprot_trembl.xml.gz (2858M). Both are from
>         ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/, which
>         is unavailable for me ATM. Where can I get the rest? Should
>         those files reside in a single graph and be queried as a single
>         big set of triples, or do they have different meanings and
>         should be queried separately (i.e. the location of a triple
>         matters for what it means, e.g. reviewed data are separated
>         from dirty drafts)? I'm weak in proteins, but I'd like to be
>         ready for more UniProt-related queries because this data set is
>         quite popular.
>         
>         With only 4 CPUs, a single multithreaded parser can be the best
>         choice. Note that the 'number of threads' parameter of
>         DB.DBA.RDF_LOAD_RDFXML_MT() refers to the threads used to
>         process data from the file; an extra thread reads the text and
>         parses it, so for 4 CPU cores there's no need for a parameter
>         value greater than 3. Three processing threads per parsing
>         thread is usually a good ratio, because parsing is usually
>         three times faster than the rest of loading, so the CPU load is
>         well balanced. I'm using 2 x Quad Xeon, so I will choose
>         between 8 single-threaded parsers and 2 parsers with 3
>         processing threads each. With 4 cores you may simply load file
>         after file with 3 processing threads.
>         
>         The most important performance tuning step is to ensure that
>         you have set proper
>         
>         NumberOfBuffers = 1000000
>         MaxDirtyBuffers = 800000
>         MaxCheckpointRemap = 1000000
>         
>         in the [Parameters] section of the Virtuoso configuration file
>         (virtuoso.ini or the like).
>         
>         (Note for other readers: these numbers are reasonable for a
>         16 GB RAM Linux box; please refer to the User's Guide before
>         tweaking your settings.)
>         
>         You may note that 1 million 8-kilobyte buffers is only 8 GB,
>         leaving almost 8 GB unused. This is intentional, because some
>         Linux installations have run out of OS physical memory due to
>         fragmentation when almost all memory is allocated once and
>         never re-allocated during the run. It seems to be a
>         Linux-specific problem of the memory allocator; at least
>         during long data loads we've seen cases of a stable size of
>         the virtuoso process, zero activity of other processes and a
>         decreasing amount of available memory. We have no accurate
>         explanation or workaround for this phenomenon ATM. When there
>         are no such massive operations as loading a huge database, I
>         set up to
>         
>         NumberOfBuffers = 1500000
>         MaxDirtyBuffers = 1200000
>         MaxCheckpointRemap = 1500000
>         
>         and it's still OK. Thus, after loading all the data, you may
>         wish to shut down, tweak the settings and start the server
>         again.
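
The buffer arithmetic in the quoted message can be checked with a short Python sketch. It assumes only what the message states: each buffer is 8 kilobytes, and the two NumberOfBuffers / MaxDirtyBuffers pairs are the ones quoted. It counts buffer pages only and ignores any other memory the server process uses.

```python
BUFFER_SIZE_BYTES = 8 * 1024  # 8-kilobyte buffers, as stated in the message


def buffer_pool_gib(number_of_buffers: int) -> float:
    """Return the memory taken by the buffer pages alone, in GiB."""
    return number_of_buffers * BUFFER_SIZE_BYTES / 2**30


# (label, NumberOfBuffers, MaxDirtyBuffers) as quoted in the message
settings = [
    ("during bulk load", 1_000_000, 800_000),
    ("after loading", 1_500_000, 1_200_000),
]

for label, buffers, dirty in settings:
    print(
        f"{label}: {buffer_pool_gib(buffers):.1f} GiB of buffers, "
        f"dirty limit at {dirty / buffers:.0%} of the pool"
    )
```

Both settings keep MaxDirtyBuffers at 80% of NumberOfBuffers; only the pool size changes.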
>         
>         If you have an ext2fs or ext3fs filesystem, it's better to
>         keep enough free space on the disk that it is never more than
>         80% full. When it's almost full, it may allocate the database
>         file badly, resulting in a measurable loss of disk access
>         speed. This is not a Virtuoso-specific fact, but a common hint
>         for all database-like applications with random access to big
>         files.
>         
>         Best Regards,
>         Ivan Mikhailov,
>         OpenLink Software.
>         
>         On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
>         > Hi Ivan,
>         > 
>         > I am working with a 4 CPU machine with 16 GB RAM. The UniProt
>         > data is distributed in 9 RDF files and 1 OWL file.
>         >
>         > The OWL file will act as the rule set for the RDF data. Most
>         > of the RDF files are of reasonable size, except one which is
>         > of size 41 GB. Do you have any suggestion on what load method
>         > (multithreaded parsers OR asynchronous queue of
>         > single-threaded parsers) would be best for this dataset?
>         > 
>         > Thanks,
>         > Kunal