Kunal,

Can I look at your virtuoso.ini file and the output of the 'dmesg' command? (If the output of dmesg is long then it might be better to mail it only to me, not to the mailing list.)
Best Regards,
Ivan Mikhailov,
OpenLink Software.

On Tue, 2008-03-11 at 16:35 -0700, Kunal Patel wrote:
> Hi Ivan,
>
> I was able to load the RDF data for UniProt (~600 million triples) in
> Virtuoso. The load speed I got was consistently around 4500 - 5000
> triples per second. I used the function that you suggested below to do
> the loading. Can you tell me why I am getting such a slow speed
> compared to the numbers posted at
> http://virtuoso.openlinksw.com/wiki/main/Main/VOSBitmapIndexing ?
>
> Again, the machine that I used for this test has 4 AMD Opteron
> processors, each at 2 GHz, with 16 GB RAM. The OS is SUSE Linux
> Enterprise Server.
>
> Kunal
>
> Ivan Mikhailov <[email protected]> wrote:
> Kunal,
>
> No, LUBM_LOAD_LOG2 uses single-threaded parsers in parallel. It's OK
> for a big number of files and a big number of CPU cores because it can
> load all cores without much lock contention. For the UniProt case, it's
> probably enough to
>
> create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
> {
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename1'),
>       'http://base_uri_1', 'destination_graph_1', log_mode, 3);
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename2'),
>       'http://base_uri_2', 'destination_graph_2', log_mode, 3);
>   ...
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename9'),
>       'http://base_uri_9', 'destination_graph_9', log_mode, 3);
> }
>
> If you're starting from a blank database and you can drop it and
> re-create it in case an error is signalled, use it this way:
>
> checkpoint;
> checkpoint_interval (6000);
> DB.DBA.UNIPROT_LOAD (0);
> checkpoint;
> checkpoint_interval (60);
>
> If the database already contains important data and there's no way to
> stop it and back it up before the load, then use:
>
> checkpoint;
> checkpoint_interval (6000);
> DB.DBA.UNIPROT_LOAD ();
> checkpoint;
> checkpoint_interval (60);
>
> Best Regards,
> Ivan Mikhailov,
> OpenLink Software.
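A side note for readers following the thread: the nine near-identical calls above can also be driven from a vector with a small Virtuoso/PL loop, so the file list is kept in one place. The sketch below is untested and uses the same placeholder filenames, base URIs and graph IRIs as the original procedure; it only illustrates the same pattern, it is not a recipe from the thread.

create procedure DB.DBA.UNIPROT_LOAD_LIST (in files any, in log_mode integer := 1)
{
  declare i integer;
  declare f any;
  i := 0;
  while (i < length (files))
    {
      -- each element of files is expected to be vector (filename, base_uri, destination_graph)
      f := aref (files, i);
      DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output (aref (f, 0)),
          aref (f, 1), aref (f, 2), log_mode, 3);
      i := i + 1;
    }
}

It would be called in place of DB.DBA.UNIPROT_LOAD, with the same checkpoint and
checkpoint_interval wrapping as above, e.g.
DB.DBA.UNIPROT_LOAD_LIST (vector (vector ('filename1', 'http://base_uri_1', 'destination_graph_1'), ...), 0);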
> On Wed, 2008-02-13 at 15:19 -0800, Kunal Patel wrote:
> > Hi Ivan,
> >
> > Thanks for the detailed response. I downloaded the UniProt KB from
> > ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/ (I am
> > using all the files except uniparc.rdf.gz and uniref.rdf.gz).
> > The relation between the various files is documented at
> > http://dev.isb-sib.ch/projects/uniprot-rdf/intro.html
> >
> > Again, to make sure that I understood you correctly, the best way for
> > me to load the UniProt data would be to create a procedure similar to
> > LUBM_LOAD_LOG2 (say UNIPROT_LOAD_LOG2) and call this procedure as
> > follows:
> >
> > UNIPROT_LOAD_LOG2 (vector ('data-dir'), 3);
> >
> > This will use 3 processing threads per parser.
> >
> > Regards,
> > Kunal
> >
> > Ivan Mikhailov wrote:
> > Kunal,
> >
> > I've downloaded uniprot_sprot.xml.gz (442729K) and
> > uniprot_trembl.xml.gz (2858M). Both are from
> > ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/ , which is
> > unavailable for me ATM.
> > Where can I get the rest? Should those files reside in a single graph
> > and be queried as a single big set of triples, or do they have
> > different meanings and should be queried separately (i.e. the location
> > of a triple matters for what it means, e.g. reviewed data are
> > separated from dirty drafts)? I'm weak in proteins, but I'd like to be
> > ready for more UniProt-related queries because this data set is quite
> > popular.
> >
> > With only 4 CPUs a single multithreaded parser can be the best choice.
> > Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML()
> > refers to the threads used to process data from the file; an extra
> > thread will read the text and parse it, so for 4 CPU cores there's no
> > need for a parameter value greater than 3. Three processing threads
> > per one parsing thread is usually a good ratio because parsing is
> > usually three times faster than the rest of the loading, so CPU load
> > is well balanced. I'm using a 2 x Quad Xeon box, so I would choose
> > between 8 single-threaded parsers or 2 parsers with 3 processing
> > threads each. With 4 cores you may simply load file after file with 3
> > processing threads.
> >
> > The most important performance tuning step is to ensure that you have
> > set proper
> >
> > NumberOfBuffers = 1000000
> > MaxDirtyBuffers = 800000
> > MaxCheckpointRemap = 1000000
> >
> > in the [Parameters] section of the Virtuoso configuration file
> > (virtuoso.ini or the like).
> >
> > (Note for other readers: these numbers are reasonable for a 16 GB RAM
> > Linux box; please refer to the User's Guide before tweaking your
> > settings.)
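For concreteness, in virtuoso.ini those three settings sit in the existing [Parameters] section and would look roughly like the excerpt below; everything else in that section is left as shipped:

[Parameters]
; ... other settings unchanged ...
NumberOfBuffers    = 1000000
MaxDirtyBuffers    = 800000
MaxCheckpointRemap = 1000000
; ... other settings unchanged ...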
> > You may note that 1 million 8-kilobyte buffers is only 8 GB, leaving
> > almost 8 GB unused. This is done intentionally because some Linux
> > installations demonstrated running out of OS physical memory due to
> > fragmentation if almost all memory is allocated only once and never
> > re-allocated during the run. It seems to be a Linux-specific problem
> > of the memory allocator; at least during long data loading we've seen
> > cases of a stable size of the virtuoso process, zero activity of other
> > processes and a decreasing amount of available memory. We have no
> > accurate explanation or workaround for this phenomenon ATM. When there
> > are no such massive operations as loading a huge database, I set
> >
> > NumberOfBuffers = 1500000
> > MaxDirtyBuffers = 1200000
> > MaxCheckpointRemap = 1500000
> >
> > and it's still OK. Thus after loading all the data you may wish to
> > shut down, tweak and start the server again.
> >
> > If you have an ext2fs or ext3fs filesystem then it's better to have
> > enough free space on disk to not make it more than 80% full. When it's
> > almost full it may allocate the database file badly, resulting in a
> > measurable loss of disk access speed. That is not a Virtuoso-specific
> > fact, but a common hint for all database-like applications with random
> > access to big files.
> >
> > Best Regards,
> > Ivan Mikhailov,
> > OpenLink Software.
> >
> > On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
> > > Hi Ivan,
> > >
> > > I am working with a 4 CPU machine with 16 GB RAM. The UniProt data
> > > is distributed in 9 RDF files and 1 OWL file.
> > >
> > > The OWL file will act as the rule set for the RDF data. Most of the
> > > RDF files are of reasonable size, except one which is 41 GB.
> > > Do you have any suggestion on which load method (multithreaded
> > > parsers OR an asynchronous queue of single-threaded parsers) would
> > > be best for this dataset?
> > >
> > > Thanks,
> > > Kunal
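One more note for other readers, on the OWL file mentioned above: the usual Virtuoso pattern is to load the ontology into a graph of its own (e.g. with DB.DBA.RDF_LOAD_RDFXML_MT as shown earlier) and then declare that graph as an inference rule set that SPARQL queries can opt into. Availability and details depend on the server version, so treat the lines below as a sketch with placeholder names rather than a recipe from this thread:

-- 'uniprot_rules' and the graph IRI are placeholders
rdfs_rule_set ('uniprot_rules', 'destination_graph_for_the_owl_file');

-- a query can then enable inference with: DEFINE input:inference 'uniprot_rules'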
