Kunal,

msec_time() returns a timer with 1 millisecond resolution, so the following will work:
declare start, finish, time_spent integer;
start := msec_time();
do_something;
finish := msec_time();
time_spent := finish - start;

There's a function now() that returns a millisecond counter as well, but it returns the current transaction timestamp, not an accurate physical time.

Best Regards,
Ivan Mikhailov,
OpenLink Software.

On Thu, 2008-02-14 at 10:46 -0800, Kunal Patel wrote:
> Thanks Ivan,
>
> I also want to collect statistics on how much time is taken in
> loading each file and the overall time spent in loading the whole
> dataset. Is there an easy way to do that?
>
> Regards,
> Kunal
>
> Ivan Mikhailov <[email protected]> wrote:
> Kunal,
>
> No, LUBM_LOAD_LOG2 uses single-threaded parsers in parallel. It's OK for
> a big number of files and a big number of CPU cores because it can load
> all cores without much lock contention. For the UNIPROT case, it's
> probably enough to
>
> create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
> {
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename1'),
>     'http://base_uri_1', 'destination_graph_1', log_mode, 3);
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename2'),
>     'http://base_uri_2', 'destination_graph_2', log_mode, 3);
>   ...
>   DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename9'),
>     'http://base_uri_9', 'destination_graph_9', log_mode, 3);
> }
>
> If you're starting from a blank database and can drop it and re-create
> it in case an error is signalled, use it this way:
>
> checkpoint;
> checkpoint_interval (6000);
> DB.DBA.UNIPROT_LOAD (0);
> checkpoint;
> checkpoint_interval (60);
>
> If the database already contains important data and there's no way to
> stop it and back it up before the load, then use
>
> checkpoint;
> checkpoint_interval (6000);
> DB.DBA.UNIPROT_LOAD ();
> checkpoint;
> checkpoint_interval (60);
>
> Best Regards,
> Ivan Mikhailov,
> OpenLink Software.
>
> On Wed, 2008-02-13 at 15:19 -0800, Kunal Patel wrote:
> > Hi Ivan,
> >
> > Thanks for the detailed response.
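Putting the two answers together, per-file statistics could be collected with a small wrapper around the loader. This is only a sketch in Virtuoso/PL: the LOAD_STAT table, the procedure name UNIPROT_LOAD_TIMED, and the file/graph arguments are made-up placeholders, not part of Virtuoso.

```sql
-- Hypothetical statistics table; name and columns are illustrative.
create table DB.DBA.LOAD_STAT (LS_FILE varchar, LS_MSEC integer);

-- Loads one RDF/XML file with 3 processing threads and records
-- the elapsed wall-clock time for that file, using msec_time().
create function DB.DBA.UNIPROT_LOAD_TIMED (in fname varchar,
  in base_uri varchar, in graph_iri varchar, in log_mode integer := 1)
{
  declare start, finish integer;
  start := msec_time ();
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output (fname),
    base_uri, graph_iri, log_mode, 3);
  finish := msec_time ();
  insert into DB.DBA.LOAD_STAT (LS_FILE, LS_MSEC)
    values (fname, finish - start);
}
```

The overall load time would then be the sum over the recorded rows, e.g. select sum (LS_MSEC) from DB.DBA.LOAD_STAT; after the run.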
> > I downloaded the Uniprot KB from
> > ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/ (I am
> > using all the files except uniparc.rdf.gz and uniref.rdf.gz).
> > The relation between the various files is documented at
> > http://dev.isb-sib.ch/projects/uniprot-rdf/intro.html
> >
> > Again, to make sure that I understood you correctly, the best way to
> > load the uniprot data for me would be to create a procedure similar to
> > LUBM_LOAD_LOG2 (say UNIPROT_LOAD_LOG2) and call this procedure as
> > follows:
> >
> > UNIPROT_LOAD_LOG2 (vector ('data-dir'), 3);
> >
> > This will use 3 processing threads per parsing.
> >
> > Regards,
> > Kunal
> >
> > Ivan Mikhailov wrote:
> > Kunal,
> >
> > I've downloaded uniprot_sprot.xml.gz (442729K) and
> > uniprot_trembl.xml.gz (2858M). Both are from
> > ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/, which is
> > unavailable for me ATM.
> > Where can I get the rest? Should those files reside in a single graph
> > and be queried as a single big set of triples, or do they have different
> > meanings and should be queried separately (i.e. the location of a triple
> > matters for what it means, e.g. reviewed data are separated from dirty
> > drafts)? I'm weak in proteins, but I'd like to be ready for more
> > UniProt-related queries because this data set is quite popular.
> >
> > With only 4 CPUs a single multithreaded parser can be the best choice.
> > Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML_MT ()
> > counts the threads used to process data from the file; an extra thread
> > will read the text and parse it, so for 4 CPU cores there's no need for
> > a parameter value greater than 3. Three processing threads per one
> > parsing thread is usually a good ratio because parsing is usually three
> > times faster than the rest of loading, so CPU load is well balanced.
> > I'm using 2 x Quad Xeon, so I will choose between 8 single-threaded
> > parsers or 2 parsers with 3 processing threads each. With 4 cores you
> > may simply load file after file with 3 processing threads.
> >
> > The most important performance tuning step is to ensure that you have
> > set proper
> >
> > NumberOfBuffers = 1000000
> > MaxDirtyBuffers = 800000
> > MaxCheckpointRemap = 1000000
> >
> > in the [Parameters] section of the Virtuoso configuration file
> > (virtuoso.ini or the like).
> >
> > (Note for other readers: these numbers are reasonable for a 16 GB RAM
> > Linux box; please refer to the User's Guide before tweaking your
> > settings.)
> >
> > You may note that 1 million 8-kilobyte buffers is only 8 GB, leaving
> > almost 8 GB unused. This is done intentionally because some Linux
> > installations demonstrated running out of OS physical memory due to
> > fragmentation if almost all memory is allocated only once and never
> > re-allocated during the run. It seems to be a Linux-specific problem of
> > the memory allocator; at least during long data loading we've seen
> > cases of a stable size of the virtuoso process, zero activity of other
> > processes, and a decreasing amount of available memory. We have no
> > accurate explanation or workaround for this phenomenon ATM. When there
> > are no such massive operations as loading a huge database, I set up to
> >
> > NumberOfBuffers = 1500000
> > MaxDirtyBuffers = 1200000
> > MaxCheckpointRemap = 1500000
> >
> > and it's still OK. Thus after loading all data you may wish to shut
> > down, tweak, and start the server again.
> >
> > If you have an ext2fs or ext3fs filesystem then it's better to have
> > enough free disk space to keep it no more than 80% full. When it's
> > almost full it may allocate the database file badly, resulting in a
> > measurable loss of disk access speed. That is not a Virtuoso-specific
> > fact, but a common hint for all database-like applications with random
> > access to big files.
> >
> > Best Regards,
> > Ivan Mikhailov,
> > OpenLink Software.
> >
> > On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
> > > Hi Ivan,
> > >
> > > I am working with a 4 CPU machine with 16 GB RAM. The UniProt data
> > > is distributed in 9 RDF files and 1 OWL file.
> > >
> > > The OWL file will act as the rule set for the RDF data. Most of the
> > > RDF files are of reasonable size, except one which is 41 GB.
> > > Do you have any suggestion on what load method (multithreaded
> > > parsers OR an asynchronous queue of single-threaded parsers) would
> > > be best for this dataset?
> > >
> > > Thanks,
> > > Kunal
>
> _______________________________________________
> Virtuoso-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/virtuoso-devel
