Kunal,
I've downloaded the UniProt files, so I'll try the load on my local box.
Meanwhile, could you please send me the details of your build (OS, compiler,
glibc, compilation flags, and the contents of the Makefile in the topmost
directory of the build)? The error is internal, not an error in the
stored procedure or the input data.
Best Regards,
Ivan Mikhailov,
OpenLink Software.
On Mon, 2008-02-18 at 10:44 -0800, Kunal Patel wrote:
> Hi Ivan,
>
> I tried loading the Uniprot data (RDF file, ~40 GB) into Virtuoso,
> but it failed (I think the server just died). I am attaching the log
> file with this message.
>
> I used the following procedure to load the data,
>
> create table UNIPROT_LOAD_TIME (
> FILE varchar not null, -- Source file name.
> START INTEGER not null, -- Time when the loading is started.
> FINISH INTEGER not null, -- Time when the loading is finished.
> TIME_SPENT INTEGER not null ); -- Total time spent for loading
>
> delete from UNIPROT_LOAD_TIME;
>
> create procedure DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
> {
> declare start, finish, time_spent integer;
>
> start := msec_time();
>
> DB.DBA.RDF_LOAD_RDFXML_MT(file_to_string_output('/uniprot/uniprot.rdf'),
> 'http://uniprot', 'http://uniprot_graph', log_mode, 3);
> finish := msec_time();
> time_spent := finish - start;
> insert into UNIPROT_LOAD_TIME (FILE, START, FINISH, TIME_SPENT)
> values ('uniprot.rdf', start, finish, time_spent);
>
> };
>
> checkpoint;
> checkpoint_interval(6000);
> DB.DBA.UNIPROT_LOAD (0);
> checkpoint;
> checkpoint_interval(60);
>
>
> I also modified virtuoso.ini, setting the following values:
> NumberOfBuffers = 1000000
> MaxDirtyBuffers = 800000
> MaxCheckpointRemap = 1000000
>
> Once again the config of the machine that I used is a 4 CPU Linux box
> with 16 GB RAM.
>
> Let me know if you think I am doing something wrong.
>
> Regards,
> Kunal
>
>
>
> Ivan Mikhailov <[email protected]> wrote:
> Kunal,
>
> No, LUBM_LOAD_LOG2 uses single-threaded parsers in parallel. That works
> well for a large number of files and a large number of CPU cores, because
> it can keep all cores busy without much lock contention. For the UniProt
> case, it's probably enough to
>
> create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
> {
> DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename1'),
> 'http://base_uri_1', 'destination_graph_1', log_mode, 3);
> DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename2'),
> 'http://base_uri_2', 'destination_graph_2', log_mode, 3);
> ...
> DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output('filename9'),
> 'http://base_uri_9', 'destination_graph_9', log_mode, 3);
> }
>
> If you're starting from a blank database that you can drop and re-create
> if an error is signalled, use it this way:
>
> checkpoint;
> checkpoint_interval(6000);
> DB.DBA.UNIPROT_LOAD (0);
> checkpoint;
> checkpoint_interval(60);
>
> If the database already contains important data and there's no way to
> stop it and back it up before the load, then use
>
> checkpoint;
> checkpoint_interval(6000);
> DB.DBA.UNIPROT_LOAD ();
> checkpoint;
> checkpoint_interval(60);
>
>
> Best Regards,
> Ivan Mikhailov,
> OpenLink Software.
>
> On Wed, 2008-02-13 at 15:19 -0800, Kunal Patel wrote:
> > Hi Ivan,
> >
> > Thanks for the detailed response. I downloaded the Uniprot KB from
> > ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/ (I am
> > using all the files except uniparc.rdf.gz and uniref.rdf.gz)
> > The relation between the various files is documented at
> > http://dev.isb-sib.ch/projects/uniprot-rdf/intro.html
> >
> > Again to make sure that I understood you correctly, the best way to
> > load the uniprot data for me would be to create a procedure similar to
> > LUBM_LOAD_LOG2 (say UNIPROT_LOAD_LOG2) and call this procedure as
> > follows,
> >
> > UNIPROT_LOAD_LOG2 (vector ('data-dir'), 3);
> >
> > This will use 3 processing threads per parsing.
> >
> > Regards,
> > Kunal
> >
> > Ivan Mikhailov wrote:
> > Kunal,
> >
> > I've downloaded uniprot_sprot.xml.gz (442729K) and
> > uniprot_trembl.xml.gz (2858M). Both are from
> > ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/, which is
> > unavailable for me ATM.
> > Where can I get the rest? Should those files reside in a single graph and
> > be queried as a single big set of triples, or do they have different
> > meanings and should be queried separately (i.e. the location of a triple
> > is important for what it means, e.g. reviewed data are separated from
> > dirty drafts)? I'm weak in proteins, but I'd like to be ready for more
> > UniProt-related queries because this data set is quite popular.
> >
> > With only 4 CPUs, a single multithreaded parser can be the best choice.
> > Note that the 'number of threads' parameter of
> > DB.DBA.RDF_LOAD_RDFXML_MT() refers to the threads used to process data
> > from the file; an extra thread reads the text and parses it, so for 4
> > CPU cores there's no need for a parameter value greater than 3. Three
> > processing threads per parsing thread is usually a good ratio, because
> > parsing is usually three times faster than the rest of the loading, so
> > the CPU load is well balanced. I'm using 2 x Quad Xeon, so I will choose
> > between 8 single-threaded parsers or 2 parsers with 3 processing threads
> > each. With 4 cores you may simply load file after file with 3 processing
> > threads.
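
A minimal sketch of the thread-count rule of thumb described above, written in Python only because it is a side calculation outside the database. The function names and the "one parser per four cores" grouping are illustrative assumptions derived from the 1:3 parsing-to-processing ratio in the message, not part of any Virtuoso API.

```python
def processing_threads(cpu_cores: int) -> int:
    """Processing threads for one multithreaded parser.

    One core is reserved for the thread that reads and parses the file;
    the message suggests no more than 3 processing threads per parser.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    # At least 1 processing thread, at most 3 (the 1:3 ratio from the message).
    return max(1, min(3, cpu_cores - 1))


def parser_layout(cpu_cores: int) -> tuple:
    """Return (parsers, processing_threads_per_parser).

    Assumption: each multithreaded parser occupies a group of up to
    4 cores (1 parsing thread + 3 processing threads).
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    parsers = max(1, cpu_cores // 4)
    cores_per_parser = cpu_cores // parsers
    return parsers, processing_threads(cores_per_parser)
```

For a 4-core box this gives one parser with 3 processing threads, which is the value passed as the last argument of DB.DBA.RDF_LOAD_RDFXML_MT in the procedures in this thread; for 8 cores it gives the "2 parsers with 3 processing threads each" option.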
> >
> > The most important performance tuning step is to ensure that you have
> > set proper
> >
> > NumberOfBuffers = 1000000
> > MaxDirtyBuffers = 800000
> > MaxCheckpointRemap = 1000000
> >
> > in the [Parameters] section of the Virtuoso configuration file
> > (virtuoso.ini or the like).
> >
> > (Note for other readers: these numbers are reasonable for a 16 GB RAM
> > Linux box; please refer to the User's Guide before tweaking your
> > settings.)
> >
> > You may note that 1 million 8-kilobyte buffers is only 8 GB, leaving
> > almost 8 GB unused. This is done intentionally, because some Linux
> > installations have been seen running out of OS physical memory due to
> > fragmentation when almost all memory is allocated only once and never
> > re-allocated during the run. It seems to be a Linux-specific problem of
> > the memory allocator; at least during long data loads we've seen cases
> > of a stable virtuoso process size, zero activity from other processes,
> > and a decreasing amount of available memory. We have no accurate
> > explanation or workaround for this phenomenon ATM. When there are no
> > massive operations such as loading a huge database, I raise the settings to
> >
> >
> > NumberOfBuffers = 1500000
> > MaxDirtyBuffers = 1200000
> > MaxCheckpointRemap = 1500000
> >
> > and it's still OK. Thus, after loading all the data you may wish to shut
> > down, tweak the settings, and start the server again.
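
The buffer arithmetic above can be checked with a short Python sketch. It assumes the 8 KB buffer size stated in the message; the function names are illustrative and not part of Virtuoso.

```python
PAGE_SIZE_BYTES = 8 * 1024  # 8 KB per buffer, as stated in the message


def buffer_pool_gib(number_of_buffers: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Approximate memory taken by the buffer pool, in GiB."""
    return number_of_buffers * page_size / 2**30


def dirty_ratio(number_of_buffers: int, max_dirty_buffers: int) -> float:
    """Share of the buffer pool that is allowed to be dirty."""
    return max_dirty_buffers / number_of_buffers


# Settings quoted for the bulk load on a 16 GB box
load_pool = buffer_pool_gib(1_000_000)      # about 7.63 GiB
# Settings quoted for normal operation after the load
steady_pool = buffer_pool_gib(1_500_000)    # about 11.44 GiB
```

Both quoted configurations keep MaxDirtyBuffers at 80% of NumberOfBuffers; only the pool size changes between the load and normal operation.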
> >
> > If you have an ext2fs or ext3fs filesystem, it's better to keep enough
> > free space so that the disk is no more than 80% full. When it's almost
> > full, it may allocate the database file badly, resulting in a measurable
> > loss of disk access speed. This is not Virtuoso-specific, but a common
> > hint for all database-like applications with random access to big files.
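
A quick way to check the 80% guideline before starting a long load, sketched in Python with the standard library's shutil.disk_usage. The helper names and the default path are assumptions for illustration; point the check at the directory that holds the database file.

```python
import shutil


def fill_fraction(used_bytes: int, total_bytes: int) -> float:
    """Fraction of the filesystem that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def exceeds_fill_limit(path: str = ".", limit: float = 0.80) -> bool:
    """True if the filesystem holding `path` is fuller than `limit`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return fill_fraction(usage.used, usage.total) > limit
```

Note that the check reflects the current state only; the database file grows during the load, so its expected final size should be added to the used space when judging whether the 80% mark will be crossed.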
> >
> > Best Regards,
> > Ivan Mikhailov,
> > OpenLink Software.
> >
> > On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
> > > Hi Ivan,
> > >
> > > I am working with a 4 CPU machine with 16 GB RAM. The UniProt data
> > > is distributed in 9 RDF files and 1 OWL file.
> > >
> > > The OWL file will act as the rule set for the RDF data. Most of the
> > > RDF files are of reasonable size, except one, which is 41 GB.
> > > Do you have any suggestion on which load method (multithreaded parsers
> > > OR an asynchronous queue of single-threaded parsers) would be best for
> > > this dataset?
> > >
> > > Thanks,
> > > Kunal
> >
> Virtuoso-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/virtuoso-devel