Kunal,

I've downloaded uniprot_sprot.xml.gz (442729K) and
uniprot_trembl.xml.gz (2858M). Both are from
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/ , which is
unavailable for me at the moment.
Where can I get the rest? Should those files reside in a single graph
and be queried as one big set of triples, or do they carry different
meanings and should be queried separately (i.e., the graph a triple is
in matters for its meaning, e.g. reviewed data kept apart from dirty
drafts)? I'm weak on proteins, but I'd like to be ready for more
UniProt-related queries, because this data set is quite popular.

With only 4 CPUs, a single multithreaded parser may be the best choice.
Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML()
refers to the threads used to process data from the file; an extra
thread reads and parses the text, so for 4 CPU cores there is no need
for a parameter value greater than 3. Three processing threads per
parsing thread is usually a good ratio, because parsing is usually about
three times faster than the rest of the loading, so the CPU load is
well balanced. I'm using a 2 x quad-core Xeon, so I will choose between
8 single-threaded parsers or 2 parsers with 3 processing threads each.
With 4 cores you may simply load file after file with 3 processing
threads.
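As a sketch, the file-after-file approach could look like the following
from isql. This is only an illustration: the file path and graph IRI are
placeholders, the files must be decompressed first, and the exact
signature of the multithreaded loader (here the _MT variant) may differ
between Virtuoso versions, so check your server's documentation:

```sql
-- Hypothetical example: load one UniProt RDF/XML file with 3 processing
-- threads, per the 4-core advice above.
DB.DBA.RDF_LOAD_RDFXML_MT (
  file_to_string_output ('/data/uniprot/uniprot_sprot.rdf'), -- illustrative path
  '',                            -- base IRI
  'http://example.org/uniprot',  -- target graph IRI (illustrative)
  0,                             -- log mode: 0 disables row-level logging for bulk load
  3                              -- the 'number of threads' parameter discussed above
);
checkpoint;
```

Repeat the call for each file, then run a final checkpoint so the loaded
data is committed to the database file.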

The most important performance tuning step is to make sure you have set
proper

NumberOfBuffers = 1000000
MaxDirtyBuffers = 800000
MaxCheckpointRemap = 1000000

in the [Parameters] section of the Virtuoso configuration file
(virtuoso.ini or the like).

(Note for other readers: these numbers are reasonable for a 16 GB RAM
Linux box; please refer to the User's Guide before tweaking your
settings.)

You may note that 1 million 8-kilobyte buffers is only about 8 GB,
leaving almost 8 GB unused. This is intentional, because some Linux
installations have run out of physical memory due to fragmentation when
almost all memory is allocated once and never re-allocated during the
run. It seems to be a Linux-specific problem of the memory allocator;
at least, during long data loads we have seen cases of a stable
virtuoso process size, zero activity from other processes, and a
decreasing amount of available memory. We have no accurate explanation
or workaround for this phenomenon at the moment. When there are no such
massive operations as loading a huge database, I set

NumberOfBuffers = 1500000
MaxDirtyBuffers = 1200000
MaxCheckpointRemap = 1500000

and it's still OK. Thus, after loading all the data you may wish to
shut down, tweak the settings, and start the server again.
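The arithmetic behind these numbers is easy to check (each Virtuoso
buffer holds one 8 KB database page; the two settings are the ones
quoted above):

```python
# Each Virtuoso buffer caches one 8 KB database page.
PAGE_SIZE_KB = 8

def buffers_to_gb(number_of_buffers):
    """Approximate RAM footprint of the buffer pool, in GB."""
    return number_of_buffers * PAGE_SIZE_KB / (1024 * 1024)

# Loading-time setting: NumberOfBuffers = 1000000
print(round(buffers_to_gb(1_000_000), 1))  # ~7.6 GB, leaving headroom on a 16 GB box

# Post-loading setting: NumberOfBuffers = 1500000
print(round(buffers_to_gb(1_500_000), 1))  # ~11.4 GB
```

So the post-loading setting still leaves a few gigabytes for the OS and
other processes on a 16 GB machine.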

If you have an ext2 or ext3 filesystem, it's better to have enough free
disk space that the partition stays under 80% full. When it's almost
full, the database file may be allocated badly, resulting in a
measurable loss of disk access speed. That is not Virtuoso-specific,
but a common hint for all database-like applications with random access
to big files.
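A quick way to check the fill level from the shell (the path "." is a
placeholder; point it at the directory holding your database file):

```shell
# Report how full the filesystem holding the database is.
usage=$(df -P . | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
echo "filesystem is ${usage}% full"
if [ "$usage" -gt 80 ]; then
    echo "warning: over 80% full - the database file may be allocated badly"
fi
```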

Best Regards,
Ivan Mikhailov,
OpenLink Software.

On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
> Hi Ivan,
> 
>   I am working with a 4 CPU machine with 16 GB RAM.  The UniProt data
> is distributed in 9 RDF files and 1 OWL file.  
> 
>   The OWL file will act as the rule set for the RDF data.  Most of the
> RDF files are of reasonable size, except one which is of size 41 GB.
> Do you have any suggestion on what load method (multithreaded parsers
> OR an asynchronous queue of single-threaded parsers) would be best for
> this dataset?
> 
> Thanks,
> Kunal


