Kunal, I've downloaded uniprot_sprot.xml.gz (442729K) and uniprot_trembl.xml.gz (2858M). Both are from ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/ , which is unavailable for me ATM. Where can I get the rest? Should these files reside in a single graph and be queried as one big set of triples, or do they have different meanings and should be queried separately (i.e., the location of a triple matters for its meaning, e.g., reviewed data are kept apart from dirty drafts)? I'm weak in proteins, but I'd like to be ready for more UniProt-related queries, because this data set is quite popular.
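For instance, if each file were loaded into its own named graph, a query could be scoped to the reviewed (Swiss-Prot) data alone. A minimal sketch as run from Virtuoso's isql; the graph IRI, class, and predicate below are placeholders I made up, not the real UniProt vocabulary:

    SPARQL
    SELECT ?protein ?name
    FROM <http://example.org/graphs/uniprot_sprot>    # placeholder per-file graph
    WHERE
      {
        ?protein a <http://example.org/Protein> ;     # placeholder class IRI
                 <http://example.org/name> ?name .    # placeholder predicate
      }
    LIMIT 10 ;

If everything were loaded into one big graph instead, this reviewed/unreviewed distinction could not be expressed by graph scoping and would have to be encoded some other way, so the answer probably decides how the files should be loaded.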
With only 4 CPUs, a single multithreaded parser may be the best choice. Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML() refers to the threads used to process data from the file; an extra thread reads the text and parses it, so for 4 CPU cores there is no need for a parameter value greater than 3. Three processing threads per parsing thread is usually a good ratio, because parsing is usually about three times faster than the rest of the loading, so the CPU load stays well balanced. I'm using a 2 x Quad Xeon, so I would choose between 8 single-threaded parsers or 2 parsers with 3 processing threads each. With 4 cores you may simply load file after file with 3 processing threads, as sketched in the P.S. below.

The most important performance tuning step is to ensure that you have set proper values in the [Parameters] section of the Virtuoso configuration file (virtuoso.ini or the like):

    [Parameters]
    NumberOfBuffers    = 1000000
    MaxDirtyBuffers    = 800000
    MaxCheckpointRemap = 1000000

(Note for other readers: these numbers are reasonable for a 16 GB RAM Linux box; please refer to the User's Guide before tweaking your own settings.)

You may notice that 1 million 8-kilobyte buffers is only 8 GB, leaving almost 8 GB unused. This is intentional, because some Linux installations have demonstrated running out of physical memory due to fragmentation when almost all memory is allocated once and never re-allocated during the run. It seems to be a Linux-specific problem of the memory allocator; at least, during long data loads we have seen cases of a stable virtuoso process size, zero activity from other processes, and a steadily decreasing amount of available memory. We have no accurate explanation or workaround for this phenomenon ATM. When there are no such massive operations as loading a huge database, I set

    NumberOfBuffers    = 1500000
    MaxDirtyBuffers    = 1200000
    MaxCheckpointRemap = 1500000

and it's still OK. Thus, after loading all the data you may wish to shut down the server, tweak the settings, and start it again.

If you have an ext2fs or ext3fs filesystem, it's better to keep enough free space on the disk so that it is never more than 80% full. When the disk is almost full, the database file may be allocated badly fragmented, resulting in a measurable loss of disk access speed. That is not a Virtuoso-specific fact, but a common hint for all database-like applications with random access to big files.

Best Regards,
Ivan Mikhailov,
OpenLink Software.

On Wed, 2008-02-13 at 10:47 -0800, Kunal Patel wrote:
> Hi Ivan,
>
> I am working with a 4 CPU machine with 16 GB RAM. The UniProt data
> is distributed in 9 RDF files and 1 OWL file.
>
> The OWL file will act as the rule set for the RDF data. Most of the
> RDF files are of reasonable size, except one, which is 41 GB.
> Do you have any suggestion on which load method (multithreaded parsers
> OR an asynchronous queue of single-threaded parsers) would be best for
> this dataset?
>
> Thanks,
> Kunal
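P.S. A minimal isql sketch of "load file after file with 3 processing threads". I am assuming the multithreaded variant DB.DBA.RDF_LOAD_RDFXML_MT here; the file path and graph IRI are placeholders, and the exact argument list (log_enable, threads) may differ between Virtuoso builds, so please verify it against the documentation for your version before use:

    -- load one RDF/XML file; extra threads consume what the parser produces
    DB.DBA.RDF_LOAD_RDFXML_MT (
        file_open ('/data/uniprot/uniprot_sprot.rdf'),  -- placeholder path
        '',                                             -- base IRI
        'http://example.org/graphs/uniprot_sprot',      -- placeholder target graph
        1,                                              -- log_enable mode (see docs)
        3                                               -- processing threads
      );
    checkpoint;  -- flush the finished load to the database file

Repeat the call for each file (one graph per file if you want to keep reviewed and unreviewed data apart), and run a final checkpoint before shutting down to re-tune the buffer settings.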
