Thanks! If you implement something with compression, tell me; I would be happy to test it, because it is a problem for me :)
Idea:
- It could be a rule to compress every literal (but indexing would need to be done before it is compressed).
- It could be a compression based on a selection of predicates. For example, I want every literal that has the predicate ncbi:sequence to be compressed.

Bye!!
Marc-Alexandre

2009/10/1 Ivan Mikhailov <imikhai...@openlinksw.com>:
>> If I understand correctly, B3S is the Billion Triple Challenge
>> dataset, correct? Located here: http://challenge.semanticweb.org/ . It
>> is stated that it is 1.14 billion statements.
>
> Yes, B3S was 1.14G, but we have some additional data (still much less
> than your 10G).
>
>> The current number of statements for the GenBank N3 dump is
>> 6,561,103,030 triples; for RefSeq it is 3,299,862,816 triples.
>>
>> So it is between 3 and 6 times bigger. But the most problematic part
>> is the size of some literals, because a single literal can be a
>> complete genome sequence (hundreds of kilobytes to a megabyte).
>
> I agree; B3S and the other LOD resources have only relatively short
> literals.
>
>> The compression ratio of the N3 files with gzip is 1:10. The current
>> size of virtuoso.db looks like the UNcompressed size of what I have
>> managed to load into it so far, so I am worried I will need 1
>> terabyte of free space to build this triple store. The point of
>> compression is to be able to store gigabytes of data in a smaller
>> space that won't be used for indexing, searching, or anything except
>> fetching the literals when needed. I'm using Virtuoso 6 TP2.
>
> Virtuoso 6 compacts pages to as little as 50% of their original size.
> But you usually have _two_ copies of each page in the database: one in
> its normal place and one remapped. That is used to recover the
> database after a crash: if a page has changed since the last
> checkpoint, the database contains the state of the page at checkpoint
> time, and the log replay restores all changes made to it after the
> checkpoint.
>
> For hundreds of kilobytes per object, gz compression would be nice, of
> course, but there is no such built-in feature in the RDF storage (even
> though there are ready-to-use gz_compress and gz_uncompress functions
> in Virtuoso/PL). I'll think about it.
>
>> I've set my logging to /dev/null because the server didn't want to
>> start without something on the TransactionFile line.
>
> Strange; I'll check that, but not right now, I'm afraid.
>
>> As for striping, I'm currently loading onto a disk array of 5
>> physical hard drives presented as a single logical disk, so I don't
>> think I will gain anything from striping. From the Virtuoso web site:
>> "Striping only has a potential performance benefit when striped
>> across multiple devices". Do you think I should try to stripe anyway?
>
> No, you should have one IO queue per independently controlled hard
> drive. With the 5 physical hard drives mounted individually, you would
> be able to stripe across 5 and thus get a bit better performance than
> the OS can provide, but the difference is small enough if you have
> plenty of RAM and the OS disk buffers are hundreds of megabytes to
> gigabytes. So just keep running. The FS type plays some role (xfs is
> preferable), but in most cases you may ignore the difference,
> especially if the disks are far from full.
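[Editor's sketch] The predicate-selective compression idea proposed at the top of the thread can be sketched outside Virtuoso. This is a minimal Python illustration, not Virtuoso/PL; the predicate set, helper names, and the sample sequence are all hypothetical, but the shape matches the proposal: gzip only those literals whose predicate is on a chosen list (e.g. ncbi:sequence), and store everything else as-is.

```python
import gzip

# Assumption: the user configures which predicates get compressed literals.
COMPRESS_PREDICATES = {"ncbi:sequence"}

def store_literal(predicate, literal):
    """Return (is_compressed, payload) for one triple's object value."""
    if predicate in COMPRESS_PREDICATES:
        return True, gzip.compress(literal.encode("utf-8"))
    return False, literal.encode("utf-8")

def fetch_literal(is_compressed, payload):
    """Inverse of store_literal: recover the original literal string."""
    data = gzip.decompress(payload) if is_compressed else payload
    return data.decode("utf-8")

# A long, repetitive genome-like literal compresses very well with gzip,
# while a short title on another predicate is left untouched.
sequence = "ACGT" * 50000  # ~200 kB literal, standing in for a genome sequence
flag, blob = store_literal("ncbi:sequence", sequence)
print(flag, len(sequence), len(blob))
print(fetch_literal(flag, blob) == sequence)
```

The key property is that indexing and search never see the compressed form: only the fetch path pays the gz_uncompress cost, which is what the original message asks for.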