Thanks,

If you implement something with compression, let me know; I would be
happy to test it, since it is a real problem for me :)

Ideas:
- It could be a rule that compresses every literal (but indexing would
need to be done before the literal is compressed).
- It could be compression based on a selection of predicates. For
example, I would want every literal that has the predicate
ncbi:sequence to be compressed (rough sketch below).
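
For the second idea, something like this untested sketch is what I
have in mind. The procedure name and the full ncbi:sequence IRI are
made up; gz_compress and DB.DBA.RDF_QUAD_URI_L are existing Virtuoso
built-ins, but please double-check their signatures:

    -- Hypothetical pre-insert filter: compress only the literals of a
    -- chosen predicate before they reach the quad store.
    create procedure INS_MAYBE_COMPRESSED (
      in g varchar, in s varchar, in p varchar, in o any)
    {
      -- placeholder IRI; substitute the real ncbi:sequence one
      if (p = 'http://www.ncbi.nlm.nih.gov/ncbi#sequence')
        o := gz_compress (o);
      DB.DBA.RDF_QUAD_URI_L (g, s, p, o);
    };

The query side would then need to call gz_uncompress on such objects
when fetching them, which fits my use case since these literals are
only ever fetched, never searched.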

Bye !!

Marc-Alexandre

2009/10/1 Ivan Mikhailov <imikhai...@openlinksw.com>:
>> If I understand correctly, B3S is the Billion Triple Challenge
>> dataset, correct? Located here: http://challenge.semanticweb.org/ .
>> It is stated to be 1.14 billion statements.
>
> Yes, B3S was 1.14G triples, but we have some additional data (still
> much less than your 10G).
>
>> The current number of statements in the GenBank N3 dump is
>> 6,561,103,030 triples; for RefSeq it is 3,299,862,816 triples.
>>
>> So it is between 3 and 6 times bigger. But the most problematic part
>> is the size of some literals: a single literal can hold a complete
>> genome sequence (hundreds of kilobytes to a megabyte).
>
> I agree, B3S and other LOD resources have only relatively short
> literals.
>
>> The compression ratio of the N3 files with gzip is 1:10. The current
>> size of virtuoso.db looks like the UNcompressed size of what I have
>> managed to load so far, so I worry that I will need 1 terabyte of
>> free space to build this triple store. The point of compression is
>> to store gigabytes of data in a smaller space; that data is never
>> used for indexing or searching, only fetched when needed. I'm using
>> Virtuoso 6 TP2.
>
> Virtuoso 6 compacts pages down to about 50% of their original size.
> But you usually have _two_ copies of each page in the database: one in
> its normal place and one remapped. That is used to recover the
> database after a crash: if a page has changed since the last
> checkpoint, the database contains the state of the page at checkpoint
> time, and the log replay restores all changes made to it after the
> checkpoint.
>
> For hundreds of kilobytes per object, gz compression would be nice, of
> course, but there is no such built-in feature in the RDF storage (even
> though there are ready-to-use gz_compress and gz_uncompress functions
> in Virtuoso/PL). I'll think about it.
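>
> For what it's worth, measuring what those built-ins buy you on a
> synthetic sequence could look roughly like this (an untested sketch;
> repeat, length and dbg_obj_print are standard Virtuoso/PL functions,
> but verify the exact signatures in the docs):
>
>     -- compress a synthetic ~256 KB "sequence" and compare sizes
>     create procedure GZ_DEMO ()
>     {
>       declare seq, packed varchar;
>       seq := repeat ('ACGT', 65536);
>       packed := gz_compress (seq);
>       dbg_obj_print ('plain:', length (seq), 'packed:', length (packed));
>     };
>
> gz_uncompress should give you the original string back; check the docs
> for its exact calling convention.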
>
>> I've set my logging to /dev/null because the server refused to start
>> without something on the TransactionFile line.
>
> Strange, I'll check that but not right now, I'm afraid.
>
>> As for striping, I'm currently loading onto a disk array of 5
>> physical hard drives exposed as a single logical disk, so I don't
>> think I would gain from striping. The Virtuoso web site says:
>> "Striping only has a potential performance benefit when striped
>> across multiple devices". Do you think I should try striping anyway?
>
> No; you should have one IO queue per independently controlled hard
> drive. If the 5 physical hard drives were mounted individually, you
> could stripe across all 5 and get a bit better performance than the OS
> can provide, but the difference is small if you have plenty of RAM and
> the OS disk buffers are hundreds of megabytes to gigabytes. So just
> keep running. The filesystem type plays some role (xfs is preferable),
> but in most cases you can ignore the difference, especially if the
> disks are far from full.
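>
> For reference, if you ever remount the 5 drives individually, the
> virtuoso.ini side would look roughly like this (illustrative paths and
> segment size; check the Striping section of the docs for the exact
> syntax of the io-queue assignments):
>
>     [Database]
>     Striping = 1
>
>     [Striping]
>     Segment1 = 10G, /disk1/db/seg1.db = iq1, /disk2/db/seg2.db = iq2, /disk3/db/seg3.db = iq3, /disk4/db/seg4.db = iq4, /disk5/db/seg5.db = iq5
>
> One stripe file per physical drive, each with its own io-queue, is the
> layout that matches "one IO queue per independently controlled hard
> drive".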
>
