Hello,

I have a huge number of triples in N3 format, and some of them can have very large literals (it is the complete Genbank in N3, so some literals are whole sequences). The total is above 6 billion triples, and the uncompressed data is around 700 gigabytes. I just can't find a way to load it completely into Virtuoso.
I tried the script provided for loading the Bio2RDF data, located here: http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfloading. Even though it visibly works, the server always eventually dies with a segmentation fault, and since that script gives no output saying which triple it was trying to load when the crash happened, I can't restart it from the point it had reached. We therefore wrote our own version of the script in Perl, using the TTLP_MT procedure; with our version we know where the load was before a crash, so we can start it again from that point (a much simplified sketch of what it does is at the end of this mail). The important point to notice is that the server also keeps crashing after I restart the load, and the time between two crashes gets shorter each time. Eventually it becomes impossible to continue the load.

I have tried to do some tuning in virtuoso.ini myself, but I have only managed to get a minor speed boost. Should I remove all indexes for the loading, or add more? The server used for the loading has 32 cores, 128 GB of RAM and a 1.5 TB RAID array.

I can provide you with the Genbank dump if you want to play with it. If so, just tell me and I will give you the link and the corrections that need to be made before loading it (there were some errors in the RDFizer that generated the Genbank dump, but they are easily fixed with a regexp before the load).

So, do you have an optimized virtuoso.ini you can suggest to help? (My rough reading of the buffer sizing guideline for this machine is also at the end of this mail.)

Thanks,

Marc-Alexandre Nolin

P.S.: We have other large N3 dumps (even more than twice the size of this one) that we have not managed to load either, but I think that if we can solve the problem with this dump, the same solution will also work for the other dumps.
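For the record, here is a much simplified sketch of what our Perl loader does. The real script is more involved; the DSN, credentials, paths, graph URI and chunk naming below are only placeholders, and it assumes the dump has already been split into self-contained chunk files that the server is allowed to read (see DirsAllowed below).

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $graph      = 'http://bio2rdf.org/genbank';      # placeholder graph URI
my $checkpoint = 'genbank.checkpoint';               # records the last chunk committed
my @chunks     = sort glob('chunks/genbank-*.n3');   # pre-split, self-contained chunk files

# Placeholder ODBC DSN and credentials for the Virtuoso server.
my $dbh = DBI->connect('dbi:ODBC:VOS', 'dba', 'dba', { RaiseError => 1 });

# Resume after the last chunk that was recorded before a crash.
my $done = '';
if (open my $cp, '<', $checkpoint) { chomp($done = <$cp> // ''); close $cp; }

for my $file (@chunks) {
    next if $done ne '' && $file le $done;   # already loaded in a previous run

    # TTLP_MT parses the N3 text and loads it into the target graph;
    # flag 255 turns on the permissive parsing options.
    $dbh->do('DB.DBA.TTLP_MT(file_to_string_output(?), ?, ?, 255)',
             undef, $file, '', $graph);

    # Record progress so that after a crash we restart from the next chunk.
    open my $cp, '>', $checkpoint or die "cannot write checkpoint: $!";
    print {$cp} "$file\n";
    close $cp;
    print "loaded $file\n";
}

$dbh->disconnect;

Splitting the dump into chunk files beforehand keeps the size of each TTLP_MT call bounded and makes the checkpointing trivial, which is what lets us resume after a crash.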
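And about virtuoso.ini: from the usual buffer sizing guideline for Virtuoso (roughly two thirds of the RAM given to 8 KB database buffers), if I read it correctly, I would expect something in this range for a 128 GB machine. The numbers below are only my rough reading of that guideline, so please correct them:

[Parameters]
; about 2/3 of the 128 GB of RAM, in 8 KB buffers (~80 GB)
NumberOfBuffers = 10000000
; about 3/4 of NumberOfBuffers
MaxDirtyBuffers = 7500000
; example path, so the server (file_to_string_output) can read the chunk files
DirsAllowed     = ., /data/genbank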
