On 12-03-01 06:49 AM, Andy Seaborne wrote:
The dataset (provided by someone else) is split into about 80000
individual RDF/XML files, all in one directory by default. This makes
even straightforward directory listings slow, and means that the
command-line length/maximum number of arguments is exceeded, so I
can't just refer to *.rdf.

What a nuisance! But given it's a good idea to check all the data first,
you might as well parse each file into a single N-Triples file (or a few)
and then load from that.

So I suggest step one is to parse all the files to get an N-Triples
working set.

for each directory :
riot FILE1 FILE2 ... >> SOMEWHERE/data.nt
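The per-directory step above can be sketched with `find` and `xargs`, which split the 80000 file names into argument-limit-sized chunks so no single command line exceeds ARG_MAX (the problem that breaks a plain `riot *.rdf`). This is only a sketch: the directory name is made up, and `cat` stands in for Jena's `riot` so it runs anywhere; substitute `riot` on a machine with Jena installed.

```shell
# Make a small stand-in directory of "RDF/XML" files (names are
# illustrative; in practice this is the directory of ~80000 files).
mkdir -p rdfxml
for i in 1 2 3; do
  printf '<urn:s%s> <urn:p> <urn:o> .\n' "$i" > "rdfxml/f$i.rdf"
done

# find + xargs chunks the file list and invokes the parser as many
# times as needed, all appending to one N-Triples file. `cat` is a
# stand-in for `riot` here so the sketch is self-contained.
: > data.nt                  # start from an empty output file
find rdfxml -name '*.rdf' -print0 | xargs -0 cat >> data.nt

wc -l < data.nt              # one triple per input file
```

`-print0`/`-0` keeps file names with spaces intact; `xargs` handles the chunking, so this scales to any number of files.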

I have tested this approach over the past two days. Although I expected the large N-Triples files to load faster, I can't say I see an improvement over the smaller RDF/XML files. That is why I posted my initial email to the list about batch counts.

Just to put things in perspective:

N-Triples files: 7 MB to 12 GB

RDF/XML files: 7 MB to 15 MB

The load rate is all over the place. Sometimes the files load quickly (with either method), and sometimes not. I haven't been able to identify the root cause.
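One way to make load rates comparable across runs is to split the big N-Triples file into uniform batches, since N-Triples is line-oriented and a line-based split never cuts a triple in half. A minimal sketch (the batch size and file names here are illustrative, not from the original data):

```shell
# Build a tiny stand-in N-Triples file (5 triples, one per line).
printf '<urn:s%s> <urn:p> <urn:o> .\n' 1 2 3 4 5 > data.nt

# Split into fixed-size batches: 2 triples per output file.
# Output files are batch-aa, batch-ab, batch-ac (2 + 2 + 1 lines),
# each a valid N-Triples file that can be loaded and timed separately.
split -l 2 data.nt batch-

wc -l batch-*
```

Loading each `batch-*` file with the same tdbloader invocation and timing it would show whether the per-batch rate really varies, or whether the variance comes from file size differences.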

As I've said, I use tdbloader and I probably won't get a chance to try tdbloader2/3 until I'm through this.

-Sarven
