Re: Strategies for loading large (>500m triples) datasets

Paolo Castagna Thu, 01 Mar 2012 09:05:33 -0800

Andy Seaborne wrote:
> (Paolo - what's the state of your parallel loader work?)


Both tdbloader3 [1] and tdbloader4 [2] are (should be?) correct.
I've been testing them with datasets in the 500-700 million triple
range, but I still consider them *experimental*.

Having someone else give tdbloader3 a try would be good, though.

We have been experiencing a few problems with tdbloader4 but I am
not sure if it is because of tdbloader4 or because of the Amazon EMR
environment we are using.

I have some code to convert Freebase dumps to RDF (~600 million
triples), and I'll use that to gather some numbers. Ideally, I'd
compare tdbloader, tdbloader2, tdbloader3 and tdbloader4 (both in
terms of time and cost).

Paolo

 [1] http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader3/trunk/
 [2] https://github.com/castagna/tdbloader4
