The TDB bulk loaders all work on the assumption that they are building a
new database from empty. You give all the files to the bulk loader at once.
The other point worth mentioning when working with a lot of data is to
make sure the data is all valid - if some RDF is broken but only
encountered late in the load, then a lot of work is done only to be
thrown away.
The final general point - find a large-RAM server machine and load
there. Databases are portable.
The dataset (provided by someone else) is split into about 80000
individual RDF/XML files, all in one directory by default. This makes
even straightforward directory listings slow, and means that the
command-line length/maximum number of arguments is exceeded, so I
can't just refer to *.rdf.
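One standard workaround for the argument-limit problem is to stream the file names through xargs, which batches them so no single invocation exceeds the limit. A minimal sketch (the directory and file names are synthetic stand-ins, and `ls -1` stands in for whatever tool you would really run, e.g. riot):

```shell
# Make a scratch directory with many stand-in .rdf files.
tmp=$(mktemp -d)
cd "$tmp"
for i in $(seq 1 500); do : > "file$i.rdf"; done

# find emits the names; xargs batches them (here 100 per invocation)
# so no command line gets too long. ls -1 stands in for riot.
count=$(find . -maxdepth 1 -name '*.rdf' -print0 | xargs -0 -n 100 ls -1 | wc -l)
echo "$count"
```

With riot in place of `ls -1` and the output redirected to a single file, this processes all 80000 files without ever hitting the command-line limit.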
What a nuisance! But given that it's a good idea to check all the data
first, you might as well parse each file into a single N-Triples file
(or a few) and then load from that.
So I suggest step one is to parse all the files to get an N-Triples working set:
for dir in DIR1 DIR2 ... ; do
    riot "$dir"/*.rdf >> SOMEWHERE/data.nt
done
then keep the result compressed (N-Triples compresses about 8-10x using gzip).
My first approach has been to create separate directories, each with
about a third of the files, and use tdbloader to load each group in
turn. I gave tdbloader 6Gb of memory (of the 7Gb available on the
machine) and it took four hours to load and index the first group of
files, 207m triples in total. As Andy mentioned in a thread
yesterday, the triples/sec count gradually declined over the course of
the import (from about 30k/sec to 24k/sec).
You don't need to give the loader more RAM - on 64-bit hardware/Java it
uses OS memory-mapped files, not the Java heap. 2G should be enough - 1.2G
is probably OK (it depends a bit on the size of literals).
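For concreteness, a sketch of a modest-heap invocation - this assumes the tdbloader wrapper script from the Apache Jena distribution, which reads the JVM_ARGS environment variable (check your own script; the paths are placeholders):

```shell
# A modest Java heap is enough: TDB's indexes live in OS-managed
# memory-mapped files, not on the Java heap.
# Assumes the Jena distribution script, which honours JVM_ARGS.
export JVM_ARGS="-Xmx2G"
tdbloader --loc /path/to/DB SOMEWHERE/data.nt
```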
However when I tried to use tdbloader to load the next group of files
into the same TDB, I found that performance declined dramatically -
down to about 400 triples/sec right from the start. Is this expected
behaviour? I wonder if it's because it's trying to add new data to an
already indexed set - is this the case, and if so is there any way to
improve the performance?
You're right - incremental loading (which is intended more for adding
relatively small amounts of data to an existing large dataset) is slower
for bulk loading.
Coming from a relational database background,
my instinct would be to postpone indexing until all the triples were
loaded (i.e. after the third group of files was imported), however I
couldn't see any options affecting the index creation in tdbloader.
There isn't such an option - tdbloader sorts this out itself: "load from
empty", where it does play games with index creation, versus "loading into
existing data", where it does not. It could be done - but I'm trying not
to make everything really complicated (like the SQL world :-)
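Given that, the fast path is to make sure you hit the "load from empty" case: point tdbloader at a fresh location and give it everything in one run rather than in three consecutive runs. A sketch (paths are placeholders; Jena's parsers also read gzipped input directly, so the files can stay compressed):

```shell
# One bulk load into a fresh, empty location gets the optimised
# index-building path; subsequent loads into the same DB do not.
tdbloader --loc /path/to/freshDB SOMEWHERE/*.nt.gz
```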
Another question is whether the strategy I've adopted (i.e. loading 3
groups of ~27k files consecutively) is the correct one. The
alternative would be to merge all 80k files into one in a separate
step, then load the resulting humongous file. I suspect that there
would be different issues with that approach.
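If you do go the merge route, note that merging N-Triples is cheap: the format is line-oriented, so merging is plain concatenation, and concatenated gzip streams are themselves a valid gzip stream - no decompress/recompress pass needed. A minimal sketch with synthetic data:

```shell
# Two small gzipped N-Triples files (synthetic content).
a=$(mktemp); b=$(mktemp); merged=$(mktemp)
printf '<urn:s1> <urn:p> "a" .\n' | gzip -c > "$a"
printf '<urn:s2> <urn:p> "b" .\n' | gzip -c > "$b"

# Concatenated gzip members form one valid gzip stream.
cat "$a" "$b" > "$merged"

lines=$(gzip -dc "$merged" | wc -l)
echo "$lines"
```

So "one humongous file" need not mean a slow merge step - though a single multi-hundred-million-triple load still has its own long runtime.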
Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
instance be better? Or three separate TDB instances? Obviously the
latter would require some sort of query federation layer.
SDB does not scale as well.
If there is a natural split to 3 (say) servers, then that might work but
without a clever query federation layer the details get exposed to the
application.
I'm relatively new to this whole area so any tips on best practice
would be appreciated.
Regards
Glenn.
Let us know how you get on,
I've loaded over 1 billion triples - yes, it takes quite a long time but
it gets there.
Andy
(Paolo - what's the state of your parallel loader work?)