The TDB bulk loaders all work on the assumption that they are building a
new database from empty. You give all the files to the bulk loader at once.
The other point worth mentioning when working with a lot of data is to
make sure the data is all valid - if some RDF is broken but only
encountered late in the load, then a lot of work is done only to be
thrown away.
The final general point - find a large-RAM server machine and load
there. Databases are portable.
The dataset (provided by someone else) is split into about 80000
individual RDF/XML files, all in one directory by default. This makes
even straightforward directory listings slow, and means that the
command-line length/maximum number of arguments is exceeded, so I
can't just refer to *.rdf.
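One standard workaround for the argument-limit problem is to stream the file names through xargs, which batches them so no single invocation exceeds the limit. A minimal sketch (the directory and file names are synthetic stand-ins, and `ls -1` stands in for whatever tool you would really run, e.g. riot):

```shell
# Make a scratch directory with many stand-in .rdf files.
tmp=$(mktemp -d)
cd "$tmp"
for i in $(seq 1 500); do : > "file$i.rdf"; done

# find emits the names; xargs batches them (here 100 per invocation)
# so no command line gets too long. ls -1 stands in for riot.
count=$(find . -maxdepth 1 -name '*.rdf' -print0 | xargs -0 -n 100 ls -1 | wc -l)
echo "$count"
```

With riot in place of `ls -1` and the output redirected to a single file, this processes all 80000 files without ever hitting the command-line limit.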
What a nuisance! But given that it's a good idea to check all the data
first, you might as well parse each file into a single N-Triples file
(or a few) and then load from that.
So I suggest step one is to parse all the files to get an N-Triples working set:
for dir in DIR1 DIR2 ... ; do
    riot "$dir"/*.rdf >> SOMEWHERE/data.nt
done
then keep the result compressed (N-Triples compresses about 8-10x using gzip).
My first approach has been to create separate directories, each with
about a third of the files, and use tdbloader to load each group in
turn. I gave tdbloader 6Gb of memory (of the 7Gb available on the
machine) and it took four hours to load and index the first group of
files, 207m triples in total. As Andy mentioned in a thread
yesterday, the triples/sec count gradually declined over the course of
the import (from about 30k/sec to 24k/sec).
You don't need to give the loader more RAM - on 64-bit hardware/Java it
uses OS memory-mapped files, not the Java heap. 2G should be enough - 1.2G
is probably OK (it depends a bit on the size of literals).
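For concreteness, a sketch of a modest-heap invocation - this assumes the tdbloader wrapper script from the Apache Jena distribution, which reads the JVM_ARGS environment variable (check your own script; the paths are placeholders):

```shell
# A modest Java heap is enough: TDB's indexes live in OS-managed
# memory-mapped files, not on the Java heap.
# Assumes the Jena distribution script, which honours JVM_ARGS.
export JVM_ARGS="-Xmx2G"
tdbloader --loc /path/to/DB SOMEWHERE/data.nt
```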
However when I tried to use tdbloader to load the next group of files
into the same TDB, I found that performance declined dramatically -
down to about 400 triples/sec right from the start. Is this expected
behaviour? I wonder if it's because it's trying to add new data to an
already indexed set - is this the case, and if so is there any way to
improve the performance?
You're right - incremental loading (which is intended more for adding
relatively small amounts of data to an existing large dataset) is slower
for bulk loading.
Coming from a relational database background,
my instinct would be to postpone indexing until all the triples were
loaded (i.e. after the third group of files was imported), however I
couldn't see any options affecting the index creation in tdbloader.
There isn't such an option - tdbloader sorts this out itself: "load from
empty", where it does play games with index creation, versus "loading into
existing data", where it does not. It could be done - but I'm trying not
to make everything really complicated (like the SQL world :-)
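Given that, the fast path is to make sure you hit the "load from empty" case: point tdbloader at a fresh location and give it everything in one run rather than in three consecutive runs. A sketch (paths are placeholders; Jena's parsers also read gzipped input directly, so the files can stay compressed):

```shell
# One bulk load into a fresh, empty location gets the optimised
# index-building path; subsequent loads into the same DB do not.
tdbloader --loc /path/to/freshDB SOMEWHERE/*.nt.gz
```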
Another question is whether the strategy I've adopted (i.e. loading 3
groups of ~27k files consecutively) is the correct one. The
alternative would be to merge all 80k files into one in a separate
step, then load the resulting humongous file. I suspect that there
would be different issues with that approach.
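If you do go the merge route, note that merging N-Triples is cheap: the format is line-oriented, so merging is plain concatenation, and concatenated gzip streams are themselves a valid gzip stream - no decompress/recompress pass needed. A minimal sketch with synthetic data:

```shell
# Two small gzipped N-Triples files (synthetic content).
a=$(mktemp); b=$(mktemp); merged=$(mktemp)
printf '<urn:s1> <urn:p> "a" .\n' | gzip -c > "$a"
printf '<urn:s2> <urn:p> "b" .\n' | gzip -c > "$b"

# Concatenated gzip members form one valid gzip stream.
cat "$a" "$b" > "$merged"

lines=$(gzip -dc "$merged" | wc -l)
echo "$lines"
```

So "one humongous file" need not mean a slow merge step - though a single multi-hundred-million-triple load still has its own long runtime.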
Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
instance be better? Or three separate TDB instances? Obviously the
latter would require some sort of query federation layer.
SDB does not scale as well.
If there is a natural split to 3 (say) servers, then that might work but
without a clever query federation layer the details get exposed to the
application.
I'm relatively new to this whole area so any tips on best practice
would be appreciated.
Regards
Glenn.
Let us know how you get on,
I've loaded over 1 billion triples - yes, it takes quite a long time but
it gets there.
Andy
(Paolo - what's the state of your parallel loader work?)