On 26/02/13 19:09, Dominique Brezinski wrote:
This is an interesting subject. What are the steps involved, and how is the
work partitioned, when loading data? I ask because, if everything goes well, I
will be loading tens of billions of triples a day into a Redshift-backed
SDB store. Using SQL to load into Redshift is inefficient, so I plan to
prep the data destined for the temp tables as flat files written to S3,
then use the COPY command for a distributed load. From there
the standard INSERT INTO .... SELECT pattern can be used for the Nodes and
Triples/Quads tables. I want to leverage all the existing code to process
the triples, then instead of directly inserting, write batches to S3. S3
really likes 32-64MB chunks.
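
Roughly the shape of it - just a sketch, where the table and column names
(loosely following the SDB layout), the S3 paths and the credentials are all
placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class RedshiftBatchLoad {
    public static void main(String[] args) throws SQLException {
        // Redshift speaks the PostgreSQL wire protocol, so the stock
        // PostgreSQL JDBC driver works; URL and login are placeholders.
        String url = "jdbc:postgresql://example-cluster.redshift.amazonaws.com:5439/sdb";
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
             Statement stmt = conn.createStatement()) {

            // 1. Distributed load of the prepped flat files (one S3 prefix
            //    per batch) into the temp tables via COPY.
            String creds = "aws_access_key_id=<key>;aws_secret_access_key=<secret>";
            stmt.execute("COPY temp_nodes FROM 's3://my-bucket/batch-0001/nodes/' "
                    + "CREDENTIALS '" + creds + "' DELIMITER '\\t'");
            stmt.execute("COPY temp_quads FROM 's3://my-bucket/batch-0001/quads/' "
                    + "CREDENTIALS '" + creds + "' DELIMITER '\\t'");

            // 2. The INSERT INTO ... SELECT step: add only nodes not already
            //    present, then append the quads.
            stmt.execute("INSERT INTO nodes (hash, lex, lang, datatype, node_type) "
                    + "SELECT DISTINCT t.hash, t.lex, t.lang, t.datatype, t.node_type "
                    + "FROM temp_nodes t LEFT JOIN nodes n ON n.hash = t.hash "
                    + "WHERE n.hash IS NULL");
            stmt.execute("INSERT INTO quads (g, s, p, o) "
                    + "SELECT DISTINCT g, s, p, o FROM temp_quads");
        }
    }
}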
I will read the code, but if anybody has thoughts on the pattern, or some
hard-won tips, it would be much appreciated.
Dom
Hi Dom,
There are certainly things you can do to bulk load more efficiently than
naive insertion.
Thoughts, nothing deep:
It might be worth using the naive layout - a single triple table with
string columns for subject/predicate/object.
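
For concreteness, something like this (column widths are a guess):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** The "naive" layout: one table, one row per triple, nodes stored as
 *  strings, so there is no Nodes table and no node-id allocation to
 *  coordinate. */
public class NaiveTripleTable {
    public static void create(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE triples ("
                    + " s VARCHAR(1024) NOT NULL,"
                    + " p VARCHAR(1024) NOT NULL,"
                    + " o VARCHAR(65535) NOT NULL)");
        }
    }
}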
If you use the "layout 2" style, hashes are better because they do not require
global coordination to allocate node ids. If you want to use index
numbers, some kind of block allocation scheme will work (allocate blocks
of ids to a processor and it can allocate within that block quickly).
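
A minimal sketch of the block idea - the names are made up, and the shared
counter is just an in-process stand-in for whatever you actually coordinate
through (e.g. a row in the database updated atomically):

import java.util.concurrent.atomic.AtomicLong;

/** Hands out ids in contiguous blocks: one instance per loader thread or
 *  process, so the shared counter is only touched once per block. */
public class BlockIdAllocator {
    private static final long BLOCK_SIZE = 1_000_000L;

    // Stand-in for the globally shared counter.
    private static final AtomicLong nextBlockStart = new AtomicLong(0);

    private long next = 0;
    private long limit = 0;

    public long allocate() {
        if (next == limit) {                        // current block exhausted
            next = nextBlockStart.getAndAdd(BLOCK_SIZE);
            limit = next + BLOCK_SIZE;
        }
        return next++;                              // no coordination needed
    }
}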
The default hash is only 64 bits, which is good up to around a billion
items IIRC before the probability of a clash starts rising. You'll
possibly have to add some bits here.
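
If you did need more bits, the simplest thing is to keep a full 128-bit
digest and store it as hex - a rough sketch, not tied to SDB's current hash
code:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WideNodeHash {
    /** 128-bit hash of a node's key string (e.g. lexical form + lang +
     *  datatype), returned as hex suitable for a CHAR(32) column. */
    public static String hash128(String nodeKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(nodeKey.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 not available", e);
        }
    }
}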
The prepare-and-load scheme you outline is going to be a far better way to
do large-scale loads than row-by-row SQL inserts - loading through SDB over
SQL is going to be worse.
Do let me know how you get on - sounds interesting.
Andy