On 26/02/13 19:09, Dominique Brezinski wrote:
This is an interesting subject. What are the steps involved, and how is the
work partitioned, when loading data? I ask because, if everything goes well, I
will be loading tens of billions of triples a day into a Redshift-backed
SDB store. Using SQL to load into Redshift is inefficient, so I plan to
prep the data destined for the temp tables as flat files written to S3,
then use the COPY command for a distributed load. From there
the standard INSERT INTO .... SELECT pattern can be used for the Nodes and
Triples/Quads tables. I want to leverage all the existing code to process
the triples, then instead of directly inserting, write batches to S3. S3
really likes 32-64MB chunks.
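
Roughly the shape of it - just a sketch, where the table and column names
(loosely following the SDB layout), the S3 paths and the credentials are all
placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class RedshiftBatchLoad {
    public static void main(String[] args) throws SQLException {
        // Redshift speaks the PostgreSQL wire protocol, so the stock
        // PostgreSQL JDBC driver works; URL and login are placeholders.
        String url = "jdbc:postgresql://example-cluster.redshift.amazonaws.com:5439/sdb";
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
             Statement stmt = conn.createStatement()) {

            // 1. Distributed load of the prepped flat files (one S3 prefix
            //    per batch) into the temp tables via COPY.
            String creds = "aws_access_key_id=<key>;aws_secret_access_key=<secret>";
            stmt.execute("COPY temp_nodes FROM 's3://my-bucket/batch-0001/nodes/' "
                    + "CREDENTIALS '" + creds + "' DELIMITER '\\t'");
            stmt.execute("COPY temp_quads FROM 's3://my-bucket/batch-0001/quads/' "
                    + "CREDENTIALS '" + creds + "' DELIMITER '\\t'");

            // 2. The INSERT INTO ... SELECT step: add only nodes not already
            //    present, then append the quads.
            stmt.execute("INSERT INTO nodes (hash, lex, lang, datatype, node_type) "
                    + "SELECT DISTINCT t.hash, t.lex, t.lang, t.datatype, t.node_type "
                    + "FROM temp_nodes t LEFT JOIN nodes n ON n.hash = t.hash "
                    + "WHERE n.hash IS NULL");
            stmt.execute("INSERT INTO quads (g, s, p, o) "
                    + "SELECT DISTINCT g, s, p, o FROM temp_quads");
        }
    }
}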
I will read the code, but if anybody has thoughts on the pattern, or some
hard-won tips, it would be much appreciated.
Dom
Hi Dom,
There are certainly things you can do to bulk load more efficiently than
naive insertion.
Thoughts, nothing deep:
It might be worth using the naive layout - a single triple table with
string columns for subject/predicate/object.
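
For concreteness, something like this (column widths are a guess):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** The "naive" layout: one table, one row per triple, nodes stored as
 *  strings, so there is no Nodes table and no node-id allocation to
 *  coordinate. */
public class NaiveTripleTable {
    public static void create(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE triples ("
                    + " s VARCHAR(1024) NOT NULL,"
                    + " p VARCHAR(1024) NOT NULL,"
                    + " o VARCHAR(65535) NOT NULL)");
        }
    }
}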
If you use the "layout 2" style, hashes are better because they do not require
global coordination to allocate node ids. If you want to use index
numbers, some kind of block allocation scheme will work (allocate blocks
of ids to a processor and it can allocate within that block quickly).
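
A minimal sketch of the block idea - the names are made up, and the shared
counter is just an in-process stand-in for whatever you actually coordinate
through (e.g. a row in the database updated atomically):

import java.util.concurrent.atomic.AtomicLong;

/** Hands out ids in contiguous blocks: one instance per loader thread or
 *  process, so the shared counter is only touched once per block. */
public class BlockIdAllocator {
    private static final long BLOCK_SIZE = 1_000_000L;

    // Stand-in for the globally shared counter.
    private static final AtomicLong nextBlockStart = new AtomicLong(0);

    private long next = 0;
    private long limit = 0;

    public long allocate() {
        if (next == limit) {                        // current block exhausted
            next = nextBlockStart.getAndAdd(BLOCK_SIZE);
            limit = next + BLOCK_SIZE;
        }
        return next++;                              // no coordination needed
    }
}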
The default hash is only 64 bits, which is good up to around a billion
items IIRC before the probability of a clash starts rising. You'll
possibly have to add some bits here.
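
If you did need more bits, the simplest thing is to keep a full 128-bit
digest and store it as hex - a rough sketch, not tied to SDB's current hash
code:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WideNodeHash {
    /** 128-bit hash of a node's key string (e.g. lexical form + lang +
     *  datatype), returned as hex suitable for a CHAR(32) column. */
    public static String hash128(String nodeKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(nodeKey.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 not available", e);
        }
    }
}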
The prepare-and-load scheme you outline is going to be a far better way to
do large-scale loads than row-by-row SQL inserts - loading through SDB over
SQL is going to be worse.
Do let me know how you get on - sounds interesting.
Andy