Joshua,

If you're in a VM you have another layer trying to help and, in my experience, it does not. Sometimes, maliciously so. [*]

And sharing a real machine can mean you are contending for real resources like disk or memory bandwidth.

In that setup, whatever you do, make sure the VM is not swapping on the real machine. That will make memory-mapped files perform very badly (a sort of double-swapping effect).
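
A quick way to keep an eye on that, assuming a Linux box where you can run vmstat:

    # Watch the si/so (swap in/out) columns every 5 seconds; anything
    # persistently non-zero while the load runs means real swapping.
    vmstat 5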

Reading the N-Triples file is unlikely to be the bottleneck. The VM should not make too much difference - only large chunks of streamed read I/O are being done (though I have encountered VMs that make even that costly).

You can test this with

riot --time --sink ... files ...

On the machine I'm on currently, a 3-year-old desktop with 8G RAM and no SSD, I get 100K-150K triples/s for BSBM from .nt.gz files. BSBM isn't the fastest to parse as it has some moderately long literals (or "more bytes to copy", as the parser sees it).

A bit faster from plain .nt, but not enough to make it worth decompressing.

.. just tried ...

~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
... 179.50 sec  25,000,250 triples  139,280.26 TPS

~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
~ >> riot --time --sink X.nt
... 168.74 sec  25,000,250 triples  148,157.53 TPS

The parser streams; only a ridiculous proportion of bNodes in a large dataset would cause it to slow down.

Please let us know how it goes - it's all useful to build up a pool of experiences.

        Andy

[*] As you might guess, I've encountered various "issues" on VMs in the past due to under- or mis-provisioning for a database server. AWS is OK, subject to not getting one of the occasional duff machines that people report; I've had one of those once.


On 26/02/13 21:31, Joshua Greben wrote:
Bill,

I tried this approach using a bash script to split the file and loop
through the split files. I tried splitting it into 67 10M-line files
and 7 100M-line files respectively, in the hope that the process
would recover more memory via the JVM garbage collector with each
serial load and give a boost to the overall process. In both cases
the result was pretty much the same as it worked through the files.
The developers can correct me here, but I believe that the slowdown
occurs when writing to the DB location as opposed to reading from
the N-Triples file.
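
A minimal sketch of that split-and-loop approach (the input file,
chunk prefix and DB location below are placeholders):

    # Split the N-Triples file into 10M-line chunks.
    split -l 10000000 data.nt chunk_

    # Load each chunk serially into the same TDB location.
    for f in chunk_*; do
        tdbloader --loc DB "$f"
    done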

I am now experimenting with loading the 7 100M-line files into 7 DB
locations and then working with a union query, just to get something
going while I wait for a VM or dedicated server with more RAM. My
sysadmin wanted to start by allocating 16G to the VM to see how it
goes, but now that I've read Aaron's comment (thanks Aaron!) I think
I will forward that info along and try to get 40-60G right away.

- Josh

On Feb 26, 2013, at 11:00 AM, Bill Roberts wrote:

Since it's N-Triples, and so one triple per line, why not use unix
utilities (e.g. 'split') to divide it into lots of smaller chunks
and do a series of tdbloader runs?  Should be fairly straightforward
to script in bash or another scripting language of your choice.
That should have a lower memory requirement and so avoid the massive
slowdown.  Or am I missing something?

Bill

On 26 Feb 2013, at 18:23, Aaron Coburn <[email protected]>
wrote:

I recently had a need to load ~225M triples into a TDB
triplestore, and when allocating only ~12G to the triple loader,
I experienced the very same slowdowns you described. As an
alternative, I just reserved an on-demand, high memory (i.e.
~60GB) instance in the public cloud, and the processing completed
in only a few hours. I then just moved the files onto my local
server and proceeded from there.

Aaron Coburn


On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]>
wrote:

On 25/02/13 20:07, Joshua Greben wrote:
Hello All,

I am new to this list and to Jena and was wondering if anyone
could offer advice for loading a large triplestore.

I am trying to load 670M triples (N-Triples format) into a store
using tdbloader on a single machine with 64-bit hardware and 8GB of
memory. However, I am running into a massive slowdown. When the
load starts, tdbloader is processing around 30K tps, but by the
time it has loaded 130M triples it can essentially load no more,
slowing down to 2300 tps. At that point I have to kill the process
because it will basically never finish.

Is 8GB of memory enough or is there a more efficient way to
load this data? I am trying to load the data into a single DB
location. Should I be splitting up the triples and loading
them into different DBs?

Advice from anyone who has experience successfully loading a
large triplestore is much appreciated.

Only 8G is pushing it somewhat for 670M triples.  It will finish,
but it will take a very long time.  Faster loads have been reported
using a larger machine (e.g. Freebase in 8 hours on an IBM Power7
with 48G RAM).

tdbloader2 (Linux only) may get you there a bit quicker, but really
you need a bigger machine.
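
For example (the database location and data file name here are just
placeholders):

    tdbloader2 --loc DB data.nt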

Once built, you can copy the dataset as files to other
machines.
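
For example, something like this (paths and hostname are
placeholders; make sure nothing has the database open while you
copy):

    rsync -av DB/ otherhost:/path/to/DB/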

Andy


Thanks!

- Josh



Joshua Greben
Library Systems Programmer & Analyst
Stanford University Libraries
(650) 714-1937
[email protected]







