Bill, I tried this approach with a bash script: split the file, then loop through the split files with tdbloader. I tried splitting the data into 67 10M-line files and then into 7 100M-line files, in the hope that the JVM would reclaim memory between the serial loads and give the overall process a boost. In both cases the result was much the same: the load slowed down as it worked through the files. The developers can correct me here, but I believe the slowdown occurs when writing to the DB location rather than when reading from the N-Triples file.
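For the record, the loading loop was nothing fancy; roughly the following, with the file names, chunk size, and DB path as placeholders:

  #!/bin/bash
  # Split the N-Triples dump into 10M-line chunks (chunk_aa, chunk_ab, ...)
  split -l 10000000 all.nt chunk_

  # Load each chunk serially into the same TDB location
  for f in chunk_*; do
      echo "Loading $f"
      tdbloader --loc=/data/tdb "$f"
  done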
I am now experimenting with loading the 7 100M-line files into 7 DB locations and then working with a union query, just to get something going while I wait on a VM or dedicated server with more RAM (rough commands are in the P.S. at the bottom of this mail). My sysadmin wanted to start off by allocating 16G to the VM to see how it goes, but now that I have read Aaron's comment (thanks Aaron!) I think I will forward that info along and try to get 40-60G right away.

- Josh

On Feb 26, 2013, at 11:00 AM, Bill Roberts wrote:

> Since it's N-Triples, and so one triple per line, why not use unix utilities (e.g. 'split') to divide it into lots of smaller chunks and do a series of tdbloader uploads? Should be fairly straightforward to script in bash or another scripting language of your choice. That should have a lower memory requirement and so avoid the massive slowdown. Or am I missing something?
> 
> Bill
> 
> On 26 Feb 2013, at 18:23, Aaron Coburn <[email protected]> wrote:
> 
>> I recently had a need to load ~225M triples into a TDB triplestore, and when allocating only ~12G to the triple loader, I experienced the very same slowdowns you described. As an alternative, I just reserved an on-demand, high-memory (i.e. ~60GB) instance in the public cloud, and the processing completed in only a few hours. I then just moved the files onto my local server and proceeded from there.
>> 
>> Aaron Coburn
>> 
>> On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]> wrote:
>> 
>>> On 25/02/13 20:07, Joshua Greben wrote:
>>>> Hello All,
>>>> 
>>>> I am new to this list and to Jena and was wondering if anyone could offer advice for loading a large triplestore.
>>>> 
>>>> I am trying to load 670M N-Triples into a store using tdbloader on a single machine with 64-bit hardware and 8GB of memory. However, I am running into a massive slowdown. When the load starts, tdbloader is processing around 30K tps, but by the time it has loaded 130M triples it can essentially no longer load any more and slows down to 2300 tps. At that point I have to kill the process because it will basically never finish.
>>>> 
>>>> Is 8GB of memory enough, or is there a more efficient way to load this data? I am trying to load the data into a single DB location. Should I be splitting up the triples and loading them into different DBs?
>>>> 
>>>> Advice from anyone who has experience successfully loading a large triplestore is much appreciated.
>>> 
>>> Only 8G is pushing it somewhat for 670M triples. It will finish; it will take a very long time. Faster loads have been reported by using a larger machine (e.g. Freebase in 8 hours on an IBM Power7 with 48G RAM).
>>> 
>>> tdbloader2 (Linux only) may get you there a bit quicker, but really you need a bigger machine.
>>> 
>>> Once built, you can copy the dataset as files to other machines.
>>> 
>>> Andy
>>> 
>>>> Thanks!
>>>> 
>>>> - Josh
>>>> 
>>>> Joshua Greben
>>>> Library Systems Programmer & Analyst
>>>> Stanford University Libraries
>>>> (650) 714-1937
>>>> [email protected]
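P.S. For anyone trying the same thing, the multi-location experiment I mentioned above is just a variation of the same loop, with one TDB directory per chunk (paths are again placeholders); the union query over the resulting locations is the part I still have to work out:

  #!/bin/bash
  # Assumes the dump was already split into 100M-line chunks (chunk_aa, chunk_ab, ...)
  # Load each chunk into its own TDB location: /data/tdb/part1 ... /data/tdb/part7
  i=1
  for f in chunk_*; do
      tdbloader --loc=/data/tdb/part$i "$f"
      i=$((i + 1))
  done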
