Bill,

I tried this approach with a bash script: split the dump, then loop tdbloader 
over the split files. I tried it two ways, 67 files of 10M lines each and 7 
files of 100M lines each, in the hope that the process would recover more 
memory from the JVM garbage collector with each serial load and give a boost 
to the overall process. In both cases the result was pretty much the same: the 
load slowed down as it worked through the files. The developers can correct me 
here, but I believe the slowdown occurs when writing to the DB location rather 
than when reading from the N-Triples file.
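
For reference, the core of the script looked roughly like this (the file 
names, chunk size, and TDB directory below are placeholders):

  #!/bin/bash
  # Split the N-Triples dump (one triple per line) into fixed-size chunks,
  # then load each chunk serially into the same TDB location.
  split -l 10000000 big-dump.nt chunk_
  for f in chunk_*; do
      tdbloader --loc=/data/tdb "$f"
  done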

I am now experimenting with loading the seven 100M-line files into seven 
separate DB locations and then querying them with a union query, just to get 
something going while I wait on a VM or dedicated server with more RAM. My 
sysadmin wanted to start by allocating 16G to the VM to see how it goes, but 
now that I've read Aaron's comment (thanks Aaron!) I think I will forward that 
info along and try to get 40-60G right away.
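
The loading half of that experiment is just a variation on the same script, 
something like this (again, directory and file names are placeholders, and I 
am still working out how to do the union query itself):

  #!/bin/bash
  # Load each 100M-line chunk into its own TDB location so each individual
  # load stays small; the results get combined at query time.
  i=0
  for f in chunk_*; do
      i=$((i+1))
      mkdir -p /data/tdb-$i
      tdbloader --loc=/data/tdb-$i "$f"
  done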

- Josh

On Feb 26, 2013, at 11:00 AM, Bill Roberts wrote:

> Since it's N-triples and so one triple per line, why not use unix utilities 
> (eg 'split') to divide it into lots of smaller chunks and do a series of 
> tdbloader uploads.  Should be fairly straightforward to script in bash or 
> other scripting language of your choice.  That should have a lower memory 
> requirement and so avoid the massive slowdown.  Or am I missing something?
> 
> Bill
> 
> On 26 Feb 2013, at 18:23, Aaron Coburn <[email protected]> wrote:
> 
>> I recently had a need to load ~225M triples into a TDB triplestore, and when 
>> allocating only ~12G to the triple loader, I experienced the very same 
>> slowdowns you described. As an alternative, I just reserved an on-demand, 
>> high memory (i.e. ~60GB) instance in the public cloud, and the processing 
>> completed in only a few hours. I then just moved the files onto my local 
>> server and proceeded from there.
>> 
>> Aaron Coburn
>> 
>> 
>> On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]> wrote:
>> 
>>> On 25/02/13 20:07, Joshua Greben wrote:
>>>> Hello All,
>>>> 
>>>> I am new to this list and to Jena and was wondering if anyone could
>>>> offer advice for loading a large triplestore.
>>>> 
>>>> I am trying to load 670M N-Triples into a store using tdbloader on a
>>>> single machine with 64-bit hardware and 8GB of memory. However, I am
>>>> running into a massive slowdown. When the load starts the tdbloader
>>>> is processing around 30K tps but by the time it has loaded 130M
>>>> triples it can essentially no longer load any more and slows down to
>>>> 2300 tps. At that point I have to kill the process because it will
>>>> basically never finish.
>>>> 
>>>> Is 8GB of memory enough or is there a more efficient way to load this
>>>> data? I am trying to load the data into a single DB location. Should
>>>> I be splitting up the triples and loading them into different DBs?
>>>> 
>>>> Advice from anyone who has experience successfully loading a large
>>>> triplestore is much appreciated.
>>> 
>>> Only 8G is pushing it somewhat for 670M triples.  It will finish; it will 
>>> take a very long time.  Faster loads have been reported by using a larger 
>>> machine (e.g. Freebase in 8 hours on an IBM Power7 and 48G RAM).
>>> 
>>> tdbloader2 (Linux only) may get you there a bit quicker but really you need 
>>> a bigger machine.
>>> 
>>> Once built, you can copy the dataset as files to other machines.
>>> 
>>>     Andy
>>> 
>>>> 
>>>> Thanks!
>>>> 
>>>> - Josh
>>>> 
>>>> 
>>>> 
>>>> Joshua Greben
>>>> Library Systems Programmer & Analyst
>>>> Stanford University Libraries
>>>> (650) 714-1937
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
