Yes, there are a lot of large literals. 

I am using tdbloader.

The OS is Red Hat 6.3. The hardware is definitely shared, but I am not sure 
what the other applications are; I will have to ask the sys admin. If memory 
serves, the hardware has something like 90-ish GB of RAM, so I currently have 
the lion's share. All of the RAM was being used by tdbloader at the time. I 
will also have to inquire about a possible ulimit on memory-mapped files.
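
A minimal sketch of the checks I plan to run, assuming standard Linux tooling 
(the exact limit that matters on this box may differ):

    ulimit -v                        # per-process virtual memory cap in KB; "unlimited" is ideal
    ulimit -l                        # per-process locked-memory cap
    cat /proc/sys/vm/max_map_count   # cap on the number of mmap regions per process

If any of these are capped well below the sizes of the files TDB maps, the 
extra RAM would simply go unused.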

I don't think that I mentioned this, but I gave 3200M each for Xms and Xmx in 
JVM_ARGS. Before that I tried giving it 60G, but it started swapping during GC. 
Maybe there is a happy medium?
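
For the record, here is roughly how I am invoking it (a sketch; the store 
location and file name are placeholders):

    JVM_ARGS="-Xms3200M -Xmx3200M" tdbloader --loc /data/tdb data.nt.gz

My understanding is that on a 64-bit JVM, TDB does its caching through 
memory-mapped files outside the Java heap, so a modest heap that leaves most 
of the RAM to the OS file cache may be that happy medium.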

Thanks again for the pointers and advice.

- Josh

On Mar 29, 2013, at 4:50 AM, Andy Seaborne wrote:

> On 27/03/13 20:46, David Jordan wrote:
>> 
>> What indexes exist during the load?
> 
> SPO, POS, OSP -- it's a fixed set for TDB (OK, it is changeable, but the 
> facility is not exposed).
> 
> More below ...
> 
>> 
>> On Mar 27, 2013, at 4:40 PM, Joshua Greben wrote:
>> 
>>> Hello all,
>>> 
>>> I just wanted to give an update on how my loading of 670M triples was 
>>> going, or in this case not going.
>>> 
>>> I ran the riot --time --sink script on my 6.3 GB nt.gz file and this was 
>>> the result:
>>> 14,128.77 sec  676,740,132 triples  47,898.03 TPS
> 
> 47K TPS is slowish - are there lots of large literals?
> 
> But it may reflect the fact that I/O on the VM is slow.
> 
> (You may have answered this - the gap in the thread reflects a gap in my 
> memory :-)
> 
>>> 
>>> According to this result it should finish in about 4 hours; however, I am 
>>> still seeing the same slowdown as before, without any swapping. Here are 
>>> the first and last few lines of the log file (showing only the number of 
>>> triples processed and the TPS):
>>> 
> 
> Which loader are you trying?
> 
>>> 50,000 15,165
>>> 100,000 21,886
>>> 150,000 25,231
>>> 200,000 27,074
>>> 250,000 28,344
>>> 300,000 29,620
>>> 350,000 30,261
>>> 400,000 31,063
>>> 450,000 31,768
>>> 500,000 31,752
> 
> That's all quite slow - for a common distribution of triples, it can be going 
> at 70-80K TPS at this point.
> 
>>> ...
>>> 266,350,000 1,564
>>> 266,400,000 1,551
>>> 266,450,000 1,539
>>> 266,500,000 1,522
>>> 
>>> 
>>> My new machine (still a VM) has 64GB RAM and four 2.3GHz processors. Any 
>>> ideas why I can't get a decent processing time? It seems like I would be 
>>> able to load ~225M triples into each of three stores over the course of a 
>>> day or two, but I would rather have one massive store if that is even 
>>> possible.
> 
> That is surprisingly slow.  I do loads of 200M-300M size on a smaller box.
> 
> What's the OS? Is the hardware shared? (Yes, it happens - I've seen setups of 
> four 8G VMs on an 8G box.) And is the VM on hardware with more than 64G?
> 
> Is all the RAM getting used? Some Linux setups ulimit the space allowed for 
> memory-mapped files, so you can give it all the RAM you like but it does not 
> get used.
> 
>       Andy
> 
> 
>>> 
>>> Any ideas appreciated.
>>> 
>>> Thanks
>>> 
>>> - Josh
>>> 
>>> On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:
>>> 
>>>> Joshua,
>>>> 
>>>> If you're in a VM you have another layer trying to help and, in my 
>>>> experience, it does not.  Sometimes, maliciously so. [*]
>>>> 
>>>> And sharing a real machine can mean you are contending for real resources 
>>>> like disk or memory bandwidth.
>>>> 
>>>> In that setup, whatever you do, make sure the VM is not swapping on the 
>>>> real machine.  That will make memory-mapped files perform very badly (a 
>>>> sort of double-swapping effect).
>>>> 
>>>> Reading the N-triples file is unlikely to be the bottleneck.  The VM 
>>>> should not make too much difference - only large chunks of streaming read 
>>>> I/O are being done (though I have encountered VMs that seem to make even 
>>>> that a cost).
>>>> 
>>>> You can test this with
>>>> 
>>>> riot --time --sink ... files ...
>>>> 
>>>> On the machine I'm currently on, a 3-year-old desktop with 8G RAM and no 
>>>> SSD, I get 100K-150K triples/s for BSBM from .nt.gz files.  BSBM isn't the 
>>>> fastest to parse as it has some moderately long literals (or "more bytes 
>>>> to copy" as the parser sees it).
>>>> 
>>>> A bit faster from .nt but not enough to make it worth decompressing.
>>>> 
>>>> .. just tried ...
>>>> 
>>>> ~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
>>>> ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>>>> 
>>>> ~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
>>>> ~ >> riot --time --sink X.nt
>>>> ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>>>> 
>>>> The parser streams ... only a ridiculously high proportion of bNodes in 
>>>> large datasets should cause it to slow down.
>>>> 
>>>> Please let us know how it goes - it's all useful to build up a pool of 
>>>> experiences.
>>>> 
>>>>    Andy
>>>> 
>>>> [*] As you might guess, I've encountered various "issues" on VMs in the 
>>>> past due to under- or mis-provisioning for a database server. AWS is OK, 
>>>> subject to not getting one of the occasional duff machines that people 
>>>> report; I've had one of those once.
>>>> 
>>>> 
>>> 
>> 
> 
