> So my questions are: has anyone else observed this? Can anyone suggest any
> further improvements or things to try? What is the best OS to perform a
> TDB load on?

Richard - very useful feedback, thank you.

I haven't come across this before - and the difference is quite surprising.

What is the "mapped" value on each machine?
Could you "cat /proc/meminfo"?

TDB uses memory-mapped files - I'm wondering if the amount of RAM available to the processes differs in some way. Together with the parameters you have found to have an effect, that might explain the difference (speculation I'm afraid).
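Something along these lines should show the relevant figures (just a sketch - substitute the PID of the tdbloader java process):

    # System-wide view of mapped, cached and dirty pages:
    grep -E 'MemFree|Cached|Mapped|Dirty|Writeback' /proc/meminfo

    # Total mapped size of the loader process itself (PID is a placeholder):
    pmap -x <tdbloader-java-pid> | tail -n 1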

Is the filesystem the same?
How big is the resulting dataset?

(sorry for all the questions!)

tdbloader2 works differently from tdbloader even during the data phase. It looks like it's the B+trees slowing things down: there is only one in tdbloader2 phase one, but two in tdbloader phase one. That might explain the roughly 80 million -> 150 million difference (i.e. about x2).

        Andy

On 15/06/11 16:23, Richard Francis wrote:
Hi,

I'm using two identical machines in EC2 running tdbloader on CentOS (CentOS
release 5 (Final)) and Ubuntu 11.04 (Natty).

I've observed an issue where CentOS runs happily at a consistent speed
and completes a load of 650 million triples in around 12 hours, whereas the
load on Ubuntu tails off after just 15 million triples and gets progressively
slower.

On initial observation of the Ubuntu machine I noticed that the flush-202
process was running quite high; running iostat also showed that I/O was the
real bottleneck, with the Ubuntu machine showing constant use of the
disk for both reads and writes (the CentOS machine had periods of no usage
followed by periods of writes). This led me to investigate how memory was
being used by the Ubuntu machine, and a few blog posts / tutorials later I
found a couple of settings to tweak. The first I tried
was dirty_writeback_centisecs - setting this to 0 had an immediate positive
effect on the load I was performing - but after some more testing I
found that the problem was merely pushed back to around 80 million triples
before I saw a drop-off in performance.
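(For anyone wanting to try the same change, it's along these lines - a sketch rather than the exact commands I ran:)

    # disable the periodic writeback timer (takes effect immediately, not persistent across reboots):
    sudo sysctl -w vm.dirty_writeback_centisecs=0
    # or equivalently:
    echo 0 | sudo tee /proc/sys/vm/dirty_writeback_centisecs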

This led me to investigate whether tdbloader2 had the same issue - from my
observations it did, but this time at around 150 million triples.

Again I focused on the "dirty" settings - this time setting dirty_bytes
= 30000000000 and dirty_background_bytes = 15000000000 gave a massive
performance increase, and for most of the add phase of tdbloader it
kept up with the CentOS machine.
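(Again, roughly like so - a sketch using the values quoted above:)

    # switch from ratio-based to byte-based dirty thresholds (non-persistent):
    sudo sysctl -w vm.dirty_bytes=30000000000
    sudo sysctl -w vm.dirty_background_bytes=15000000000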

Finally, last night I stopped all loads and raced the CentOS machine against
the Ubuntu machine. Both have completed, but the CentOS machine (around 12
hours) was still far quicker than the Ubuntu machine (20 hours).

So my questions are: has anyone else observed this? Can anyone suggest any
further improvements or things to try? What is the best OS to perform a
TDB load on?

Rich


Tests were performed on three different machines (1 x CentOS and 2 x Ubuntu)
to rule out EC2 being a bottleneck - all were the following instance type (from
http://aws.amazon.com/ec2/instance-types/):

High-Memory Double Extra Large Instance

34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge

All machines are configured with no swap.

Here's the summary from the only completed load on Ubuntu:

** Index SPO->OSP: 685,552,449 slots indexed in 18,337.75 seconds [Rate:
37,384.76 per second]
-- Finish triples index phase
** 685,552,449 triples indexed in 37,063.51 seconds [Rate: 18,496.69 per
second]
-- Finish triples load
** Completed: 685,552,449 triples loaded in 78,626.27 seconds [Rate:
8,719.13 per second]
-- Finish quads load

Some resources I used:
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://arighi.blogspot.com/2008/10/fine-grained-dirtyratio-and.html
