Hi,

I'm using two identical machines in EC2 to run tdbloader, one on CentOS
(CentOS release 5 (Final)) and one on Ubuntu 11.04 (natty).

I've observed an issue where CentOS runs happily at a consistent speed
and completes a load of 650 million triples in around 12 hours, whereas the
load on Ubuntu tails off after just 15 million triples and gets
progressively slower from there.

On initial observation of the Ubuntu machine I noticed that the flush-202
process was running quite high, and iostat showed that I/O was the real
bottleneck: the Ubuntu machine showed constant disk use for both reads and
writes, whereas the CentOS machine had periods of no usage followed by
periods of writes. This led me to investigate how the Ubuntu machine was
using memory, and a few blog posts and tutorials later I found a couple of
settings to tweak. The first I tried was dirty_writeback_centisecs. Setting
this to 0 had an immediate positive effect on the load, but after some more
testing I found that the problem was only pushed back: performance now
dropped off at around 80 million triples.
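
For anyone wanting to compare the two machines directly, here is a minimal
Python sketch that dumps the kernel's writeback settings. It assumes the
standard Linux /proc/sys/vm layout; the function name and the `base`
parameter are my own, added only so the code can be pointed at another
directory.

```python
from pathlib import Path

# Writeback-related sysctls exposed under /proc/sys/vm on Linux.
DIRTY_KEYS = (
    "dirty_writeback_centisecs",
    "dirty_expire_centisecs",
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_background_bytes",
)


def read_dirty_settings(base="/proc/sys/vm"):
    """Return a dict of setting name -> int value (None if unreadable)."""
    settings = {}
    for key in DIRTY_KEYS:
        path = Path(base) / key
        try:
            settings[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing file (older kernel, non-Linux host) or unexpected content.
            settings[key] = None
    return settings


if __name__ == "__main__":
    for key, value in read_dirty_settings().items():
        print(f"{key} = {value}")
```

Running this on both machines and diffing the output should show whether the
two distributions start from different defaults.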

This led me to investigate whether tdbloader2 had the same issue. From my
observations it did, but this time the drop-off came at around 150 million
triples.

Again I focused on the "dirty" settings. This time, setting dirty_bytes
= 30000000000 and dirty_background_bytes = 15000000000 gave a massive
performance increase, and for most of the add phase of the tdbloader run it
kept up with the CentOS machine.
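
To put those two values in context, here is a small Python sketch that
expresses them as a fraction of the machine's memory. It assumes the
instance's "34.2 GB" means decimal gigabytes; the helper name is
hypothetical.

```python
# Assumption: the instance's "34.2 GB of memory" is read as decimal gigabytes.
RAM_BYTES = 34.2e9


def fraction_of_ram(threshold_bytes, ram_bytes=RAM_BYTES):
    """Return a byte threshold as a fraction of total RAM."""
    return threshold_bytes / ram_bytes


dirty_bytes = 30_000_000_000
dirty_background_bytes = 15_000_000_000

print(f"dirty_bytes:            {fraction_of_ram(dirty_bytes):.1%} of RAM")
print(f"dirty_background_bytes: {fraction_of_ram(dirty_background_bytes):.1%} of RAM")
```

Under that assumption both thresholds are a large share of RAM, so the kernel
is allowed to hold most of memory as unwritten pages before it forces
writers to wait.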

Finally, last night I stopped all loads and raced the CentOS machine against
the Ubuntu machine. Both have completed, but the CentOS machine (around 12
hours) was still far quicker than the Ubuntu machine (20 hours).

So my questions are: has anyone else observed this? Can anyone suggest any
further improvements or things to try? And what is the best OS to perform a
tdbload on?

Rich


Tests were performed on three different machines (1 x CentOS and 2 x Ubuntu)
to rule out EC2 being a bottleneck. All were of the following type (from
http://aws.amazon.com/ec2/instance-types/):

High-Memory Double Extra Large Instance

34.2 GB of memory
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
850 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.2xlarge

All machines are configured with no swap.

Here's the summary from the only completed load on Ubuntu:

** Index SPO->OSP: 685,552,449 slots indexed in 18,337.75 seconds [Rate: 37,384.76 per second]
-- Finish triples index phase
** 685,552,449 triples indexed in 37,063.51 seconds [Rate: 18,496.69 per second]
-- Finish triples load
** Completed: 685,552,449 triples loaded in 78,626.27 seconds [Rate: 8,719.13 per second]
-- Finish quads load
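
As a cross-check on these figures, the sketch below parses a summary line
and converts the elapsed time to hours; the completed load's 78,626.27
seconds works out to roughly 21.8 hours. The regular expression is my own
guess at the line shape, based only on the three lines above, and is not
taken from the tdbloader source.

```python
import re

# Pattern inferred from the summary lines above (an assumption, not an
# official tdbloader log format).
SUMMARY_RE = re.compile(
    r"([\d,]+) (?:triples|slots) \w+ in ([\d,.]+) seconds "
    r"\[Rate: ([\d,.]+) per second\]"
)


def parse_summary(line):
    """Extract count, seconds, reported rate and hours from one summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    count = int(match.group(1).replace(",", ""))
    seconds = float(match.group(2).replace(",", ""))
    rate = float(match.group(3).replace(",", ""))
    return {
        "count": count,
        "seconds": seconds,
        "rate": rate,
        "hours": seconds / 3600,
    }


line = ("** Completed: 685,552,449 triples loaded in 78,626.27 seconds "
        "[Rate: 8,719.13 per second]")
info = parse_summary(line)
print(f"{info['count']:,} triples in {info['hours']:.1f} hours")
print(f"recomputed rate: {info['count'] / info['seconds']:,.2f} per second")
```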

Some resources I used:
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
http://arighi.blogspot.com/2008/10/fine-grained-dirtyratio-and.html
