What JVM do you use on the machines?
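
It may also help to dump the kernel writeback settings on each box so the two can be compared side by side. Here is a rough sketch, assuming Linux with /proc/sys/vm readable. The function name read_dirty_settings is made up for illustration, and the file names are the standard vm.dirty_* sysctls:

```python
import os

# Writeback-related sysctls under /proc/sys/vm (standard Linux names).
DIRTY_KEYS = [
    "dirty_bytes",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_writeback_centisecs",
    "dirty_expire_centisecs",
]


def read_dirty_settings(base="/proc/sys/vm"):
    """Return {name: int} for each writeback setting found under base.

    Hypothetical helper: 'base' is a parameter only so the function can be
    pointed at another directory (e.g. a copy taken from another machine).
    """
    settings = {}
    for key in DIRTY_KEYS:
        path = os.path.join(base, key)
        try:
            with open(path) as f:
                settings[key] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that are missing or unreadable on this kernel.
            continue
    return settings


if __name__ == "__main__":
    for name, value in sorted(read_dirty_settings().items()):
        print(f"vm.{name} = {value}")
```

Running this on both the CentOS and the Ubuntu instance and diffing the output should show whether the defaults differ between the two kernels.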
On Wed, Jun 15, 2011 at 11:23 AM, Richard Francis <[email protected]> wrote:
> Hi,
>
> I'm using two identical machines in EC2 running tdbloader on CentOS (CentOS
> release 5 (Final)) and Ubuntu 11.04 (natty).
>
> I've observed an issue where CentOS will run happily at a consistent speed
> and complete a load of 650 million triples in around 12 hours, whereas the
> load on Ubuntu tails off after just 15 million triples and gets
> progressively slower.
>
> On initial observation of the Ubuntu machine I noticed that the flush-202
> process was running quite high; running iostat also showed that I/O was the
> real bottleneck - with the Ubuntu machine showing constant use of the
> disk for both reads and writes (the CentOS machine had periods of no usage
> followed by periods of writes). This led me to investigate how memory was
> being used by the Ubuntu machine - and a few blog posts / tutorials later I
> found a couple of settings to tweak - the first I tried
> was dirty_writeback_centisecs - setting this to 0 had an immediate positive
> effect on the load that I was performing - but after some more testing I
> found that the problem was just put back to around 80 million triples before
> I saw a drop-off in performance.
>
> This led me to investigate whether there was the same issue with tdbloader2.
> From my observations I got the same problem - but this time around 150m
> triples.
>
> Again - I focused on "dirty" settings - and this time tweaking dirty_bytes
> = 30000000000 and dirty_background_bytes = 15000000000 saw a massive
> performance increase and for the vast part of the add phase of the tdbloader
> it kept up with the CentOS machine.
>
> Finally, last night I stopped all loads, and raced the CentOS machine and
> the Ubuntu machine - both have completed - but the CentOS machine (around 12
> hours) was still far quicker than the Ubuntu machine (20 hours).
>
> So my questions are:
> - has anyone else observed this?
> - can anyone suggest any further improvements - or things to try?
> - what is the best OS to perform a tdbload on?
>
> Rich
>
>
> Tests were performed on three different machines, 1 x CentOS and 2 x Ubuntu -
> to rule out EC2 being a bottleneck - all were (from
> http://aws.amazon.com/ec2/instance-types/)
>
> High-Memory Double Extra Large Instance
>
> 34.2 GB of memory
> 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each)
> 850 GB of instance storage
> 64-bit platform
> I/O Performance: High
> API name: m2.2xlarge
> All machines are configured with no swap
>
> Here's the summary from the only completed load on Ubuntu:
>
> ** Index SPO->OSP: 685,552,449 slots indexed in 18,337.75 seconds [Rate:
> 37,384.76 per second]
> -- Finish triples index phase
> ** 685,552,449 triples indexed in 37,063.51 seconds [Rate: 18,496.69 per
> second]
> -- Finish triples load
> ** Completed: 685,552,449 triples loaded in 78,626.27 seconds [Rate:
> 8,719.13 per second]
> -- Finish quads load
>
> Some resources I used:
> http://www.westnet.com/~gsmith/content/linux-pdflush.htm
> http://arighi.blogspot.com/2008/10/fine-grained-dirtyratio-and.html
>

--
Marco Neumann
KONA
