[Pvfs2-developers] tuning the 2.6 kernels for write performance

Phil Carns Fri, 24 Mar 2006 06:51:36 -0800

Background:

This whole issue started off while trying to debug the PVFS2 stall/timeout problem that ended up being caused by the ext3 reservation bug... but we found some interesting things along the way.


One of the things we noticed while looking at the problem is that occasionally a Trove write operation would take much longer than expected, essentially stalling all I/O for a while. So we wrote some small benchmark programs to look at the issue outside of PVFS2. These benchmarks (in the cases shown here) write 8 G of data, 256K at a time. They show the stall also. We ended up changing some PVFS2 timeouts to avoid the problem (see earlier email).

We then started trying to figure out why the writes stall sometimes, because that seemed like a bad thing regardless of whether the timeouts could handle it or if the kernel bug was fixed :)


These tests look at three possibilities:

A. Is the AIO interface causing delays?

B. Is the Linux kernel waiting too long to start writing out its buffer cache?

C. Is the Linux kernel disk scheduler appropriate for PVFS2?

To test A:

The benchmark can run in 2 modes. The first uses AIO (as in PVFS2), allowing a maximum of 16 concurrent writes at a time. The second doesn't use AIO or threads at all, but instead does each write one at a time with the pwrite() function.
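For illustration, the pwrite() mode can be sketched roughly as follows. This is not the original benchmark source, just a minimal Python sketch: the function name pwrite_benchmark and its parameters are made up for this example, and it assumes a Unix system where os.pwrite() is available. It records the same three numbers reported in the results (maximum single write time, average single write time, total time).

```python
import os
import time


def pwrite_benchmark(path, total_bytes, block_size=256 * 1024):
    """Write total_bytes to path one block at a time with os.pwrite().

    Hypothetical helper, not the original benchmark.
    Returns (max_write_seconds, avg_write_seconds, total_seconds).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")

    block = b"\0" * block_size
    durations = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    try:
        offset = 0
        while offset < total_bytes:
            # The last block may be shorter than block_size.
            chunk = block[: min(block_size, total_bytes - offset)]
            t0 = time.perf_counter()
            written = os.pwrite(fd, chunk, offset)
            durations.append(time.perf_counter() - t0)
            # pwrite() may write fewer bytes than requested.
            offset += written
    finally:
        os.close(fd)
    total = time.perf_counter() - start
    return max(durations), sum(durations) / len(durations), total
```

Note that this sketch never calls fsync(), so each timed write only covers the pwrite() call itself, including any time the kernel makes the caller wait on writeback.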


To test B:

We can change this behavior by adjusting the /proc/sys/vm/dirty* files. They are documented in the Documentation/filesystems/proc.txt file in the Linux kernel source. The only one that really ended up being interesting for us (after trial and error) is the dirty_ratio file. The explanation given in the documentation is: "Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data." It defaults to 40, but some of the results below show what happens when it is set to 1. There is also a dirty_background_ratio file, which controls when pdflush decides to write out data in the background. That would seem to be the more desirable tweak, but it didn't have the effect that dirty_ratio did for some reason.
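As a rough aid for checking these settings, here is a small Python sketch that reads a value from /proc/sys/vm and estimates the threshold a given percentage implies. The helper names and the 4 GiB memory size are assumptions for illustration only (the test machine's memory size is not stated here), and the estimate takes the documentation's "percentage of total system memory" wording literally, which may not match exactly how the kernel computes it.

```python
import os


def read_vm_setting(name, proc_vm_dir="/proc/sys/vm"):
    """Return the integer value of a VM tunable, or None if it is unreadable.

    Hypothetical helper; name is e.g. "dirty_ratio" or
    "dirty_background_ratio".
    """
    try:
        with open(os.path.join(proc_vm_dir, name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def dirty_threshold_bytes(ratio_percent, total_memory_bytes):
    """Approximate amount of dirty data implied by a percentage setting."""
    return total_memory_bytes * ratio_percent // 100


if __name__ == "__main__":
    assumed_memory = 4 * 1024**3  # assumption: a 4 GiB machine
    for name in ("dirty_ratio", "dirty_background_ratio"):
        value = read_vm_setting(name)
        if value is None:
            print(f"{name}: not readable on this system")
        else:
            limit = dirty_threshold_bytes(value, assumed_memory)
            print(f"{name} = {value} (about {limit} bytes of {assumed_memory})")
```

On the assumed 4 GiB machine, a setting of 40 corresponds to about 1.6 GiB of dirty data and a setting of 1 to about 41 MiB, which gives a feel for how much data can pile up before the writer is made to flush it.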


To test C:

Reboot the machine with different I/O schedulers specified. The CFQ scheduler is the default, but we set it to the AS (anticipatory) scheduler using "elevator=as" on the kernel command line. The other scheduler options (deadline, noop) didn't change much. The schedulers also have tunable parameters in /sys/block/<DEVICE>/queue/iosched/*, but they didn't seem to impact much either. The schedulers are somewhat documented in the Documentation/block subdirectory in the Linux kernel source.
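To confirm which scheduler is actually in effect after a reboot, one option is to read /sys/block/<DEVICE>/queue/scheduler, which on kernels that provide it lists the available schedulers with the active one in square brackets. The Python sketch below assumes that file format; the function names and the device name "sda" are placeholders for this example.

```python
import os


def parse_active_scheduler(text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_active_scheduler(device, sys_block_dir="/sys/block"):
    """Return the active I/O scheduler for a block device, or None.

    Hypothetical helper; device is a name such as "sda".
    """
    path = os.path.join(sys_block_dir, device, "queue", "scheduler")
    try:
        with open(path) as f:
            return parse_active_scheduler(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    # "sda" is only a placeholder device name.
    print(read_active_scheduler("sda"))
```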

The results are listed below. The benchmarks show 3 things: the maximum time that any individual write took (during the course of the entire test run), the average individual write time, and the total benchmark time. Everything is shown in seconds.


The maximum single write time is what would have shown up as a long
"stall" in the PVFS2 I/O realm, so that is the most interesting value
in terms of our original problem.

A few things to point out:

- the choice of aio/pwrite didn't really matter a whole lot. Individual aio operations take longer than pwrite, but they are overlapped and end up giving basically the same overall throughput.

- the io scheduler and buffer cache settings can have a big impact

- this wasn't the point of the test, but in this particular setup the san is actually a little slower than local disk for writes (this is an old san setup)


local disk results:
- using the AS scheduler reduced the maximum stall time
significantly and improved total benchmark run time

- setting the dirty ratio to 1 further reduced the maximum stall time, but also seemed to increase the total benchmark run time a little (maybe there is a sweet spot between 40 and 1 for this value that doesn't penalize the throughput as much?)


san results:
- the AS scheduler didn't really help
- setting the dirty ratio to 1 reduced the maximum stall time significantly

Maximum single write time
-------------------------
                         default       AS            AS,dirty_ratio=1
aio local                30.874424     2.040070      0.907068
pwrite local             28.146439     4.423536      1.052867

aio san                  46.486595     46.813606     6.161530
pwrite san               17.991354     10.994622     6.119389

Average single write time
-------------------------
                         default       AS            AS,dirty_ratio=1
aio local                0.061520      0.057819      0.064450
pwrite local             0.003711      0.003567      0.004022

aio san                  0.095062      0.096853      0.095410
pwrite san               0.005551      0.005713      0.005619
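The point about overlapped aio operations can be checked with some quick arithmetic on the averages above. The Python sketch below divides the average aio write time by the 16-write concurrency limit and compares it with the pwrite average. It assumes all 16 slots stay busy for the whole run, which is a simplification, and the names in it are made up for this example.

```python
MAX_CONCURRENT_AIO = 16  # concurrency limit from the benchmark description

# Average single write times (seconds), "default" column of the table above.
avg_write_seconds = {
    "local": {"aio": 0.061520, "pwrite": 0.003711},
    "san": {"aio": 0.095062, "pwrite": 0.005551},
}


def effective_aio_seconds(avg_aio_seconds, concurrency=MAX_CONCURRENT_AIO):
    """Per-write cost of aio if `concurrency` writes always overlap."""
    return avg_aio_seconds / concurrency


for target, row in avg_write_seconds.items():
    effective = effective_aio_seconds(row["aio"])
    ratio = effective / row["pwrite"]
    print(f"{target}: aio/16 = {effective:.6f}s, "
          f"pwrite = {row['pwrite']:.6f}s, ratio = {ratio:.2f}")
```

Under that assumption the effective per-write cost of aio comes out within roughly 10% of the pwrite average for both local disk and san, which is consistent with the similar total benchmark times.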

Total benchmark time
-------------------------
                         default       AS            AS,dirty_ratio=1
aio local                252.018623    236.855234    264.018140
pwrite local             243.552892    234.140043    263.995362

aio san                  389.380213    396.724146    390.813488
pwrite san               364.203958    374.827604    368.691822
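To put the trade-off in relative terms, the percentage change against the default configuration can be computed straight from the tables. The Python sketch below does that for the aio rows; the function and variable names are only illustrative, and since each data point is a single run the percentages should be read loosely.

```python
def percent_change(baseline, value):
    """Percentage change of value relative to baseline (negative = lower)."""
    return (value - baseline) / baseline * 100.0


# (default, AS, AS with dirty_ratio=1), in seconds, copied from the tables above.
max_write_seconds = {
    "aio local": (30.874424, 2.040070, 0.907068),
    "aio san": (46.486595, 46.813606, 6.161530),
}
total_seconds = {
    "aio local": (252.018623, 236.855234, 264.018140),
    "aio san": (389.380213, 396.724146, 390.813488),
}

for name in max_write_seconds:
    default_max, _, tuned_max = max_write_seconds[name]
    default_total, _, tuned_total = total_seconds[name]
    print(f"{name}: max write time {percent_change(default_max, tuned_max):+.1f}%, "
          f"total time {percent_change(default_total, tuned_total):+.1f}% "
          f"(AS,dirty_ratio=1 vs default)")
```

For the aio local case this works out to a maximum stall about 97% shorter at the cost of roughly 5% more total run time.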

These results aren't super scientific: in all cases it is just one test run per data point, with no averaging. We also didn't exhaustively try many parameter combinations. This is also a write-only test; no telling what these parameters do to other workloads.

We don't really have time to follow through with this any further, but it does show that these VM and iosched settings might be interesting to tune in some cases.

If anyone has any similar experiences to share we would love to hear about it.


-Phil



_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
