Background:
This whole issue started while we were trying to debug the PVFS2
stall/timeout problem that turned out to be caused by the ext3
reservation bug... but we found some interesting things along the way.
One of the things we noticed while looking at the problem is that
occasionally a Trove write operation would take much longer than
expected, essentially stalling all I/O for a while. So we wrote
some small benchmark programs to look at the issue outside of
PVFS2. These benchmarks (in the cases shown here) write 8 GB of
data, 256 KB at a time, and they exhibit the same stall. We ended
up changing some PVFS2 timeouts to avoid the problem (see earlier
email).
We then started trying to figure out why the writes sometimes
stall, because that seemed like a bad thing regardless of whether
the timeouts could mask it or the kernel bug was fixed :)
These tests look at three possibilities:
A. Is the AIO interface causing delays?
B. Is the Linux kernel waiting too long to start writing out
its buffer cache?
C. Is the Linux kernel disk scheduler appropriate for PVFS2?
To test A:
The benchmark can run in two modes. The first uses AIO (as PVFS2
does), allowing at most 16 concurrent writes. The second doesn't
use AIO or threads at all, and instead issues each write one at a
time with pwrite().
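For reference, here is a minimal sketch of what the AIO mode could
look like using POSIX AIO; the file name, buffer handling, and
error handling are ours for illustration and not taken from the
actual benchmark. Per-write timing is left out here (see the
timing sketch further down). Build with something like
"gcc -O2 aiobench.c -lrt" (and -D_FILE_OFFSET_BITS=64 on 32-bit
systems, since the file grows to 8 GB).

/* AIO mode sketch: keep up to 16 writes in flight at once.
 * Hypothetical reconstruction, not the actual benchmark source. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NSLOTS     16                           /* max concurrent writes */
#define WRITE_SIZE (256 * 1024)                 /* 256 KB per write      */
#define NWRITES    ((8ULL << 30) / WRITE_SIZE)  /* 8 GB total            */

int main(void)
{
    static char buf[NSLOTS][WRITE_SIZE];
    struct aiocb cb[NSLOTS];
    const struct aiocb *list[NSLOTS];
    unsigned long long issued = 0, done = 0;
    int i, busy, fd;

    fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(cb, 0, sizeof(cb));
    for (i = 0; i < NSLOTS; i++)
        cb[i].aio_fildes = -1;                  /* -1 marks an idle slot */

    while (done < NWRITES) {
        /* fill idle slots with new writes */
        busy = 0;
        for (i = 0; i < NSLOTS; i++) {
            if (cb[i].aio_fildes == -1 && issued < NWRITES) {
                cb[i].aio_fildes = fd;
                cb[i].aio_buf    = buf[i];
                cb[i].aio_nbytes = WRITE_SIZE;
                cb[i].aio_offset = (off_t)(issued++ * WRITE_SIZE);
                if (aio_write(&cb[i]) != 0) { perror("aio_write"); return 1; }
            }
            /* NULL entries are ignored by aio_suspend() */
            list[i] = (cb[i].aio_fildes == -1) ? NULL : &cb[i];
            if (list[i]) busy++;
        }
        /* block until at least one outstanding write finishes */
        if (busy && aio_suspend(list, NSLOTS, NULL) != 0 && errno != EINTR) {
            perror("aio_suspend"); return 1;
        }
        /* reap completed writes */
        for (i = 0; i < NSLOTS; i++) {
            if (cb[i].aio_fildes == -1 || aio_error(&cb[i]) == EINPROGRESS)
                continue;
            if (aio_return(&cb[i]) != WRITE_SIZE) {
                fprintf(stderr, "write failed or was short\n"); return 1;
            }
            cb[i].aio_fildes = -1;              /* slot is idle again */
            done++;
        }
    }
    close(fd);
    return 0;
}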
To test B:
The kernel's writeback behavior can be changed by adjusting the
/proc/sys/vm/dirty* files, which are documented in
Documentation/filesystems/proc.txt in the Linux kernel source.
The only one that ended up being interesting for us (after some
trial and error) is the dirty_ratio file. The explanation given
in the documentation is: "Contains, as a percentage of total
system memory, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data."
It defaults to 40, but some of the results below show what happens
when it is set to 1. There is also a dirty_background_ratio file,
which controls when pdflush decides to write out data in the
background. That would seem to be the more desirable knob to
tweak, but for some reason it didn't have the effect that
dirty_ratio did.
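(For reference, these files can be changed at runtime as root,
e.g. "echo 1 > /proc/sys/vm/dirty_ratio", and read back with cat;
the new value takes effect immediately, so no reboot is needed to
experiment.)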
To test C:
Reboot the machine with different I/O schedulers specified. The
CFQ scheduler is the default, but we set it to the AS
(anticipatory) scheduler using "elevator=as" on the kernel command
line. The other scheduler options (deadline, noop) didn't change
much. The schedulers also have tunable parameters under
/sys/block/<DEVICE>/queue/iosched/*, but those didn't seem to have
much impact either. The schedulers are somewhat documented in the
Documentation/block subdirectory of the Linux kernel source.
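(With grub, for example, that means appending "elevator=as" to the
kernel line of your boot entry; the exact file and entry name
depend on the distribution. Newer kernels can also switch
schedulers at runtime by echoing the scheduler name into
/sys/block/<DEVICE>/queue/scheduler, if that file exists in your
kernel version.)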
The results are listed below. The benchmarks report three things:
the maximum time that any individual write took during the course
of the entire test run, the average individual write time, and the
total benchmark time. Everything is shown in seconds.
The maximum single write time is what would have shown up as a
long "stall" in the PVFS2 I/O realm, so that is the most
interesting value in terms of our original problem.
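To make the measurements concrete, the timing in the pwrite mode
boils down to something like the sketch below. This is again a
simplified reconstruction, not the actual benchmark source;
gettimeofday() granularity is more than enough for stalls measured
in seconds.

/* pwrite mode timing sketch (hypothetical reconstruction).
 * Compile with: gcc -O2 pwbench.c
 * (add -D_FILE_OFFSET_BITS=64 on 32-bit systems) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define WRITE_SIZE (256 * 1024)                 /* 256 KB per write */
#define NWRITES    ((8ULL << 30) / WRITE_SIZE)  /* 8 GB total       */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    static char buf[WRITE_SIZE];
    unsigned long long i;
    double start, t0, t1, max = 0.0, sum = 0.0;
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    start = now();
    for (i = 0; i < NWRITES; i++) {
        t0 = now();
        if (pwrite(fd, buf, WRITE_SIZE, (off_t)(i * WRITE_SIZE))
            != WRITE_SIZE) {
            perror("pwrite");
            return 1;
        }
        t1 = now();
        sum += t1 - t0;
        if (t1 - t0 > max)                      /* track the worst "stall" */
            max = t1 - t0;
    }
    close(fd);

    printf("max single write: %f s\n", max);
    printf("avg single write: %f s\n", sum / NWRITES);
    printf("total benchmark:  %f s\n", now() - start);
    return 0;
}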
A few things to point out:
- The choice of AIO vs. pwrite didn't matter much. Individual AIO
operations take longer than pwrite calls, but they are overlapped
and end up giving basically the same overall throughput.
- The I/O scheduler and buffer cache settings can have a big
impact.
- This wasn't the point of the test, but in this particular setup
the SAN is actually a little slower than local disk for writes
(it is an old SAN setup).
Local disk results:
- Using the AS scheduler reduced the maximum stall time
significantly and improved the total benchmark run time.
- Setting dirty_ratio to 1 further reduced the maximum stall time,
but also seemed to increase the total benchmark run time a little
(maybe there is a sweet spot between 40 and 1 for this value that
doesn't penalize the throughput as much?).
SAN results:
- The AS scheduler didn't really help.
- Setting dirty_ratio to 1 reduced the maximum stall time
significantly.
Maximum single write time
-------------------------
                 default       AS            AS,dirty_ratio=1
aio    local   30.874424      2.040070      0.907068
pwrite local   28.146439      4.423536      1.052867
aio    san     46.486595     46.813606      6.161530
pwrite san     17.991354     10.994622      6.119389

Average single write time
-------------------------
                 default       AS            AS,dirty_ratio=1
aio    local    0.061520      0.057819      0.064450
pwrite local    0.003711      0.003567      0.004022
aio    san      0.095062      0.096853      0.095410
pwrite san      0.005551      0.005713      0.005619

Total benchmark time
--------------------
                 default       AS            AS,dirty_ratio=1
aio    local   252.018623    236.855234    264.018140
pwrite local   243.552892    234.140043    263.995362
aio    san     389.380213    396.724146    390.813488
pwrite san     364.203958    374.827604    368.691822
These results aren't super scientific: in all cases it is just one
test run per data point, with no averaging. We also didn't
exhaustively try many parameter combinations. This is also a
write-only test; there is no telling what these parameters do to
other workloads.
We don't really have time to follow through on this any further,
but it does show that these VM and iosched settings might be
worth tuning in some cases.
If anyone has similar experiences to share, we would love to
hear about them.
-Phil