This stuff seems kind of promising:

http://www.bullopensource.org/posix/

They have a POSIX-compliant aio interface sitting on top of the kernel aio interface. They also have some kernel patches to eliminate the O_DIRECT requirement (and alignment limitations?) that normally come along with using the new kernel aio interface.

I think someone (maybe one of the SUSE versions?) integrated something like this into their glibc release, but it doesn't look like the official glibc maintainers are going to do anything with it until it matures quite a bit more.

I'm too worried about stability to play with it, but I am curious how it performs.

I agree about focusing on some of these ideas later. For our purposes, we stopped fiddling with the alternate aio idea once it didn't offer significant immediate help.

-Phil

Rob Ross wrote:
Hey,

I have been constantly disappointed with the amount of work we have to do to try to get decent performance out of the Linux VFS stack and GNU libc, and this is just the latest example of that problem.

Thanks for sharing your experience with us. I keep thinking that the Linux VFS async I/O support is going to get to the point where we can use it instead of relying on GNU's implementation, and that somehow that will perform considerably better. Has anyone checked the status of the internal aio calls lately?

In the meantime, I think we should focus on some of these tuning parameters that appear to get us immediate gains with relatively small effort. Once we've extracted what we can out of those (and the DB changes), we can come back to this one.

I wonder if, a year or two from now, we will wish we had just managed the storage on our own? There are a couple of groups out there that are looking at other ways to manage the storage (UNL, FIU) that might help us on this front as well.

Regards,

Rob

Phil Carns wrote:

As far as the AIO stuff goes, we kicked around an idea here that didn't really help the workloads that we were looking at in this case, but it may help something else.

If you look at what aio does, it spawns off threads for each fd, up to a limit that is tunable by calling aio_init() (using the aio_threads field) but defaults to 16. A given thread runs through each of its aiocb arrays calling pread or pwrite as appropriate. Any time it finishes an array, it spawns a new detached thread to trigger the callback function that the caller supplied for notification. There is a certain amount of locking, queueing, etc. associated with doing all of this. Threads time out and exit after a while, then get re-spawned when someone posts more aio work.
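For reference, bumping that limit looks roughly like this (just a sketch; the numbers are example values, not recommendations):

#define _GNU_SOURCE
#include <aio.h>

/* must run before the first aio request is posted */
static void tune_aio_pool(void)
{
    struct aioinit init = { 0 };
    init.aio_threads = 32;  /* allow more service threads than the default mentioned above */
    init.aio_num = 64;      /* hint: expected number of simultaneous requests */
    aio_init(&init);
}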

What you could do instead (and what we tried) was to implement a replacement for lio_listio that is much simpler. Instead of the above mechanism, when you call lio_listio it just immediately spawns off a detached thread, which needs no locks or queues. The thread does the preads or pwrites as needed then invokes the callback function itself and exits.
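In code, the idea is roughly this (a minimal sketch, not our actual hack; simple_lio_listio is a made-up name, error propagation is skipped, and only LIO_NOWAIT with a SIGEV_THREAD callback is handled):

#define _GNU_SOURCE
#include <aio.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

struct listio_job {
    struct aiocb **list;
    int nent;
    struct sigevent sig;    /* caller's completion callback */
};

static void *listio_worker(void *arg)
{
    struct listio_job *job = arg;

    /* do the I/O directly; return values ignored in this sketch */
    for (int i = 0; i < job->nent; i++) {
        struct aiocb *cb = job->list[i];
        if (cb->aio_lio_opcode == LIO_WRITE)
            pwrite(cb->aio_fildes, (const void *)cb->aio_buf,
                   cb->aio_nbytes, cb->aio_offset);
        else if (cb->aio_lio_opcode == LIO_READ)
            pread(cb->aio_fildes, (void *)cb->aio_buf,
                  cb->aio_nbytes, cb->aio_offset);
    }

    /* notify the caller the same way a SIGEV_THREAD callback would */
    if (job->sig.sigev_notify == SIGEV_THREAD)
        job->sig.sigev_notify_function(job->sig.sigev_value);

    free(job);
    return NULL;
}

int simple_lio_listio(int mode, struct aiocb *const list[], int nent,
                      struct sigevent *sig)
{
    (void)mode;  /* LIO_NOWAIT assumed */

    struct listio_job *job = malloc(sizeof(*job));
    if (!job)
        return -1;
    job->list = (struct aiocb **)list;
    job->nent = nent;
    job->sig.sigev_notify = SIGEV_NONE;
    if (sig)
        job->sig = *sig;

    /* one detached thread per call; no queues or locks */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    pthread_t tid;
    int ret = pthread_create(&tid, &attr, listio_worker, job);
    pthread_attr_destroy(&attr);
    return ret ? -1 : 0;
}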

We could get away with something like this in PVFS2 because:
- we don't care what order the reads/writes get serviced in at the Trove level, so there is no need to queue for semantic reasons (the request scheduler already provides the semantics we require)
- it doesn't cause any extra thread use that wasn't already there; the normal aio implementation already spawns a new thread for every callback (and thread creation is relatively cheap these days anyway with NPTL)
- Trove already limits the maximum number of AIOs in progress to 16, so there isn't any danger of spawning too many threads
- we don't care about the other notification methods (signals, polling, etc.) and don't use any other significant aio API functions

I guess at that point there really isn't any particular reason to even bother mimicking the aio API, except that it makes it easy to plug into the existing Trove code.

Our try at this was just a quick hack (someone would have to tinker more to make sure it propagates error codes, handles array sizes > 1, etc.).

For what we were looking at, it wasn't really any faster than normal AIO, so we shelved the idea for now. I still think it might be interesting for some workload or another if someone took the time to implement it properly and do more testing.

-Phil


Avery Ching wrote:

Phil, I've done some tests for noncontiguous I/O comparing lio_listio,
aio_read/aio_write, and normal read/write.  In cases where there are a
lot of noncontiguous regions, lio_listio and aio tend to really fall
behind, at least an order of magnitude slower than normal read/write.
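The two access paths being compared look roughly like this (a sketch with made-up region lists, not the actual test code):

#define _GNU_SOURCE
#include <aio.h>
#include <string.h>
#include <unistd.h>

/* one aiocb per noncontiguous region, submitted in a single batch */
int write_regions_listio(int fd, char *buf, off_t *offsets,
                         size_t *lens, int nregions)
{
    struct aiocb cbs[nregions];
    struct aiocb *list[nregions];

    for (int i = 0; i < nregions; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = buf;            /* source data for this region */
        cbs[i].aio_nbytes = lens[i];
        cbs[i].aio_offset = offsets[i];
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
        buf += lens[i];
    }
    return lio_listio(LIO_WAIT, list, nregions, NULL);
}

/* the "normal write" case: one pwrite per region */
int write_regions_pwrite(int fd, char *buf, off_t *offsets,
                         size_t *lens, int nregions)
{
    for (int i = 0; i < nregions; i++) {
        if (pwrite(fd, buf, lens[i], offsets[i]) != (ssize_t)lens[i])
            return -1;
        buf += lens[i];
    }
    return 0;
}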

Avery

On Fri, 2006-03-24 at 10:49 -0600, Rob Ross wrote:

Nice Phil. I saw this exact same sort of stalling eight years ago on grendel at Clemson! But we didn't have alternative schedulers and the like to play with at the time.

It might be worth our time to explore the dirty_ratio value a little more in the context of both I/O and metadata tests. Perhaps once the DBPF changes are merged in we can spend some time on this?

Rob

Phil Carns wrote:

Background:

This whole issue started off while trying to debug the PVFS2 stall/timeout problem that ended up being caused by the ext3 reservation bug... but we found some interesting things along the way.

One of the things we noticed while looking at the problem is that
occasionally a Trove write operation would take much longer than expected, essentially stalling all I/O for a while. So we wrote some small benchmark programs to look at the issue outside of PVFS2. The benchmarks (in the cases shown here) write 8 GB of data, 256 KB at a time, and they show the stall as well. We ended up changing some PVFS2 timeouts to avoid the problem (see earlier email).

We then started trying to figure out why the writes stall sometimes, because that seemed like a bad thing regardless of whether the timeouts could handle it or if the kernel bug was fixed :)

These tests look at three possibilities:

A. Is the AIO interface causing delays?
B. Is the Linux kernel waiting too long to start writing out its buffer cache?
C. Is the Linux kernel disk scheduler appropriate for PVFS2?

To test A:

The benchmark can run in two modes. The first uses AIO (as in PVFS2), allowing a maximum of 16 concurrent writes at a time. The second doesn't use AIO or threads at all; instead it does each write one at a time with pwrite().
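The pwrite mode looks roughly like this (a sketch only, to show the shape of the benchmark; the file name and flags are made up):

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (256 * 1024)                  /* 256 KB per write */
#define TOTAL (8LL * 1024 * 1024 * 1024)    /* 8 GB total */

int main(void)
{
    char *buf = malloc(CHUNK);
    if (!buf) return 1;
    memset(buf, 'x', CHUNK);

    int fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    double max = 0.0, total = 0.0;
    long long nwrites = TOTAL / CHUNK;

    for (long long i = 0; i < nwrites; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        if (pwrite(fd, buf, CHUNK, i * (long long)CHUNK) != CHUNK) {
            perror("pwrite");
            return 1;
        }
        gettimeofday(&t1, NULL);
        /* track per-write time to catch the long stalls */
        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        total += dt;
        if (dt > max)
            max = dt;
    }

    printf("max single write: %f s, avg: %f s\n", max, total / nwrites);
    free(buf);
    close(fd);
    return 0;
}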

To test B:

We can change this behavior by adjusting the /proc/sys/vm/dirty* files. They are documented in Documentation/filesystems/proc.txt in the Linux kernel source. The only one that ended up being interesting for us (after trial and error) is the dirty_ratio file. The explanation given in the documentation is: "Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data." It defaults to 40, but some of the results below show what happens when it is set to 1. There is also a dirty_background_ratio file, which controls when pdflush decides to write out data in the background. That would seem to be the more desirable tweak, but for some reason it didn't have the effect that dirty_ratio did.
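Setting it just amounts to writing the new value into that proc file as root; for example (the value 1 here simply mirrors the runs below):

#include <stdio.h>

int main(void)
{
    /* equivalent to "echo 1 > /proc/sys/vm/dirty_ratio"; needs root */
    FILE *f = fopen("/proc/sys/vm/dirty_ratio", "w");
    if (!f) {
        perror("open /proc/sys/vm/dirty_ratio");
        return 1;
    }
    fprintf(f, "1\n");
    return fclose(f) ? 1 : 0;
}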

To test C:

Reboot the machine with a different I/O scheduler specified. The CFQ scheduler is the default, but we switched to the AS (anticipatory) scheduler by putting "elevator=as" on the kernel command line. The other scheduler options (deadline, noop) didn't change much. The schedulers
also have tunable parameters in /sys/block/<DEVICE>/queue/iosched/*,
but they didn't seem to have much impact either. The schedulers are somewhat
documented in the Documentation/block subdirectory of the Linux kernel
source.
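To double-check which scheduler is actually active after booting, you can read /sys/block/<DEVICE>/queue/scheduler; the active one is shown in brackets. A quick sketch (the device name is just an example):

#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    if (!f) {
        perror("scheduler");
        return 1;
    }
    if (fgets(line, sizeof(line), f))
        printf("%s", line);   /* e.g. "noop anticipatory deadline [cfq]" */
    fclose(f);
    return 0;
}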

The results are listed below. The benchmarks report three things: the maximum time that any individual write took over the course of the entire test run, the average individual write time, and the total benchmark time. All times are in seconds.

The maximum single write time is what would have shown up as a long
"stall" in the PVFS2 I/O realm, so that is the most interesting value
in terms of our original problem.

A few things to point out:

- the choice of aio/pwrite didn't really matter a whole lot. Individual aio operations take longer than pwrite, but they are overlapped and end up giving basically the same overall throughput.
- the I/O scheduler and buffer cache settings can have a big impact
- this wasn't the point of the test, but in this particular setup the SAN is actually a little slower than local disk for writes (it is an old SAN setup)

local disk results:
- using the AS scheduler reduced the maximum stall time
significantly and improved total benchmark run time
- setting the dirty ratio to 1 further reduced the maximum stall time, but also seemed to increase the total benchmark run time a little (maybe there is a sweet spot between 40 and 1 for this value that doesn't penalize the throughput as much?)

SAN results:
- the AS scheduler didn't really help
- setting the dirty ratio to 1 reduced the maximum stall time significantly

Maximum single write time
-------------------------
                        default       AS            AS,dirty_ratio=1
aio local                30.874424     2.040070      0.907068
pwrite local             28.146439     4.423536      1.052867

aio san                  46.486595     46.813606     6.161530
pwrite san               17.991354     10.994622     6.119389

Average single write time
-------------------------
                        default       AS            AS,dirty_ratio=1
aio local                0.061520      0.057819      0.064450
pwrite local             0.003711      0.003567      0.004022

aio san                  0.095062      0.096853      0.095410
pwrite san               0.005551      0.005713      0.005619

Total benchmark time
-------------------------
                        default       AS            AS,dirty_ratio=1
aio local                252.018623    236.855234    264.018140
pwrite local             243.552892    234.140043    263.995362

aio san                  389.380213    396.724146    390.813488
pwrite san               364.203958    374.827604    368.691822

These results aren't super scientific; in all cases it is just one test run per data point with no averaging. We also didn't exhaustively try many parameter combinations. This is also a write-only test; there is no telling what these parameters do to other workloads.

We don't really have time to follow through with this any further, but it does show that these VM and iosched settings might be interesting to tune in some cases.

If anyone has any similar experiences to share, we would love to hear about them.

-Phil



_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
