Murali Vilayannur wrote:
Hi Phil,
First of all, great work!
There are 2 other parameters that I thought could also make an impact
a) Choice of file-system (this was investigated by Nathan last year, I
think) as well as choice of journaling modes.

We looked at this a little bit again recently, but not in the context of buffer cache saturation. I don't have numbers handy, but I can share some general impressions. It still looks like data=writeback is probably the fastest mode in general, although it isn't entirely clear to me what the filesystem integrity tradeoff is. I had high hopes for data=journal after reading about experiences other people had (outside of the PVFS2 world), but it was a bust. data=journal did better than the default options for Berkeley DB and metadata-intensive access, but absolutely stunk for I/O throughput.

All of these tests were done with default ext3 options.
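
In case it's useful, here's a minimal sketch of what mounting a storage space with data=writeback looks like through mount(2). The device and mount point are hypothetical examples, not what was used in these tests:

    /* sketch only: mount an ext3 storage space with data=writeback,
     * equivalent to "mount -t ext3 -o data=writeback /dev/sdb1 /pvfs2-storage" */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("/dev/sdb1", "/pvfs2-storage", "ext3", 0, "data=writeback") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }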

b) In the case of storage spaces created directly on top of IDE/SCSI disks
or attached to local RAID controllers, there must be a way to enable hd
parameters like write-caching/TCQ at the disks (hdparm -W /dev/?, ...),
or maybe they are on by default? (Although that might conflict with the
goal of stability of data on disks.)
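
For anyone curious, this is roughly what hdparm -W 1 does for an IDE disk, via the HDIO_SET_WCACHE ioctl. The device path below is just an example, and whether the cache is already on by default depends on the drive:

    /* sketch only: turn on the drive's write cache, roughly what
     * "hdparm -W 1 /dev/hda" does for an IDE disk */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>

    int main(void)
    {
        int fd = open("/dev/hda", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* 1 enables write caching, 0 disables it (safer for on-disk data) */
        if (ioctl(fd, HDIO_SET_WCACHE, 1) != 0)
            perror("HDIO_SET_WCACHE");
        close(fd);
        return 0;
    }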

It is interesting that time (i.e. dirty_writeback_centisecs/
dirty_expire_centisecs) is not as important a criterion as the VM ratio
for such write-intensive workloads.

I was surprised by this too. I was just ad hoc turning these values up and down, so there is a good chance that I missed something, but I couldn't get any of those values to improve performance.


A. Is the AIO interface causing delays?
B. Is the linux kernel waiting too long to start writing out its buffer cache?
C. Is the linux kernel disk scheduler appropriate for PVFS2?

We can change this behavior by adjusting the /proc/sys/vm/dirty* files.
They are documented in the Documentation/filesystems/proc.txt file in
the linux kernel source.  The only one that really ended up being
interesting for us (after trial and error) is the dirty_ratio file.  The
explanation given in the documentation is: "Contains, as a percentage of
total system memory, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.".  It
defaults to 40, but some of the results below show what happens when it
is set to 1.  There is also a dirty_background_ratio file, which
controls when pdflush decides to write out data in the background.  That
would seem to be the more desirable tweak, but it didn't have the effect
that dirty_ratio did for some reason.
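
For concreteness, here is a minimal sketch of setting that knob, the same thing as "echo 1 > /proc/sys/vm/dirty_ratio" (the value 1 just matches the setting mentioned above):

    /* sketch only: set vm.dirty_ratio to 1 */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/sys/vm/dirty_ratio", "w");
        if (!fp) {
            perror("fopen");
            return 1;
        }
        /* percent of total memory at which a process doing writes
         * starts flushing its own dirty pages */
        fprintf(fp, "1\n");
        fclose(fp);
        return 0;
    }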


Could this be because the pdflush daemon does not wake up regularly enough to
start flushing things out? Or does pdflush wake up either
a) when the timeout passes, or
b) when the ratio is reached?

I am guessing it only wakes up when the timeout passes, but I don't really know. I tried cranking down the dirty_background_ratio in conjunction with reducing those *centisecs values so it would wake up quicker and start writing out, but it still didn't help like the dirty_ratio did.
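
For reference, a sketch of that combination; the values are made-up examples, not the exact ones from these runs:

    /* sketch only: lower the background ratio and shorten the pdflush
     * intervals so writeback starts sooner */
    #include <stdio.h>

    static void set_vm_knob(const char *name, const char *value)
    {
        char path[128];
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
        fp = fopen(path, "w");
        if (!fp) {
            perror(path);
            return;
        }
        fprintf(fp, "%s\n", value);
        fclose(fp);
    }

    int main(void)
    {
        set_vm_knob("dirty_background_ratio", "2");
        set_vm_knob("dirty_writeback_centisecs", "100");  /* pdflush wakeup interval */
        set_vm_knob("dirty_expire_centisecs", "500");     /* age before dirty data is flushed */
        return 0;
    }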

Hey, one other thing that struck me. How much memory was there on this
machine? Is this on an IA-32 machine running a kernel with CONFIG_HIGHMEM?
Thanks!
Murali

This was actually a dual-proc (appears to be 4 procs with hyperthreading) Xeon box running in x86_64 mode, with 4 GB of memory.

-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
