tuning vfs.zfs.vdev.max_pending and solving the issue of ZFS writes choking read IO

2010-03-24 Thread Dan Naumov
Hello

I am having a slight issue (and judging by Google results, similar
issues have been seen by other FreeBSD and Solaris/OpenSolaris users)
with writes choking read IO. The issue I am having is described
pretty well here:
http://opensolaris.org/jive/thread.jspa?threadID=106453 It seems that
under heavy write load, ZFS likes to aggregate a really huge amount of
data before actually writing it to disks, resulting in sudden 10+
second stalls where it frantically tries to commit everything,
completely choking read IO in the process and sometimes even the
network (with a large enough write to a mirror pool using dd, I can
cause my SSH sessions to drop dead without actually running out of
RAM; as soon as the data is committed, I can reconnect).
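
For what it's worth, the bursty pattern is easy to watch from another
terminal while the write is running, with something like this (the pool
name and interval are just examples):

  zpool iostat tank 1
  gstat -a

During the flush you can see writes spike while reads drop to
essentially nothing, which matches what the applications experience.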

Beyond the issue of system interactivity (or rather, the
near-disappearance thereof) during these enormous flushes, this kind
of pattern seems really inefficient from the CPU utilization point of
view. Instead of a relatively stable and consistent flow of reads and
writes, allowing the CPU to be utilized as much as possible, the CPU
basically sits idle for 10+ seconds (or however long the flush takes)
while the system is committing the data, and the commit of unwritten
data to the pool seems to completely starve any read operations.

Has anyone done any extensive testing of the effects of tuning
vfs.zfs.vdev.max_pending on this issue? Is there some universally
recommended value beyond the default 35? Anything else I should be
looking at?
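
For reference, I assume the knob can be adjusted either at runtime or
from /boot/loader.conf along these lines (the value 10 is just an
arbitrary example, not a recommendation):

  sysctl vfs.zfs.vdev.max_pending=10

or, in /boot/loader.conf (which should work even if the sysctl turns
out to be read-only at runtime):

  vfs.zfs.vdev.max_pending="10"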


- Sincerely,
Dan Naumov


Re: tuning vfs.zfs.vdev.max_pending and solving the issue of ZFS writes choking read IO

2010-03-24 Thread Bob Friesenhahn

On Wed, 24 Mar 2010, Dan Naumov wrote:

> Has anyone done any extensive testing of the effects of tuning
> vfs.zfs.vdev.max_pending on this issue? Is there some universally
> recommended value beyond the default 35? Anything else I should be
> looking at?


The vdev.max_pending value is primarily used to tune for SAN/HW-RAID 
LUNs and is used to dial down LUN service time (svc_t) values by 
limiting the number of pending requests.  It is not terribly useful 
for decreasing stalls due to zfs writes.  In order to reduce the 
impact of zfs writes, you want to limit the maximum size of a zfs 
transaction group (TXG).  I don't know what the FreeBSD tunable is for 
this, but under Solaris it is zfs:zfs_write_limit_override.
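
On Solaris the override can be set in /etc/system, something like this
(the value is just an example, in bytes, here 512MB), taking effect
after a reboot:

  set zfs:zfs_write_limit_override=0x20000000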


On a large-memory system, a properly working zfs should not saturate 
the write channel for more than 5 seconds.  Zfs tries to learn the 
write bandwidth so that it can tune the TXG size up to 5 seconds (max) 
worth of writes.  If you have both large memory and fast storage, 
quite a huge amount of data can be written in 5 seconds.  On my 
Solaris system, I found that zfs was quite accurate with its rate 
estimation, but it resulted in four gigabytes of data being written 
per TXG.
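
To put rough numbers on that: with a learned write bandwidth of around
800 MB/s, the 5 second target works out to about 800 MB/s * 5 s = 4 GB
of dirty data per TXG.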


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: tuning vfs.zfs.vdev.max_pending and solving the issue of ZFS writes choking read IO

2010-03-24 Thread Dan Nelson
In the last episode (Mar 24), Bob Friesenhahn said:
> On Wed, 24 Mar 2010, Dan Naumov wrote:
>> Has anyone done any extensive testing of the effects of tuning
>> vfs.zfs.vdev.max_pending on this issue?  Is there some universally
>> recommended value beyond the default 35?  Anything else I should be
>> looking at?
>
> The vdev.max_pending value is primarily used to tune for SAN/HW-RAID LUNs
> and is used to dial down LUN service time (svc_t) values by limiting the
> number of pending requests.  It is not terribly useful for decreasing
> stalls due to zfs writes.  In order to reduce the impact of zfs writes,
> you want to limit the maximum size of a zfs transaction group (TXG).  I
> don't know what the FreeBSD tunable is for this, but under Solaris it is
> zfs:zfs_write_limit_override.

There isn't a sysctl for it by default, but the following patch will enable
a vfs.zfs.write_limit_override sysctl:

Index: dsl_pool.c
===================================================================
RCS file: /home/ncvs/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c,v
retrieving revision 1.4.2.1
diff -u -p -r1.4.2.1 dsl_pool.c
--- dsl_pool.c  17 Aug 2009 09:55:58 -0000      1.4.2.1
+++ dsl_pool.c  11 Mar 2010 08:34:27 -0000
@@ -47,6 +47,11 @@ uint64_t zfs_write_limit_inflated = 0;
 uint64_t zfs_write_limit_override = 0;
 extern uint64_t zfs_write_limit_min;
 
+SYSCTL_DECL(_vfs_zfs);
+SYSCTL_QUAD(_vfs_zfs, OID_AUTO, write_limit_override, CTLFLAG_RW,
+    &zfs_write_limit_override, 0,
+    "Force a txg if dirty buffers exceed this value (bytes)");
+
 kmutex_t zfs_write_limit_lock;
 
 static pgcnt_t old_physmem = 0;
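
With that in place, the limit can be changed on the fly; for example,
to cap dirty data per TXG at 256MB:

  sysctl vfs.zfs.write_limit_override=268435456

Setting it back to 0 restores the default auto-sized behaviour.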

 
> On a large-memory system, a properly working zfs should not saturate
> the write channel for more than 5 seconds.  Zfs tries to learn the
> write bandwidth so that it can tune the TXG size up to 5 seconds (max)
> worth of writes.  If you have both large memory and fast storage,
> quite a huge amount of data can be written in 5 seconds.  On my
> Solaris system, I found that zfs was quite accurate with its rate
> estimation, but it resulted in four gigabytes of data being written
> per TXG.

I had similar problems on a 32GB Solaris server at work.  Note that with
compression enabled, the entire system pauses while it compresses the
outgoing block of data.  It's just a fraction of a second, but long enough
for end-users to complain about bad performance in X sessions.  I had to
throttle back to a 256MB write limit size to make the stuttering go away
completely.  It didn't affect write throughput much at all.
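
On Solaris I adjusted it on the running system with mdb rather than
rebooting; if I remember the syntax right, 256MB looks like this:

  echo 'zfs_write_limit_override/Z 0x10000000' | mdb -kw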

-- 
Dan Nelson
dnel...@allantgroup.com


Re: tuning vfs.zfs.vdev.max_pending and solving the issue of ZFS writes choking read IO

2010-03-24 Thread Bob Friesenhahn

On Wed, 24 Mar 2010, Dan Nelson wrote:


> I had similar problems on a 32GB Solaris server at work.  Note that with
> compression enabled, the entire system pauses while it compresses the
> outgoing block of data.  It's just a fraction of a second, but long enough
> for end-users to complain about bad performance in X sessions.  I had to
> throttle back to a 256MB write limit size to make the stuttering go away
> completely.  It didn't affect write throughput much at all.


Apparently this was a kernel thread priority problem in Solaris, and
it has been fixed in recent versions of OpenSolaris.  The fix required
adding a scheduling class which allows the kernel thread doing the
compression to run at a lower priority than normal user processes
(such as the X11 server).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: tuning vfs.zfs.vdev.max_pending and solving the issue of ZFS writes choking read IO

2010-03-24 Thread Ivan Voras

Dan Naumov wrote:

> Hello
>
> I am having a slight issue (and judging by Google results, similar
> issues have been seen by other FreeBSD and Solaris/OpenSolaris users)
> with writes choking read IO. The issue I am having is described
> pretty well here:
> http://opensolaris.org/jive/thread.jspa?threadID=106453 It seems that
> under heavy write load, ZFS likes to aggregate a really huge amount of
> data before actually writing it to disks, resulting in sudden 10+
> second stalls where it frantically tries to commit everything,
> completely choking read IO in the process and sometimes even the
> network (with a large enough write to a mirror pool using dd, I can
> cause my SSH sessions to drop dead without actually running out of
> RAM; as soon as the data is committed, I can reconnect).


Mostly a wild guess, but can you test whether this patch helps with
the choking of your network and ssh sessions:


http://people.freebsd.org/~ivoras/diffs/spa.c.diff

?

You can then fiddle with the vfs.zfs.zio_worker_threads_count loader 
tunable to see if it helps more.
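
If the patch applies cleanly, the tunable would go into
/boot/loader.conf; the sensible range depends on the patch itself, so
treat the value below purely as a placeholder:

  vfs.zfs.zio_worker_threads_count="4"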

