[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.

Nathan Kroenert Wed, 06 Feb 2008 22:55:09 -0800

Hey all -

I'm working on an interesting issue where I'm seeing ZFS being quite 
cranky about writing O_SYNC written blocks.


Bottom line is that I have a small test case that does essentially this:

open file for writing  -- O_SYNC
loop(
        write() 8KB of random data
        print time taken to write data
}

It's taking anywhere up to 80 seconds per 8KB block. When the 'problem' 
is not in evidence, (and it's not always happening), I can do around 
1200 O_SYNC writes per second...

It seems to be waiting here virtually all of the time:

 > 0t11021::pid2proc | ::print proc_t p_tlist|::findstack -v
stack pointer for thread 30171352960: 2a118052df1
[ 000002a118052df1 cv_wait+0x38() ]
   000002a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570,
   60005db6640)
   000002a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 
2000,
   2a1180539d8)
   000002a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 
300a27a9e48, 0,
   7b7462d0)
   000002a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1)
   000002a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80,
   ff3a00c0, ff3a0100)
   000002a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0,
   ff3a0100)

And this also evident in a dtrace of it, following the write in...

<...>
  28        -> zil_commit
  28          -> cv_wait
  28            -> thread_lock
  28            <- thread_lock
  28            -> cv_block
  28              -> ts_sleep
  28              <- ts_sleep
  28              -> new_mstate
  28                -> cpu_update_pct
  28                  -> cpu_grow
  28                    -> cpu_decay
  28                      -> exp_x
  28                      <- exp_x
  28                    <- cpu_decay
  28                  <- cpu_grow
  28                <- cpu_update_pct
  28              <- new_mstate
  28              -> disp_lock_enter_high
  28              <- disp_lock_enter_high
  28              -> disp_lock_exit_high
  28              <- disp_lock_exit_high
  28            <- cv_block
  28            -> sleepq_insert
  28            <- sleepq_insert
  28            -> disp_lock_exit_nopreempt
  28            <- disp_lock_exit_nopreempt
  28            -> swtch
  28              -> disp
  28                -> disp_lock_enter
  28                <- disp_lock_enter
  28                -> disp_lock_exit
  28                <- disp_lock_exit
  28                -> disp_getwork
  28                <- disp_getwork
  28                -> restore_mstate
  28                <- restore_mstate
  28              <- disp
  28              -> pg_cmt_load
  28              <- pg_cmt_load
  28            <- swtch
  28            -> resume
  28              -> savectx
  28                -> schedctl_save
  28                <- schedctl_save
  28              <- savectx
<...>

At this point, it waits for up to 80 seconds.

I'm also seeing zil_commit() being called around 7-15 times per second.

For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a 
pinch of difference. :)

For what it's worth, this is a T2000, running Oracle, connected to an 
HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle 
filesystems, and I'm only seeing the issue on the ZFS filesystems that 
have the active oracle tables on them.

The O_SYNC test case is just trying to help me understand what's 
happening. The *real* problem is that oracle is running like rubbish 
when it's trying to roll forward archive logs from another server. It's 
an almost 100% write workload. At the moment, it cannot even keep up 
with the other server's log creation rate, and it's barely doing 
anything. (The other box is quite different, so not really valid for 
direct comparison at this point).

6513020 looked interesting for a while, but I already have 120011-14 and 
127111-03 and installed.

I'm looking into the cache flush settings of the 9990 array to see if 
it's that killing me, but I'm also looking for any other ideas on what 
might be hurting me.

I also have set
        zfs:zfs_nocacheflush = 1
in /etc/system

The Oracle Logs are on a separate Zpool and I'm not seeing the issue on 
those filesystems.

The lockstats I have run are not yet all that interesting. If anyone has 
ideas on specific incantations I should use or some specific D or 
anything else, I'd be most appreciative.

Cheers!

Nathan.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.

Reply via email to