Hey all -
I'm working on an interesting issue where I'm seeing ZFS being quite
cranky about writing O_SYNC written blocks.
Bottom line is that I have a small test case that does essentially this:
open file for writing -- O_SYNC
loop(
write() 8KB of random data
print time taken to write data
}
It's taking anywhere up to 80 seconds per 8KB block. When the 'problem'
is not in evidence, (and it's not always happening), I can do around
1200 O_SYNC writes per second...
It seems to be waiting here virtually all of the time:
> 0t11021::pid2proc | ::print proc_t p_tlist|::findstack -v
stack pointer for thread 30171352960: 2a118052df1
[ 02a118052df1 cv_wait+0x38() ]
02a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570,
60005db6640)
02a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840,
2000,
2a1180539d8)
02a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10,
300a27a9e48, 0,
7b7462d0)
02a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1)
02a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80,
ff3a00c0, ff3a0100)
02a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0,
ff3a0100)
And this also evident in a dtrace of it, following the write in...
<...>
28-> zil_commit
28 -> cv_wait
28-> thread_lock
28<- thread_lock
28-> cv_block
28 -> ts_sleep
28 <- ts_sleep
28 -> new_mstate
28-> cpu_update_pct
28 -> cpu_grow
28-> cpu_decay
28 -> exp_x
28 <- exp_x
28<- cpu_decay
28 <- cpu_grow
28<- cpu_update_pct
28 <- new_mstate
28 -> disp_lock_enter_high
28 <- disp_lock_enter_high
28 -> disp_lock_exit_high
28 <- disp_lock_exit_high
28<- cv_block
28-> sleepq_insert
28<- sleepq_insert
28-> disp_lock_exit_nopreempt
28<- disp_lock_exit_nopreempt
28-> swtch
28 -> disp
28-> disp_lock_enter
28<- disp_lock_enter
28-> disp_lock_exit
28<- disp_lock_exit
28-> disp_getwork
28<- disp_getwork
28-> restore_mstate
28<- restore_mstate
28 <- disp
28 -> pg_cmt_load
28 <- pg_cmt_load
28<- swtch
28-> resume
28 -> savectx
28-> schedctl_save
28<- schedctl_save
28 <- savectx
<...>
At this point, it waits for up to 80 seconds.
I'm also seeing zil_commit() being called around 7-15 times per second.
For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a
pinch of difference. :)
For what it's worth, this is a T2000, running Oracle, connected to an
HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle
filesystems, and I'm only seeing the issue on the ZFS filesystems that
have the active oracle tables on them.
The O_SYNC test case is just trying to help me understand what's
happening. The *real* problem is that oracle is running like rubbish
when it's trying to roll forward archive logs from another server. It's
an almost 100% write workload. At the moment, it cannot even keep up
with the other server's log creation rate, and it's barely doing
anything. (The other box is quite different, so not really valid for
direct comparison at this point).
6513020 looked interesting for a while, but I already have 120011-14 and
127111-03 and installed.
I'm looking into the cache flush settings of the 9990 array to see if
it's that killing me, but I'm also looking for any other ideas on what
might be hurting me.
I also have set
zfs:zfs_nocacheflush = 1
in /etc/system
The Oracle Logs are on a separate Zpool and I'm not seeing the issue on
those filesystems.
The lockstats I have run are not yet all that interesting. If anyone has
ideas on specific incantations I should use or some specific D or
anything else, I'd be most appreciative.
Cheers!
Nathan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss