Hey all - I'm working on an interesting issue where I'm seeing ZFS being quite cranky about writing O_SYNC written blocks.
Bottom line is that I have a small test case that does essentially this: open file for writing -- O_SYNC loop( write() 8KB of random data print time taken to write data } It's taking anywhere up to 80 seconds per 8KB block. When the 'problem' is not in evidence, (and it's not always happening), I can do around 1200 O_SYNC writes per second... It seems to be waiting here virtually all of the time: > 0t11021::pid2proc | ::print proc_t p_tlist|::findstack -v stack pointer for thread 30171352960: 2a118052df1 [ 000002a118052df1 cv_wait+0x38() ] 000002a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570, 60005db6640) 000002a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 2000, 2a1180539d8) 000002a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 300a27a9e48, 0, 7b7462d0) 000002a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1) 000002a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100) 000002a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0, ff3a0100) And this also evident in a dtrace of it, following the write in... <...> 28 -> zil_commit 28 -> cv_wait 28 -> thread_lock 28 <- thread_lock 28 -> cv_block 28 -> ts_sleep 28 <- ts_sleep 28 -> new_mstate 28 -> cpu_update_pct 28 -> cpu_grow 28 -> cpu_decay 28 -> exp_x 28 <- exp_x 28 <- cpu_decay 28 <- cpu_grow 28 <- cpu_update_pct 28 <- new_mstate 28 -> disp_lock_enter_high 28 <- disp_lock_enter_high 28 -> disp_lock_exit_high 28 <- disp_lock_exit_high 28 <- cv_block 28 -> sleepq_insert 28 <- sleepq_insert 28 -> disp_lock_exit_nopreempt 28 <- disp_lock_exit_nopreempt 28 -> swtch 28 -> disp 28 -> disp_lock_enter 28 <- disp_lock_enter 28 -> disp_lock_exit 28 <- disp_lock_exit 28 -> disp_getwork 28 <- disp_getwork 28 -> restore_mstate 28 <- restore_mstate 28 <- disp 28 -> pg_cmt_load 28 <- pg_cmt_load 28 <- swtch 28 -> resume 28 -> savectx 28 -> schedctl_save 28 <- schedctl_save 28 <- savectx <...> At this point, it waits for up to 80 seconds. I'm also seeing zil_commit() being called around 7-15 times per second. For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a pinch of difference. :) For what it's worth, this is a T2000, running Oracle, connected to an HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle filesystems, and I'm only seeing the issue on the ZFS filesystems that have the active oracle tables on them. The O_SYNC test case is just trying to help me understand what's happening. The *real* problem is that oracle is running like rubbish when it's trying to roll forward archive logs from another server. It's an almost 100% write workload. At the moment, it cannot even keep up with the other server's log creation rate, and it's barely doing anything. (The other box is quite different, so not really valid for direct comparison at this point). 6513020 looked interesting for a while, but I already have 120011-14 and 127111-03 and installed. I'm looking into the cache flush settings of the 9990 array to see if it's that killing me, but I'm also looking for any other ideas on what might be hurting me. I also have set zfs:zfs_nocacheflush = 1 in /etc/system The Oracle Logs are on a separate Zpool and I'm not seeing the issue on those filesystems. The lockstats I have run are not yet all that interesting. If anyone has ideas on specific incantations I should use or some specific D or anything else, I'd be most appreciative. Cheers! Nathan. _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss