On Apr 2, 2010, at 5:03 AM, Edward Ned Harvey wrote:

>>> Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
>>> using the dedicated ZIL SSD device, very noticeably faster than enabling
>>> the WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO
>> benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. I
> just need to package them in a way suitable to give the public, and stick it
> on a website. But I'm fighting IT fires for now and haven't had the time
> yet.
>
> Roughly speaking, the following are approximately representative. Of course
> it varies based on tweaks of the benchmark and stuff like that.
>
> Stripe 3 mirrors write through:            450-780 IOPS
> Stripe 3 mirrors write back:              1030-2130 IOPS
> Stripe 3 mirrors write back + SSD ZIL:    1220-2480 IOPS
> Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS

Thanks for sharing these interesting numbers.
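As a quick sanity check, here is a back-of-the-envelope ratio calculation
against the write-through baseline (a rough Python sketch; pairing the low and
high endpoints of each range is my assumption, not how the benchmark was
reduced):

    # Quoted IOPS ranges, compared against the write-through (no slog) baseline
    results = {
        "write through":           (450, 780),
        "write back":              (1030, 2130),
        "write back + SSD ZIL":    (1220, 2480),
        "write through + SSD ZIL": (1840, 2490),
    }
    base_lo, base_hi = results["write through"]
    for config, (lo, hi) in results.items():
        print(f"{config:25}  {lo / base_lo:.1f}x - {hi / base_hi:.1f}x")

That roughly reproduces the 2-3x and 3-4x multipliers quoted below.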
> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD
> ZIL is 3-4 times faster than naked disk. And for some reason, having the
> WriteBack enabled while you have SSD ZIL actually hurts performance by
> approx 10%. You're better off to use the SSD ZIL with disks in Write
> Through mode. YMMV.

The write workload for ZFS is best characterized by looking at the txg commit.
In a very short period of time ZFS sends a lot[1] of write I/O to the vdevs.
It is not surprising that this can blow through the relatively small caches on
controllers. Once you blow through the cache, then the [in]efficiency of the
disks behind the cache is experienced as well as the [in]efficiency of the
cache controller. Alas, little public information seems to be published
regarding how those caches work.

Changing to write-through effectively changes the G/M/1 queue [2] at the
controller to a G/M/n queue at the disks. Sorta like:

1. write-back controller
   (ZFS) N*#vdev I/Os --> controller --> disks
   (ZFS) M/M/n --> G/M/1 --> M/M/n

2. write-through controller
   (ZFS) N*#vdev I/Os --> disks
   (ZFS) M/M/n --> G/M/n

This can simply be a case of the middleman becoming the bottleneck.

[1] a "lot" means up to 35 I/Os per vdev for older releases, 4-10 I/Os per
    vdev for more recent releases
[2] queuing theory enthusiasts will note that ZFS writes do not exhibit an
    exponential arrival rate at the controller or disks except for sync
    writes.

> That result is surprising to me. But I have a theory to explain it. When
> you have WriteBack enabled, the OS issues a small write, and the HBA
> immediately returns to the OS: "Yes, it's on nonvolatile storage." So the
> OS quickly gives it another, and another, until the HBA write cache is full.
> Now the HBA faces the task of writing all those tiny writes to disk, and the
> HBA must simply follow orders, writing a tiny chunk to the sector it said it
> would write, and so on. The HBA cannot effectively consolidate the small
> writes into a larger sequential block write. But if you have the WriteBack
> disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
> SSD, and immediately return to the process: "Yes, it's on nonvolatile
> storage." So the application can issue another, and another, and another.
> ZFS is smart enough to aggregate all these tiny write operations into a
> single larger sequential write before sending it to the spindle disks.

I agree, though this paragraph has 3 different thoughts embedded. Taken
separately:

1. queuing surprises people :-)
2. writeback inserts a middleman with its own queue
3. separate logs radically change the write workload seen by the controller
   and disks

> Long story short, the evidence suggests if you have SSD ZIL, you're better
> off without WriteBack on the HBA. And I conjecture the reasoning behind it
> is because ZFS can write buffer better than the HBA can.

I think the way the separate log works is orthogonal. However, not having a
separate log can influence the ability of the controller and disks to respond
to read requests during this workload. Perhaps this is a long way around to
saying that a well tuned system will have harmony among its parts.
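To put the middleman point above in concrete terms, here is a rough sketch
with made-up numbers (the disk count and per-I/O service times are
illustrative assumptions, not measurements). While the burst fits in the write
cache the controller stage looks free, but once a txg commit blows through it,
sustained throughput is capped by the slowest stage in the pipeline, and the
single controller queue can be that stage:

    # Made-up per-stage numbers; only the structure of the pipeline matters here
    n_disks   = 6        # e.g. a stripe of 3 mirrors
    t_disk_ms = 5.0      # assumed per-I/O service time at each disk
    t_ctrl_ms = 1.2      # assumed per-I/O cost through the controller's single queue

    disk_iops = n_disks * 1000.0 / t_disk_ms   # n disks draining in parallel (write-through)
    ctrl_iops = 1000.0 / t_ctrl_ms             # one controller queue in front of them (write-back)

    print(f"write-through path (disks only): {disk_iops:.0f} IOPS")
    print(f"controller stage alone:          {ctrl_iops:.0f} IOPS")
    print(f"write-back path, end to end:     {min(disk_iops, ctrl_iops):.0f} IOPS")

Dropping write-back takes that stage out of the pipeline entirely; the disks'
aggregate rate is then what you get.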
--
richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com