On Apr 2, 2010, at 5:03 AM, Edward Ned Harvey wrote:

>>> Seriously, all disks configured WriteThrough (spindle and SSD disks
>>> alike)
>>> using the dedicated ZIL SSD device, very noticeably faster than
>>> enabling the
>>> WriteBack.
>> 
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>> 
>> I mean if you have both why not use both? Then both async and sync IO
>> benefits.
> 
> Interesting, but unfortunately false.  Soon I'll post the results here.  I
> just need to package them in a way suitable to give the public, and stick it
> on a website.  But I'm fighting IT fires for now and haven't had the time
> yet.
> 
> Roughly speaking, the following are approximately representative.  Of course
> it varies based on tweaks of the benchmark and stuff like that.
>       Stripe 3 mirrors write through:  450-780 IOPS
>       Stripe 3 mirrors write back:  1030-2130 IOPS
>       Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
>       Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

Thanks for sharing these interesting numbers.
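For anyone who wants to generate that kind of figure on their own pool,
a loop of small writes with an fsync() after each one is enough to drive
the sync-write path. A minimal sketch (this is not the benchmark behind
the numbers above; the path, write size, and count are placeholders):

import os, time

PATH   = "/tank/testfile"    # put this on the pool under test (placeholder)
WRITES = 5000                # number of small sync writes (placeholder)
BLOCK  = b"x" * 4096         # 4 KB per write (placeholder)

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
start = time.time()
for i in range(WRITES):
    os.write(fd, BLOCK)
    os.fsync(fd)             # force a synchronous commit, i.e. the ZIL/slog path
elapsed = time.time() - start
os.close(fd)
print("%.0f synchronous write IOPS" % (WRITES / elapsed))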

> Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
> ZIL is 3-4 times faster than naked disk.  And for some reason, having the
> WriteBack enabled while you have SSD ZIL actually hurts performance by
> approx 10%.  You're better off to use the SSD ZIL with disks in Write
> Through mode.

YMMV. The ZFS write workload is best characterized by looking at the
txg commit.  In a very short period of time ZFS sends a lot[1] of write
I/O to the vdevs, and it is not surprising that this can blow through the
relatively small caches on controllers. Once you blow through the cache,
you experience the [in]efficiency of the disks behind the cache as well
as the [in]efficiency of the cache controller itself. Alas, little seems
to be published about how those caches work.
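A back-of-envelope sketch of the blow-through (every number here is an
assumption chosen only to illustrate the point, not a measurement):

offer_mb_s = 800.0     # rate ZFS can push a commit over the bus (assumed)
drain_mb_s = 6 * 80.0  # aggregate sustained rate of the spindles behind the cache (assumed)
cache_mb   = 256.0     # controller write-back cache (assumed)

if offer_mb_s > drain_mb_s:
    t_full = cache_mb / (offer_mb_s - drain_mb_s)
    print("cache absorbs the burst for only %.1f seconds" % t_full)
    print("after that, writes drain at the disks' %d MB/s with the "
          "controller still in the middle" % drain_mb_s)
else:
    print("the cache never fills at these rates")

At those (made up) rates the cache buys well under a second of a txg
commit before the disks behind it set the pace.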

Changing to write-through effectively changes the G/M/1 queue [2]
at the controller to a G/M/n queue at the disks.  Sorta like:
        1. write-back controller
                (ZFS) N*#vdev I/Os --> controller --> disks
                (ZFS) M/M/n --> G/M/1 --> M/M/n

        2. write-through controller
                (ZFS) N*#vdev I/Os  --> disks
                (ZFS) M/M/n  --> G/M/n

This can simply be a case of the middleman becoming the bottleneck.

[1] a "lot" means up to 35 I/Os per vdev for older releases, 4-10 I/Os per
vdev for more recent releases

[2] queuing theory enthusiasts will note that ZFS writes do not exhibit an
exponential arrival rate at the controller or disks except for sync writes.
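A toy way to see the cost of the extra stage is to replay the same burst
through both topologies. The burst size, disk count, and service times
below are assumptions picked only to show the shape of the problem, not
measurements of any real controller:

import heapq

BURST  = 600     # I/Os in one txg commit (assumed)
NDISKS = 6       # spindles, i.e. the "n" servers (assumed)
T_DISK = 5.0     # ms per I/O at a disk (assumed)
T_CTRL = 1.0     # ms of controller handling per I/O once its cache is full (assumed)

def n_servers(arrivals, n, svc):
    """FIFO jobs onto n identical servers; returns per-job completion times."""
    free = [0.0] * n                       # next-free time of each server
    heapq.heapify(free)
    done = []
    for t in arrivals:
        start = max(t, heapq.heappop(free))
        heapq.heappush(free, start + svc)
        done.append(start + svc)
    return done

burst = [0.0] * BURST

# write-through: ZFS -> n disk queues (the G/M/n-ish case)
wt = n_servers(burst, NDISKS, T_DISK)

# write-back with its cache already blown through:
# ZFS -> controller (single queue) -> n disk queues
ctrl_out = n_servers(burst, 1, T_CTRL)
wb = n_servers(ctrl_out, NDISKS, T_DISK)

print("mean completion, write-through: %6.1f ms" % (sum(wt) / BURST))
print("mean completion, write-back:    %6.1f ms" % (sum(wb) / BURST))

If T_CTRL is larger than T_DISK/NDISKS, the single controller stage hands
off I/Os more slowly than the disks can collectively retire them, and the
middleman is the bottleneck; make it smaller and the two cases converge.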

> That result is surprising to me.  But I have a theory to explain it.  When
> you have WriteBack enabled, the OS issues a small write, and the HBA
> immediately returns to the OS:  "Yes, it's on nonvolatile storage."  So the
> OS quickly gives it another, and another, until the HBA write cache is full.
> Now the HBA faces the task of writing all those tiny writes to disk, and the
> HBA must simply follow orders, writing a tiny chunk to the sector it said it
> would write, and so on.  The HBA cannot effectively consolidate the small
> writes into a larger sequential block write.  But if you have the WriteBack
> disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
> SSD, and immediately return to the process:  "Yes, it's on nonvolatile
> storage."  So the application can issue another, and another, and another.
> ZFS is smart enough to aggregate all these tiny write operations into a
> single larger sequential write before sending it to the spindle disks.  

I agree, though this paragraph has 3 different thoughts embedded.
Taken separately:
        1. queuing surprises people :-)
        2. writeback inserts a middleman with its own queue
        3. separate logs radically change the write workload seen by
           the controller and disks (back-of-envelope sketch below)
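For thought #3, a sketch of what the spindles see with and without a
separate log. Every rate and size below is an assumption, and it ignores
the ZIL's own aggregation of log blocks:

sync_writes_s = 2000    # small sync writes/s from the application (assumed)
write_kb      = 4       # application write size (assumed)
txg_interval  = 5.0     # seconds between txg commits (assumed)
agg_io_kb     = 128     # size ZFS aggregates dirty data into (assumed)

# the same data gets committed either way
dirty_kb_per_txg = sync_writes_s * write_kb * txg_interval
txg_iops = (dirty_kb_per_txg / agg_io_kb) / txg_interval

# without a slog the spindles absorb every ZIL write as it happens, plus
# the txg commit; with a slog the SSD takes the small writes and the
# spindles only see the aggregated commit
print("spindle write IOPS without slog: %6.0f" % (sync_writes_s + txg_iops))
print("spindle write IOPS with slog:    %6.0f" % txg_iops)

With numbers like these the spindles go from thousands of small writes
per second to a few dozen large ones, which lines up with the observation
above that write-back buys little once the slog is doing its job.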

> Long story short, the evidence suggests if you have SSD ZIL, you're better
> off without WriteBack on the HBA.  And I conjecture the reasoning behind it
> is because ZFS can write buffer better than the HBA can.

I think the way the separate log works is orthogonal. However, not 
having a separate log can influence the ability of the controller and
disks to respond to read requests during this workload.  

Perhaps this is a long way around to saying that a well tuned system
will have harmony among its parts.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 
