Ben Rockwood wrote:
> Mark Maybee wrote:
>> Ben Rockwood wrote:
>>> I need some help with clarification.
>>>
>>> My understanding is that there are 2 instances in which ZFS will write
>>> to disk:
>>> 1) TXG Sync
>>> 2) ZIL
>>>
>>> Post-snv_87, a TXG should sync out when it is either overfilled or
>>> hits the 30-second timeout.
>>>
>>> First question is... is there some place I can see what this max TXG
>>> size is?  If I recall, it's 1/8th of system memory... but there has to
>>> be a counter somewhere, right?
>>>
>> There is both a memory throttle limit (enforced in arc_memory_throttle)
>> and a write throughput throttle limit (calculated in dsl_pool_sync(),
>> enforced in dsl_pool_tempreserve_space()).  The write limit is stored as
>> the 'dp_write_limit' for each pool.
> 
> I cooked up the following:
> $ dtrace -qn 'fbt::dsl_pool_sync:entry {
>     printf("Throughput is \t %d\n write limit is\t %d\n\n",
>         args[0]->dp_throughput, args[0]->dp_write_limit); }'
> Throughput is    883975129
>  write limit is  3211748352
> 
> I'm confused about the units here and how to interpret these values.
> 
> For instance, the write limit here is almost 3GB on a system with 4GB
> of RAM.  However, if I read the code right, the value here is already
> inflated *6... so the real write limit is actually 510MB, right?
> 
The write_limit is independent of the memory size.  It's based purely
on the IO bandwidth available to the pool.  So a write_limit of 3GB
implies that we think we can push 3GB of (inflated) data to the drives
in 5 seconds.  If we take out the inflation, this means we think we can
push about 100MB/s to the pool's drives.
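
As a back-of-the-envelope check with your numbers (using the 6x
inflation factor you mentioned and the 5 second sync target):

    3,211,748,352 bytes / 6 (inflation)  ~= 510 MB of real data per txg
    510 MB / 5 s                         ~= 100 MB/s of pool bandwidth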

> As for the throughput, I need verification... I think the unit here is
> bytes per second?
> 
Correct.
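
(Untested, but if it's easier to eyeball, you can rescale it in the
same one-liner -- this is just your probe with the value divided down
to MB/s:)

$ dtrace -qn 'fbt::dsl_pool_sync:entry {
    printf("throughput: %d MB/s\n", args[0]->dp_throughput / 1048576); }'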
> 
>>> I'm unclear on ZIL writes.  I think that they happen independently of
>>> the normal txg rotation, but I'm not sure.
>>>
>>> So the second question is: do they happen with a TXG sync (expedited)
>>> or independent of the normal TXG sync flow?
>>>
>>> Finally, I'm unclear on exactly what constitutes a TXG Stall.  I had
>>> assumed that it indicated TXGs that exceeded the allotted time, but
>>> after some dtracing I'm uncertain.
>>>
>> I'm not certain what you mean by "TXG Stall".
> 
> I refer to the following code, which I'm having some trouble properly
> understanding:
> 
>     boolean_t
>     txg_stalled(dsl_pool_t *dp)
>     {
>             tx_state_t *tx = &dp->dp_tx;
>             return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
>     }
> 
Ah.  A "stall" in this context means that the sync phase is idle,
waiting for the next txg to quiesce... so the current train
is "stalled" until the quiesce finishes.
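
If you want to watch for that condition directly, something along
these lines might work (an untested sketch -- it assumes the fbt
provider can see txg_stalled() on your bits and that arg1 on the
return probe carries its boolean result):

$ cat stalled.d
fbt::txg_stalled:return
/arg1 != 0/
{
        /* a caller just observed the pipeline stalled on quiesce */
        @stalls = count();
}

tick-10sec
{
        printa("stalls observed in the last 10s: %@d\n", @stalls);
        trunc(@stalls);
}
$ dtrace -qs stalled.d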
> 
> 
> Ultimately, what this all comes down to is finding a reliable way to
> determine when ZFS is struggling.  I'm currently watching (on pre-87)
> txg sync times, and if a txg sync exceeds something like 4 seconds I
> know there is trouble brewing.  I'm considering whether watching either
> txg_stalled or txg_delay may be a better way to flag trouble.
> 
Stalled tends to mean that there is something happening "up top"
preventing things from moving (i.e., a tx not closing).  Delay is
used when we are trying to push more data than the pool can handle,
so that may be what you want to look at.
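
As a rough sketch (untested; it assumes the fbt provider can see
txg_delay() on your bits), counting how often writers get delayed
would look something like:

$ dtrace -qn 'fbt::txg_delay:entry { @delays = count(); }
    tick-10sec { printa("writer delays in the last 10s: %@d\n", @delays);
        trunc(@delays); }'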

> dp_throughput looks like it also might be a good candidate, although
> unfortunately it was only added in snv_98, so it doesn't help a lot of
> my existing installs.  Nevertheless, graphing this value could be very
> telling, and it would be nice to have it available as a kstat.
> 
Agreed.

> The intended result is to have a reliable means of monitoring ZFS
> health (via a standard monitoring framework such as Nagios or Zabbix),
> and from my studies simply watching traditional values via iostat isn't
> the best method.  If the ZIL is either disabled or pushing to a SLOG,
> then watching the breathing of TXG syncs should be all that's really
> important to me, at least on the write side... that's my theory anyway.
> Feel free to flog me. :)
> 
Nope, that makes sense to me.  Ideally, we should either be chugging
along at 5-to-30 second intervals between syncs with no delays (i.e.
a light IO load), or we should be doing consistent 5s syncs with a
few delays seen (max capacity).  If you start seeing lots of delays,
you are probably trying to push too much data.
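
If it's useful, here is the sort of thing I'd watch for sync times (an
untested sketch; it assumes the fbt provider can see spa_sync(), which
is where each txg actually gets written out):

$ cat txgtime.d
fbt::spa_sync:entry
{
        /* spa_sync() runs once per txg in the sync thread */
        self->start = timestamp;
}

fbt::spa_sync:return
/self->start/
{
        printf("txg sync took %d ms\n", (timestamp - self->start) / 1000000);
        self->start = 0;
}
$ dtrace -qs txgtime.d

Sync times that sit consistently near 5 seconds, with delays showing
up alongside them, would be the "max capacity" case above.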

Note that we are still tuning this code.  Recently we discovered that
we may want to change the throughput calculation to include more of the
IO "setup" time (we currently don't include the dsl_dataset_sync()
calls in the calculation).

> Thank you very much for your help, Mark!
> 
> benr.
