Hey Peter,

Thanks for taking a look! You've pointed out many things I've considered.

I initially pitched this version as exactly what you described: a
convenience wrapper around KSTAT_TYPE_NAMED kstats, simply to reduce the
amount of code copied around the kernel to implement such a thing. Your
criticisms are perfectly accurate. The initial design wasn't intended to
be useful for anything other than tracking the duration of longer-running
kernel operations (on the order of hundreds to thousands of milliseconds).
A more flexible and generic way to create kstat histograms would be
fantastic.

I will happily consider a broader approach to this, which might involve
more changes to the kstat framework and, obviously, more work. It's
exciting that you could so easily point out the value of having such a
thing.

I just have a couple of comments on your suggestions.

> Looking at that, the data includes not just the buckets, but also
> ancillary data (the total_ms and count fields). As someone who writes
> tools to consume this data, that implies a lot of additional logic,
> rather than just being able to assume that every data item is a bucket.
>
This is definitely true. However, tracking the number of operations and
the sum of their values makes deducing a precise average trivial
(total_ms / count). Without these two fields it would not be possible to
compute a precise average of the observed values. As the buckets get
'larger' (1000ms, 10000ms), precision is lost more quickly. I view fields
like these as a way of compensating for the lack of precision inherent in
a bucket design. The count field is almost unnecessary, except for cases
where observed values fall below the smallest bucket (less than 10ms in
the example).
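
For illustration, here is a minimal consumer-side sketch using the
existing libkstat API. The "testmodule"/"my_latency_stat" names and the
field names follow the hypothetical example further down in the thread:

#include <kstat.h>
#include <stdio.h>

/* Build with -lkstat. */
int
main(void)
{
        kstat_ctl_t *kc = kstat_open();
        kstat_named_t *total, *count;
        kstat_t *ksp;

        if (kc == NULL)
                return (1);
        ksp = kstat_lookup(kc, "testmodule", 0, "my_latency_stat");
        if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1)
                return (1);

        total = kstat_data_lookup(ksp, "total_ms");
        count = kstat_data_lookup(ksp, "count");
        if (total != NULL && count != NULL && count->value.ui64 != 0) {
                /* Exact mean; the buckets alone can only bound it. */
                (void) printf("avg = %llu ms\n", (unsigned long long)
                    (total->value.ui64 / count->value.ui64));
        }
        (void) kstat_close(kc);
        return (0);
}

With only the buckets, the best a consumer could do is bound the mean to
within an order of magnitude.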

In a more complex kstat histogram design it might also be useful for
callers to store arbitrary metadata in the kstat. The zone_vfs example I
provided does this with the 'zonename' field, for example. If we were
tracking a distribution of packet sizes we might want to include 'vnic' and
'zonename' fields so higher-level consumers could aggregate based on the
vnic or zone. Unfortunately the kstat 'name' is the only out-of-the-box
customizable metadata in the kstat framework, which can quickly get
unwieldy.
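
For illustration, here is a rough kernel-side sketch of how a
zone_vfs-style histogram can attach such metadata today using
KSTAT_DATA_STRING. The pkt_hist naming and bucket layout are made up for
the example; kstat_create, kstat_named_init, and kstat_named_setstr are
the existing framework API:

#include <sys/kstat.h>
#include <sys/kmem.h>
#include <sys/systm.h>

typedef struct pkt_hist {
        kstat_named_t ph_bucket_64b;
        kstat_named_t ph_bucket_512b;
        kstat_named_t ph_bucket_1500b;
        kstat_named_t ph_vnic;          /* metadata, not a bucket */
        kstat_named_t ph_zonename;      /* metadata, not a bucket */
} pkt_hist_t;

static kstat_t *
pkt_hist_create(const char *vnic, const char *zonename)
{
        kstat_t *ksp;
        pkt_hist_t *ph;

        ksp = kstat_create("testmodule", 0, "pkt_sizes", "net",
            KSTAT_TYPE_NAMED,
            sizeof (pkt_hist_t) / sizeof (kstat_named_t),
            KSTAT_FLAG_VIRTUAL);
        if (ksp == NULL)
                return (NULL);

        ph = kmem_zalloc(sizeof (*ph), KM_SLEEP);
        ksp->ks_data = ph;

        kstat_named_init(&ph->ph_bucket_64b, "bucket_64b",
            KSTAT_DATA_UINT64);
        kstat_named_init(&ph->ph_bucket_512b, "bucket_512b",
            KSTAT_DATA_UINT64);
        kstat_named_init(&ph->ph_bucket_1500b, "bucket_1500b",
            KSTAT_DATA_UINT64);
        kstat_named_init(&ph->ph_vnic, "vnic", KSTAT_DATA_STRING);
        kstat_named_init(&ph->ph_zonename, "zonename",
            KSTAT_DATA_STRING);

        /* Strings live outside ks_data; account for them in the size. */
        kstat_named_setstr(&ph->ph_vnic, vnic);
        kstat_named_setstr(&ph->ph_zonename, zonename);
        ksp->ks_data_size += strlen(vnic) + 1 + strlen(zonename) + 1;

        kstat_install(ksp);
        return (ksp);   /* teardown (kstat_delete/kmem_free) omitted */
}

The caveats are that the strings must stay allocated for the life of the
kstat, and consumers still have to know which fields are buckets and
which are metadata -- your point exactly.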

> Would you ever want to replace the counters wholesale?
>
I don't think I understand this point. Are you referring to resetting the
entire histogram, or something else?

> Would you ever want to increase the counter by a number other than 1?
>
I've personally never needed to do this, but if we hand the bucket
calculation to the module then it can do whatever it pleases.
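
As a purely hypothetical sketch (none of these names exist in the
framework today), handing both the binning and the increment amount to
the module might look like this:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/kstat.h>

/* Module-supplied function mapping an observed value to a bucket index. */
typedef uint_t (*kstat_hist_binfn_t)(uint64_t value);

/* Example bin function: decade latency buckets starting at 10ms. */
static uint_t
my_latency_bin(uint64_t ns)
{
        uint64_t ms = ns / (NANOSEC / MILLISEC);
        uint64_t bound;
        uint_t bin = 0;

        /* bin 0: <10ms, 1: 10-100ms, 2: 100ms-1s, 3: >=1s */
        for (bound = 10; ms >= bound && bin < 3; bound *= 10)
                bin++;
        return (bin);
}

/* Add n to whichever bucket the module's bin function selects. */
extern void kstat_histogram_add(kstat_t *ksp, kstat_hist_binfn_t binfn,
    uint64_t value, uint64_t n);

That would keep the framework's job to allocating and exposing the
buckets, while the module decides what the buckets mean.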

Thanks. I'll see about a more generic design.
Kody

On Thu, Dec 6, 2018 at 7:57 AM Peter Tribble via openzfs-developer <
developer@lists.open-zfs.org> wrote:

>
>
> On Mon, Dec 3, 2018 at 7:31 PM Kody Kantor <kody.kan...@joyent.com> wrote:
>
>> Hi folks,
>>
>> illumos-joyent includes per-zone VFS kstats that emulate a histogram by
>> aggregating the latency of individual VFS operations into buckets.
>>
>
> Interesting! Thanks for putting this together.
>
>
>> Here's a sample of what that looks like:
>>
>> [root@coke ~]# kstat -m zone_vfs
>> module: zone_vfs                        instance: 0
>> name:   global                          class:    zone_vfs
>>         100ms_ops                       164
>>         10ms_ops                        1734
>>         10s_ops                         0
>>         1s_ops                          23
>>         ...
>>         zonename                        global
>>
>> Using this kstat we can pretty easily see that there were 23 VFS
>> operations issued by the global zone that took at least one second but not
>> more than 10 seconds, for example. This seems like a useful pattern, and
>> one that could be used in many places in the kernel. For those curious,
>> this was implemented as part of OS-338 (
>> https://smartos.org/bugview/OS-338, illumos-joyent commit
>> 278c9a7a00cbf7a53131a08ae5e86c28de7d2c8e).
>>
>> If we'd like to use the zone_vfs module's histogram pattern elsewhere in
>> the kernel we currently have to copy most of the code used by the zone_vfs
>> kstat module. I'd like to add a minimal histogram abstraction to the kstat
>> framework so that other kernel systems can easily create and use histograms
>> to track latency of longer-running operations (on the order of tens to
>> hundreds of milliseconds).
>>
>> Specifically, a kernel component would call a function like `(kstat_t *)
>> kstat_histogram_create(module, instance, name)` which would create a kstat
>> that has the same latency buckets as zone_vfs.
>>
>
> This is quite limiting. Generally:
>
>  - Not all cases will have the same latency buckets
>  - Not all cases will have the same number of buckets
>  - Not all cases will be based on latency
>
> On the last point, I can imagine some other use cases:
>
> 1. Distribution of network packet sizes. (I've seen network cards where
> the hardware provides this information - eg bge.)
>
> 2. Distribution of I/O sizes
>
> 3. And I even wonder whether the kmem_alloc kstats might be better thought
> of
> as a set of per-operation kstats with histograms by size, rather than as
> per-size
> kstats each with a list of operations.
>
> Then the kernel component would call `(void)
>> kstat_histogram_observe(kstat, hrtime)` to increment the relevant latency
>> buckets for the given kstat histogram and duration.
>>
>
> Would you ever want to increase the counter by a number other than 1?
>
> Would you ever want to replace the counters wholesale?
>
> What happens if you were to pass a random kstat (one that isn't a
> histogram) in as the
> first argument?
>
> More generally, why put the binning logic into the kstat framework rather
> than
> devolve it to the individual module?
>
>
>> Here's a simplified example of what I had in mind:
>>
>> kstat_t *mystat = kstat_histogram_create("testmodule", 0,
>> "my_latency_stat");
>> if (mystat != NULL) {
>>     kstat_install(mystat);
>> }
>>
>> hrtime_t start = gethrtime();
>> do_work(); // a longer-running task
>> hrtime_t duration = gethrtime() - start;
>> kstat_histogram_observe(mystat, duration);
>>
>> which could look something like this on the front-end after a few rounds
>> of calling do_work():
>>
>> $ kstat -m testmodule
>> module: testmodule                      instance: 0
>> name:   my_latency_stat                 class:    hist
>>         bucket_10ms                     3
>>         bucket_100ms                    2
>>         bucket_1s                       1
>>         bucket_10s                      0
>>         total_ms                        1210
>>         count                           3
>>         crtime                          138.923698981
>>         snaptime                        144.456348556
>>
>
> Looking at that, the data includes not just the buckets, but also
> ancillary data
> (the total_ms and count fields). As someone who writes tools to consume
> this data, that implies a lot of additional logic, rather than just being
> able to assume
> that every data item is a bucket.
>
> Furthermore, changing units in the names (from ms to s here) is also
> undesirable -
> it would be better if the field names sorted naturally, which is easiest
> if the units
> stay the same.
>
> You've used "hist" as the class here. Generally class would be used to
> select items
> by logical functionality, not data layout.
>
>
>> This would be more abstract than what's currently included in the kstat
>> framework. My intent is that we could make something that's easy to use for
>> the majority of users in the kernel. The above approach would trade
>> flexibility for simplicity. Maybe there's a way to retain more flexibility
>> without adding too much complexity.
>>
>
> I think the fundamental thing here is whether this is just a convenience
> wrapper
> for injecting data into what is otherwise a standard KSTAT_TYPE_NAMED
> kstat,
> or whether there's something more fundamental - possibly even a new kstat
> type,
> that would be useful.
>
>
>> The initial use case I'm considering for this would be to add a few
>> histogram kstats for longer-running ZFS operations, like spa_sync,
>> dmu_tx_delay, and zil_commit. For this use case I think the above design
>> works well, but there are definitely drawbacks to such a simple design.
>>
>> Thoughts on this?
>>
>
> As a final thought, if you were to go with your initial design, then using
> the name
> "histogram" is inappropriate, as what you have is far more specific.
>
> Thanks again,
>
> --
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
