Hey Peter,

Thanks for taking a look! You've pointed out a number of things I've been considering.

I initially pitched this version as exactly what you described: a convenience wrapper around KSTAT_TYPE_NAMED kstats, simply to reduce the amount of code copied around the kernel to implement such a thing. Your criticisms are accurate. The initial design wasn't intended to be useful for anything other than tracking the duration of longer-running kernel operations (on the order of hundreds to thousands of milliseconds).

A more flexible, generic way to create kstat histograms would be fantastic. I will happily consider a broader approach, which would involve more changes to the kstat framework and, obviously, more work. It's exciting that you could so easily point out the value of having such a thing.

I have a couple of comments on your suggestions.

> Looking at that, the data includes not just the buckets, but also
> ancillary data (the total_ms and count fields). As someone who writes
> tools to consume this data, that implies a lot of additional logic,
> rather than just being able to assume that every data item is a bucket.

This is definitely true. However, tracking the number of operations and the sum of their values makes deducing a precise average trivial (total_ms / count). Without these two fields it would not be possible to compute a precise average of the observed values: the 'larger' the buckets get (1000ms, 10000ms), the more quickly precision is lost. I view fields like these as a way of compensating for the lack of precision inherent in a bucket design. The count field is almost unnecessary, except for cases where observed values fall below the range of the smallest bucket (less than 10ms in the example).
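To make that concrete, the consumer-side math is just a guarded division. A minimal sketch, using the field names from the example output quoted below:

/*
 * Recover the precise average from the two ancillary fields. A zero
 * count simply means no operations have been observed yet.
 */
static uint64_t
avg_latency_ms(uint64_t total_ms, uint64_t count)
{
        return (count == 0 ? 0 : total_ms / count);
}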
In a more complex kstat histogram design it might also be useful for callers to store arbitrary metadata in the kstat. The zone_vfs example I provided does this with the 'zonename' field. If we were tracking a distribution of packet sizes, we might want to include 'vnic' and 'zonename' fields so higher-level consumers could aggregate by vnic or zone. Unfortunately the kstat 'name' is the only out-of-the-box customizable metadata in the kstat framework, and overloading it quickly gets unwieldy.
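For reference, here's a rough sketch of how that metadata rides along today: it's just another named entry in the KSTAT_TYPE_NAMED payload. The 'vnic' variant is hypothetical but would work the same way.

/*
 * Sketch: attach string metadata to a named kstat, as zone_vfs does
 * with 'zonename'. A hypothetical packet-size histogram could carry a
 * 'vnic' field in the same fashion.
 */
kstat_named_t *kn = KSTAT_NAMED_PTR(ksp);       /* ksp: the installed kstat */
kstat_named_init(&kn[0], "zonename", KSTAT_DATA_STRING);
kstat_named_setstr(&kn[0], zone->zone_name);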
> Would you ever want to replace the counters wholesale?

I don't think I understand this point. Are you referring to resetting the entire histogram, or something else?

> Would you ever want to increase the counter by a number other than 1?

I've personally never needed to do this, but if we hand the bucket calculation to the module then it can do whatever it pleases, as sketched below.
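If we went that route, the framework's interface might shrink to a bare increment primitive. Every name in this sketch is invented purely for illustration:

/*
 * Hypothetical sketch with binning devolved to the module: the module
 * picks both the bucket and the amount to add. 'duration' is whatever
 * elapsed time the module measured itself.
 */
int bucket = my_bucket_for(duration);           /* module-defined binning */
kstat_histogram_add(mystat, bucket, nops);      /* bump by nops, not just 1 */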
Thanks. I'll see about a more generic design.

Kody

On Thu, Dec 6, 2018 at 7:57 AM Peter Tribble via openzfs-developer
<developer@lists.open-zfs.org> wrote:

> On Mon, Dec 3, 2018 at 7:31 PM Kody Kantor <kody.kan...@joyent.com> wrote:
>
>> Hi folks,
>>
>> illumos-joyent includes per-zone VFS kstats that emulate a histogram by
>> aggregating the latency of individual VFS operations into buckets.
>
> Interesting! Thanks for putting this together.
>
>> Here's a sample of what that looks like:
>>
>> [root@coke ~]# kstat -m zone_vfs
>> module: zone_vfs                        instance: 0
>> name:   global                          class:    zone_vfs
>>         100ms_ops                       164
>>         10ms_ops                        1734
>>         10s_ops                         0
>>         1s_ops                          23
>>         ...
>>         zonename                        global
>>
>> Using this kstat we can pretty easily see that there were 23 VFS
>> operations issued by the global zone that took at least one second but
>> not more than 10 seconds, for example. This seems like a useful pattern,
>> and one that could be used in many places in the kernel. For those
>> curious, this was implemented as part of OS-338
>> (https://smartos.org/bugview/OS-338, illumos-joyent commit
>> 278c9a7a00cbf7a53131a08ae5e86c28de7d2c8e).
>>
>> If we'd like to use the zone_vfs module's histogram pattern elsewhere in
>> the kernel, we currently have to copy most of the code used by the
>> zone_vfs kstat module. I'd like to add a minimal histogram abstraction
>> to the kstat framework so that other kernel systems can easily create
>> and use histograms to track the latency of longer-running operations
>> (on the order of tens to hundreds of milliseconds).
>>
>> Specifically, a kernel component would call a function like
>> `(kstat_t *) kstat_histogram_create(module, instance, name)`, which
>> would create a kstat that has the same latency buckets as zone_vfs.
>
> This is quite limiting. Generally:
>
> - Not all cases will have the same latency buckets
> - Not all cases will have the same number of buckets
> - Not all cases will be based on latency
>
> On the last point, I can imagine some other use cases:
>
> 1. Distribution of network packet sizes. (I've seen network cards where
> the hardware provides this information - eg bge.)
>
> 2. Distribution of I/O sizes
>
> 3. And I even wonder whether the kmem_alloc kstats might be better
> thought of as a set of per-operation kstats with histograms by size,
> rather than as per-size kstats each with a list of operations.
>
>> Then the kernel component would call
>> `(void) kstat_histogram_observe(kstat, hrtime)` to increment the
>> relevant latency buckets for the given kstat histogram and duration.
>
> Would you ever want to increase the counter by a number other than 1?
>
> Would you ever want to replace the counters wholesale?
>
> What happens if you were to pass a random kstat (one that isn't a
> histogram) in as the first argument?
>
> More generally, why put the binning logic into the kstat framework
> rather than devolve it to the individual module?
>
>> Here's a simplified example of what I had in mind:
>>
>> kstat_t *mystat = kstat_histogram_create("testmodule", 0,
>>     "my_latency_stat");
>> if (mystat != NULL) {
>>         kstat_install(mystat);
>> }
>>
>> hrtime_t start = gethrtime();
>> do_work(); /* a longer-running task */
>> hrtime_t duration = gethrtime() - start;
>> kstat_histogram_observe(mystat, duration);
>>
>> which could look something like this on the front-end after a few
>> rounds of calling do_work():
>>
>> $ kstat -m testmodule
>> module: testmodule                      instance: 0
>> name:   my_latency_stat                 class:    hist
>>         bucket_10ms                     3
>>         bucket_100ms                    2
>>         bucket_1s                       1
>>         bucket_10s                      0
>>         total_ms                        1210
>>         count                           3
>>         crtime                          138.923698981
>>         snaptime                        144.456348556
>
> Looking at that, the data includes not just the buckets, but also
> ancillary data (the total_ms and count fields). As someone who writes
> tools to consume this data, that implies a lot of additional logic,
> rather than just being able to assume that every data item is a bucket.
>
> Furthermore, changing units in the names (from ms to s here) is also
> undesirable - it would be better if the field names sorted naturally,
> which is easiest if the units stay the same.
>
> You've used "hist" as the class here. Generally class would be used to
> select items by logical functionality, not data layout.
>
>> This would be more abstract than what's currently included in the kstat
>> framework. My intent is that we could make something that's easy to use
>> for the majority of users in the kernel. The above approach would trade
>> flexibility for simplicity. Maybe there's a way to retain more
>> flexibility without adding too much complexity.
>
> I think the fundamental thing here is whether this is just a convenience
> wrapper for injecting data into what is otherwise a standard
> KSTAT_TYPE_NAMED kstat, or whether there's something more fundamental -
> possibly even a new kstat type - that would be useful.
>
>> The initial use case I'm considering for this would be to add a few
>> histogram kstats for longer-running ZFS operations, like spa_sync,
>> dmu_tx_delay, and zil_commit. For this use case I think the above design
>> works well, but there are definitely drawbacks to such a simple design.
>>
>> Thoughts on this?
>
> As a final thought, if you were to go with your initial design, then
> using the name "histogram" is inappropriate, as what you have is far
> more specific.
>
> Thanks again,
>
> --
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/