Hello, I was hoping someone could clarify the behavior of metrics sampling
and its impact on the execute latency calculation.

In this specific case, let's assume we are using the default sampling rate of
0.05, so that 5 out of every 100 tuples are used for calculating execute
latency.
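
For context, this is how I picture a 0.05 sampler behaving; a rough sketch of
my own, not necessarily the exact mechanism Storm uses:

import java.util.Random;
import java.util.function.BooleanSupplier;

// Rough sketch of a rate-based sampler (my own illustration): pick one tuple
// at random out of every block of 1/rate tuples, i.e. one per 20 at rate 0.05.
public class SamplerSketch {
    static BooleanSupplier mkSampler(double rate) {
        final int freq = (int) Math.round(1.0 / rate);  // 20 for rate 0.05
        final Random random = new Random();
        return new BooleanSupplier() {
            private int i = -1;
            private int target = random.nextInt(freq);

            @Override
            public boolean getAsBoolean() {
                if (++i >= freq) {                      // start a new block of 'freq' tuples
                    i = 0;
                    target = random.nextInt(freq);
                }
                return i == target;                     // exactly one sample per block
            }
        };
    }

    public static void main(String[] args) {
        BooleanSupplier sampler = mkSampler(0.05);
        int sampled = 0;
        for (int i = 0; i < 100; i++) {
            if (sampler.getAsBoolean()) {
                sampled++;
            }
        }
        System.out.println(sampled + " of 100 tuples sampled");  // 5 of 100
    }
}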

In BoltExecutor.java, lines 105-135:

https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/executor/bolt/BoltExecutor.java#L134

The code determines whether the tuple should be included in the sample and,
if so, calculates the delta in milliseconds that it took to call the bolt's
execute method. If the tuple was not included in the sample, the delta is
left at 0.
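
To make sure I'm reading the shape of it right, here's a toy version of the
pattern (names and structure are mine, not the actual BoltExecutor code):

// Toy version of the timing pattern as I read it: delta only reflects elapsed
// time for sampled tuples; for everything else it is left at 0.
public class DeltaSketch {
    public static void main(String[] args) throws InterruptedException {
        boolean isSampled = false;                     // pretend this tuple missed the sample

        long startMs = isSampled ? System.currentTimeMillis() : 0;
        Thread.sleep(5);                               // stand-in for bolt.execute(tuple)
        long delta = 0;
        if (isSampled) {
            delta = System.currentTimeMillis() - startMs;
        }

        // The stats update is guarded by delta >= 0, which also holds for the
        // unsampled case above, so the record(...) style bookkeeping still runs.
        if (delta >= 0) {
            System.out.println("would record delta = " + delta + " ms");
        }
    }
}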

However, later in the code a call is made to ((BoltExecutorStats)
stats).boltExecuteTuple(...)
in all cases, because the guard is delta >= 0. A few layers down, a call to
record() is made:
https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/metric/internal/LatencyStatAndMetric.java#L124

which increments the count of samples in the current bucket. Eventually an
average is calculated as the total latency over the interval divided by the
count of samples in the bucket.
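
My mental model of that bookkeeping, greatly simplified (field and method
names are mine, not the actual LatencyStatAndMetric internals):

// Greatly simplified model of the bucket bookkeeping.
public class BucketSketch {
    private long latencySumMs = 0;   // sum of all deltas recorded in the current bucket
    private long count = 0;          // number of record() calls in the current bucket

    public synchronized void record(long latencyMs) {
        latencySumMs += latencyMs;   // unsampled tuples add 0 here...
        count++;                     // ...but still increment the count
    }

    public synchronized double average() {
        return count == 0 ? 0.0 : (double) latencySumMs / count;
    }
}

If that model is right, tuples that were never timed still push the count up,
which is what the rest of this question hinges on.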

I'm wondering: does this lead to a miscalculation?

Specifically, I would expect that if we are sampling 5 tuples per 100 and we
only get 100 tuples per minute (or per calculation interval), the average
would be

total latency / 5

In this case it seems to be

total latency / 100

since the sample count is incremented even when delta == 0. I could be
missing something obvious, as I'm far from an expert in this code.
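
To put made-up numbers on it (purely hypothetical values, just to show the
two ways of averaging):

// Made-up numbers for the 100-tuples-per-interval case above.
public class AverageSketch {
    public static void main(String[] args) {
        long totalSampledLatencyMs = 25;   // 5 sampled tuples at ~5 ms each
        int sampledCount = 5;              // tuples that were actually timed
        int recordCalls = 100;             // every tuple calls record(), 95 of them with delta == 0

        System.out.println("expected: " + (double) totalSampledLatencyMs / sampledCount + " ms");  // 5.0
        System.out.println("observed: " + (double) totalSampledLatencyMs / recordCalls + " ms");   // 0.25
    }
}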

Also, the basic granularity of measurement in the code is milliseconds. For
systems that process tuples in sub-millisecond time, is there a way to make
this more granular, microseconds at least?
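
To frame what I mean by more granular, this is the kind of measurement I had
in mind, just an illustration using System.nanoTime(); I'm not aware of an
existing option for this, which is why I'm asking:

import java.util.concurrent.TimeUnit;

// Illustration only: timing with System.nanoTime() and reporting microseconds.
public class MicrosecondTimingSketch {
    public static void main(String[] args) {
        long startNanos = System.nanoTime();
        busyWork();                                    // stand-in for a sub-millisecond bolt.execute(tuple)
        long elapsedMicros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - startNanos);
        System.out.println("execute latency = " + elapsedMicros + " us");
    }

    // Burns a little CPU so there is something sub-millisecond to measure.
    private static void busyWork() {
        double x = 0;
        for (int i = 0; i < 10_000; i++) {
            x += Math.sqrt(i);
        }
        if (x == 42) {                                 // never true; just keeps the loop live
            System.out.println(x);
        }
    }
}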

Thanks.
