Hello, I was hoping someone could clarify the behavior of metrics sampling and its impact on the execute latency calculation.
In this specific case, assume we are using the default sampling rate of 0.05, so that 5 out of every 100 tuples are used for calculating execute latency.

In BoltExecutor.java, lines 105-135 (https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/executor/bolt/BoltExecutor.java#L134), the code determines whether the tuple should be included in the sample and, if so, calculates the delta in milliseconds that the call to the bolt's execute method took. If the tuple was not included in the sample, the delta is set to 0. Later in the code, however, ((BoltExecutorStats) stats).boltExecuteTuple(...) is called in all cases, since the check is delta >= 0. Some layers down, record() is called (https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/metric/internal/LatencyStatAndMetric.java#L124), which increments the count of samples in the current bucket. Eventually an average is calculated as the total latency over the interval divided by the count of samples in the bucket.

Does this lead to a miscalculation? Specifically, if we are sampling 5 tuples per 100 and we only receive 100 tuples per minute (or per calculation interval), I would expect the average to be total latency / 5, but it appears to be total latency / 100, since the sample count is incremented even when delta == 0. I could be missing something obvious, as I'm far from an expert in this code.

Also, the basic granularity of measurement in the code is milliseconds. For systems that process tuples in sub-millisecond time, is there a way to make the measurement more granular, at least to microseconds?

Thanks.
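P.S. To make the arithmetic I'm worried about concrete, here is a rough, self-contained sketch (hypothetical numbers and variable names, not the actual Storm code) of how counting unsampled tuples in the bucket would dilute the reported average:

```java
// Sketch of the suspected behavior only; assumes 100 tuples per interval,
// each actually taking 10 ms in execute(), with the 0.05 sampling rate
// modeled as "every 20th tuple is sampled".
public class LatencySamplingSketch {
    public static void main(String[] args) {
        long actualLatencyMs = 10;   // hypothetical true per-tuple execute time

        long totalLatencyMs = 0;
        long bucketCount = 0;        // incremented for EVERY tuple (delta >= 0)
        long sampledCount = 0;       // incremented only for sampled tuples

        for (int i = 0; i < 100; i++) {
            boolean sampled = (i % 20 == 0);               // ~5 of 100 tuples
            long delta = sampled ? actualLatencyMs : 0;    // unsampled => delta == 0

            // Mirrors the "delta >= 0" path: the stats are updated for all tuples.
            totalLatencyMs += delta;
            bucketCount++;

            if (sampled) {
                sampledCount++;
            }
        }

        System.out.println("avg over all tuples   = " + (double) totalLatencyMs / bucketCount);   // 0.5 ms
        System.out.println("avg over sampled only = " + (double) totalLatencyMs / sampledCount);  // 10.0 ms
    }
}
```

If my reading is right, the reported execute latency in this scenario would come out roughly 20x lower than the true per-tuple latency.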
