Thanks for the background John.

I am happy with `Windowed` and `Cumulative`, too.


-Matthias

On 7/17/19 11:22 AM, Ryanne Dolan wrote:
> John, makes sense to me! Thanks.
> 
> Ryanne
> 
> On Wed, Jul 17, 2019, 1:16 PM John Roesler <j...@confluent.io> wrote:
> 
>> Agreed. I think the names are actually not ambiguous once you recall
>> that the stats summarize measurements and each measurement is a
>> floating point number, but there's enough overlap that I also was
>> initially confused as well. I do plan to make this super clear in the
>> documentation.
>>
>> Thanks,
>> -John
>>
>> On Wed, Jul 17, 2019 at 1:08 PM Sophie Blee-Goldman <sop...@confluent.io>
>> wrote:
>>>
>>> Sounds good to me
>>>
>>> By the way, while I agree that we can't really do better than Sum and
>> Count
>>> I will say I also found the distinction a bit unclear at first glance. We
>>> should at least document clearly that "Sum" is a "sum of values" whereas
>>> "Count" is a "number of things" -- but that doesn't need to be part of
>> the
>>> KIP
>>>
>>> On Wed, Jul 17, 2019 at 11:00 AM John Roesler <j...@confluent.io> wrote:
>>>
>>>> Thanks for the replies.
>>>>
>>>> I guess that if we did add (e.g.) ExponentiallyWeightedWindowedX or
>>>> something, it should still be pretty obvious that WindowedX is the
>>>> unweighted version? In that case, I buy the argument that we don't
>>>> need "Simple" and we can just go with:
>>>>
>>>> WindowedSum, WindowedCount
>>>> CumulativeSum, CumulativeCount
>>>>
>>>> Sound good?
>>>> Thanks,
>>>> -John
>>>>
>>>> On Wed, Jul 17, 2019 at 11:53 AM Sophie Blee-Goldman
>>>> <sop...@confluent.io> wrote:
>>>>>
>>>>> Thanks for the crash course in statistical terms :)
>>>>>
>>>>> In light of this I definitely support Cumulative{Sum,Count}, but I'm
>>>> really
>>>>> not crazy about SimpleWindowed{Sum,Count} (vs just Windowed). Not so
>> much
>>>>> because of its unfortunate length (although that is unfortunate it
>>>>> shouldn't be a deciding factor) but because it seems to have the
>>>> potential
>>>>> to confuse further. I'm not sure what we gain by adding "Simple"
>> since to
>>>>> me at least, the unweighted-ness is obvious and the definition of
>> simple
>>>> is
>>>>> not. To those who haven't been exposed to the finer details of
>>>> statistical
>>>>> definitions, I think they are more likely to read "SimpleXX" and
>> wonder
>>>> "is
>>>>> there an 'advanced' or non-simple kind of Windowed?" than they are to
>>>>> wonder what is the weighting behind these metrics.
>>>>>
>>>>> Sophie
>>>>>
>>>>> On Wed, Jul 17, 2019 at 8:18 AM John Roesler <j...@confluent.io>
>> wrote:
>>>>>
>>>>>> Thanks for the discussion, all.
>>>>>>
>>>>>> I've done a little more research into the statistical terminology.
>>>>>> Matthias is correct, "running" and "moving" appear to be synonyms.
>>>>>> Unfortunately, both can be computed either over a window of the
>> last N
>>>>>> measurements or over all prior measurements. "Moving" just
>> signifies
>>>>>> that the statistic is computed over a "live" data set, i.e., a
>>>>>> continuous stream of measurements, and the expectation is that the
>>>>>> stat would be updated in response to new measurements.
>>>>>>
>>>>>> I found https://en.wikipedia.org/wiki/Moving_average to have a
>> pretty
>>>>>> good overview of the whole picture.
>>>>>>
>>>>>> After considering the discussion so far and some light reading, it
>>>>>> seems like "Cumulative" is truly the correct term for the all-time
>>>>>> metrics:
>>>>>>
>>>>>>> In a cumulative moving average, the data arrive in an
>>>>>>> ordered datum stream, and the user would like to get
>>>>>>> the average of all of the data up until the current datum
>>>>>>> point.
>>>>>>
>>>>>> I know that we previously felt that "cumulative" was too much of a
>>>>>> mouthful, but it seems like our quest for a terser term led us
>> into a
>>>>>> briar patch. Also, now there is an independent source (the wiki
>> page)
>>>>>> indicating that this is indeed the correct term, and it doesn't
>> offer
>>>>>> any synonyms to choose from. Maybe we can take comfort in the fact
>>>>>> that we'll rarely be saying the name of the classes out loud.
>>>>>>
>>>>>> As far as moving stats that operate over a window of the last N
>>>>>> measurements, there are multiple options, including Simple
>>>>>> (unweighted), Weighted, and Exponentially Weighted, and presumably
>>>>>> infinite variations with other weighting functions. In our domain,
>>>>>> there is only one weighting function available, but it's still more
>>>>>> self-documenting and future-proof to specify the type of windowed
>>>>>> statistic. Therefore, I'm proposing "Simple" as the term for the
>>>>>> windowed (aka sampled) stats, while keeping Windowed in the name to
>>>>>> distinguish it from the all-time metrics.
>>>>>>
>>>>>>> In financial applications a simple moving average (SMA)
>>>>>>> is the unweighted mean of the previous n data.
>>>>>>
>>>>>> Therefore, we would have the proposed matrix:
>>>>>>
>>>>>> SimpleWindowedSum, SimpleWindowedCount
>>>>>> CumulativeSum, CumulativeCount
>>>>>>
>>>>>> Again, all these proposed names are less pithy than we might wish,
>> but
>>>>>> the whole point of this exercise is to demystify and disambiguate
>>>>>> them. It seems like the discussion so far illustrates the futility
>> of
>>>>>> trying to find names that are both short and descriptive.
>>>>>>
>>>>>> How does that sound?
>>>>>> -John
>>>>>>
>>>>>> On Tue, Jul 16, 2019 at 4:43 PM Matthias J. Sax <
>> matth...@confluent.io
>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> It's a fair point that Ryanne raises. However, "running sum" is
>> the
>>>> same
>>>>>>> as "moving sum" from my understanding.
>>>>>>>
>>>>>>> The issue is still, that `Sum` and `Count` which seem to be the
>>>> cleanest
>>>>>>> names cannot be used. While I agree that `TotalSum` and
>> `TotalCount`
>>>> is
>>>>>>> somewhat redundant, I still think it the best suggestion so far.
>>>>>>>
>>>>>>> For the "sampled" version, I am personally fine with either
>>>> `MovingXxx`,
>>>>>>> `WindowedXxx`, or `RunningXxx` -- to me, that are all equally
>> good to
>>>>>>> describe the semantics.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 7/16/19 2:25 PM, Sophie Blee-Goldman wrote:
>>>>>>>> I'm +1 on Windowed, was about to suggest that as I was
>> catching up
>>>> on
>>>>>> the
>>>>>>>> discussion but Bill beat me to it :)
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2019 at 2:23 PM Bill Bejeck <bbej...@gmail.com
>>>
>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> Thanks for the updates.
>>>>>>>>>
>>>>>>>>> I like RunningCount and RunningSum.
>>>>>>>>>
>>>>>>>>> What about WindowedCount, WindowedSum instead of Moving?
>>>>>>>>> I'm just throwing this out there as Windowed seems more
>> intuitive
>>>> to
>>>>>> me,
>>>>>>>>> but I'm not married to the idea.
>>>>>>>>>
>>>>>>>>> -Bill
>>>>>>>>>
>>>>>>>>> On Tue, Jul 16, 2019 at 5:09 PM John Roesler <
>> j...@confluent.io>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> No worries! Choosing good public API names is a high-impact
>>>> design
>>>>>>>>>> activity.
>>>>>>>>>>
>>>>>>>>>> Matthias, Bruno, Bill, and Stanislav,
>>>>>>>>>>
>>>>>>>>>> You've all contributed to this discussion or the vote so
>> far...
>>>> How
>>>>>> do
>>>>>>>>>> you feel about the proposed name change:
>>>>>>>>>>
>>>>>>>>>> MovingCount, MovingSum   (instead of Sampled)
>>>>>>>>>> RunningCount, RunningSum   (Instead of Total)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2019 at 3:04 PM Ryanne Dolan <
>>>> ryannedo...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> John, that makes sense to me. Sorry for the bikeshedding.
>>>>>>>>>>>
>>>>>>>>>>> Ryanne
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:49 PM John Roesler <
>>>> j...@confluent.io>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the explanation and the suggestion, Ryanne,
>>>>>>>>>>>>
>>>>>>>>>>>> I went with "sampled" just because these are instances of
>>>>>>>>> SampledStat,
>>>>>>>>>>>> which in the Kafka Metrics ecosystem are computed from a
>>>> window of
>>>>>>>>>>>> recent samples. Thinking more about it, the fact that they
>> are
>>>>>>>>> sampled
>>>>>>>>>>>> and the fact that they are windowed are orthogonal, which
>> is
>>>> what
>>>>>>>>>>>> you're pointing out... sampling by itself doesn't indicate
>> that
>>>>>> it's
>>>>>>>>> a
>>>>>>>>>>>> moving average.
>>>>>>>>>>>>
>>>>>>>>>>>> Since there is no way in Kafka Metrics for a metric to be
>>>> sampled
>>>>>> and
>>>>>>>>>>>> not windowed/moving/decaying, calling them Sampled would
>> never
>>>> be
>>>>>>>>>>>> incorrect. But to someone unfamiliar with the code, it
>> wouldn't
>>>>>>>>>>>> immediately suggest the behavior of the metric that
>> actually
>>>>>> matters.
>>>>>>>>>>>> That is, the behavior that distinguishes the two classes of
>>>> metrics
>>>>>>>>> we
>>>>>>>>>>>> want to disambiguate here.
>>>>>>>>>>>>
>>>>>>>>>>>> It sounds like you'd suggest a new matrix of names:
>>>>>>>>>>>> MovingCount, MovingSum
>>>>>>>>>>>> RunningCount, RunningSum
>>>>>>>>>>>>
>>>>>>>>>>>> Are these names unambiguous and self explanatory?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:32 PM Ryanne Dolan <
>>>>>> ryannedo...@gmail.com
>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> measurements, which decay/expire over time
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks John for the clarification. This was my
>> (re-)reading
>>>> of the
>>>>>>>>>> code,
>>>>>>>>>>>>> but this is not what I think of when I hear "sampled". In
>>>> fact,
>>>>>>>>>> you'll
>>>>>>>>>>>>> notice that the Wikipedia pages for "Sample (statistics)"
>> and
>>>>>>>>> "Sample
>>>>>>>>>>>>> (signal processing)" do not contain the words decay,
>> expire,
>>>>>>>>> recent,
>>>>>>>>>>>>> history, or anything similar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similar to "running", I'd suggest the more correct
>> "moving",
>>>> as in
>>>>>>>>>>>> "moving
>>>>>>>>>>>>> average" and "moving sum", which involve looking back N
>>>> samples,
>>>>>>>>>>>> applying a
>>>>>>>>>>>>> sliding window, decaying over time etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryanne
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 16, 2019, 11:58 AM John Roesler <
>>>> j...@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for raising this concern, Ryanne,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Sampled" indicates that the metrics is sampled, namely
>> that
>>>> we
>>>>>>>>>>>>>> maintain a set of samples from recent value measurements,
>>>> which
>>>>>>>>>>>>>> decay/expire over time. So, the metric value is only
>>>>>>>>>> representative of
>>>>>>>>>>>>>> the recent past.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "Total" indicates that the metric value contains all the
>>>>>>>>>> information
>>>>>>>>>>>>>> from the creation of the metric. For example., the total
>> sum
>>>>>>>>> would
>>>>>>>>>>>>>> include all measurements since the app started up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems like your concern is that the word "total"
>> doesn't
>>>>>>>>> really
>>>>>>>>>>>>>> pinpoint this meaning, which is true. It's especially
>>>> confusing
>>>>>>>>>> that
>>>>>>>>>>>>>> another meaning of "total" is synonymous with "sum",
>>>> rendering
>>>>>>>>> the
>>>>>>>>>>>>>> name "TotalSum" sort of absurd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We previously considered "cumulative", which was
>> rejected as
>>>> a
>>>>>>>>>>>>>> mouthful (it's four syllables) .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You mentioned "running", which might be a more
>> appropriate
>>>>>>>>> modifier
>>>>>>>>>>>>>> (RunningSum and RunningCount).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What would everyone think about that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 11:27 AM Ryanne Dolan <
>>>>>>>>>> ryannedo...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> John, I mentioned on the VOTE thread that the proposed
>> names
>>>>>>>>> are
>>>>>>>>>> a
>>>>>>>>>>>> bit
>>>>>>>>>>>>>>> confusing,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> given that "sum", "total", and "count" are roughly
>>>>>>>>>> synonymous...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In particular, TotalSum is, I think, a "running total",
>>>> though
>>>>>>>>>> the
>>>>>>>>>>>> naming
>>>>>>>>>>>>>>> doesn't really capture that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think to avoid confusion, we should define exactly
>> what
>>>>>>>>>> "total" and
>>>>>>>>>>>>>>> "sampled" are supposed to indicate, and perhaps pick
>>>>>>>>> appropriate
>>>>>>>>>>>> naming
>>>>>>>>>>>>>>> from there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ryanne
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 12, 2019 at 1:42 PM John Roesler <
>>>>>>>>> j...@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey, thanks Matthias and Bruno,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree, "Cumulative" is a mouthful. "TotalX" sounds
>> fine
>>>> to
>>>>>>>>>> me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, yes, I would have liked to not have any modifier
>> for
>>>>>>>>>>>>>>>> "non-sampled", but there is a name conflict with Sum.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll update the KIP to reflect "TotalX" and then start
>> the
>>>>>>>>> vote
>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jul 12, 2019 at 11:27 AM Bruno Cadonna <
>>>>>>>>>> br...@confluent.io
>>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> OK, makes sense. Then, I am in favour of TotalCount
>> and
>>>>>>>>>> TotalSum.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Bruno
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 12, 2019 at 12:57 AM Matthias J. Sax <
>>>>>>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> `Sum` is an existing name, for the "sampled sum"
>> metric,
>>>>>>>>>> that
>>>>>>>>>>>> gets
>>>>>>>>>>>>>>>>>> deprecated. Hence, we cannot use it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If we cannot use `Sum` and use `TotalSum`, we should
>> also
>>>>>>>>>> not
>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>> `Count` but `TotalCount` for consistency.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/11/19 12:58 PM, Bruno Cadonna wrote:
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for the KIP.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> LGTM
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also do not like CumulativeSum/Count so much. I
>>>>>>>>>> propose to
>>>>>>>>>>>> just
>>>>>>>>>>>>>>>> call
>>>>>>>>>>>>>>>>>>> it Sum and Count.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I understand that you want to unequivocally
>> distinguish
>>>>>>>>>> the
>>>>>>>>>>>> two
>>>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>>>> functions by their names, but I have the feeling the
>>>>>>>>>> names
>>>>>>>>>>>> become
>>>>>>>>>>>>>>>>>>> artificially complex. The exact semantics can also
>> be
>>>>>>>>>>>> documented
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> the javadocs, which btw could also be improved in
>> those
>>>>>>>>>>>> classes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Bruno
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Jul 11, 2019 at 8:25 PM Matthias J. Sax <
>>>>>>>>>>>>>>>> matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Overall LGTM.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The only though I have is, if we may want to use
>>>>>>>>>> `TotalSum`
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> `TotalCount` instead of `CumulativeSum/Count` as
>>>>>>>>> names?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/11/19 9:31 AM, John Roesler wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Kafka devs,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'd like to propose KIP-488 as a minor cleanup of
>>>>>>>>> some
>>>>>>>>>> of
>>>>>>>>>>>> our
>>>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>>>>>> implementations.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> KIP-488:
>>>>>>>>> https://cwiki.apache.org/confluence/x/kkAyBw
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Over time, iterative updates to these metrics has
>>>>>>>>>> resulted
>>>>>>>>>>>> in a
>>>>>>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>>>>>>> confusing little collection of classes, and I've
>>>>>>>>>> personally
>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>>>>>>>> involved in three separate moderately
>> time-consuming
>>>>>>>>>>>>>> iterations of
>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>> or someone else trying to work out which metrics
>> are
>>>>>>>>>>>>>> available, and
>>>>>>>>>>>>>>>>>>>>> which ones are desired for a given use case. One
>> of
>>>>>>>>>> these
>>>>>>>>>>>> was
>>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>>>>>>>> a long-running bug in Kafka Streams' metrics, so
>> not
>>>>>>>>>> only
>>>>>>>>>>>> has
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>> confusion been a time sink, but it has also led to
>>>>>>>>>> bugs.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm hoping this change won't be too controversial.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to