Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Kelvin Lawrence Mon, 14 Dec 2020 12:03:17 -0800

I quite like the idea of adding some additional reducing barrier steps that 
make it easy to do simple statistical analysis on streams of numbers. Likely 
candidates are `stddev`, `variance` and `product`. It's not easy today to 
calculate the product of a stream in Gremlin using the `math` step. I 
considered suggesting `factorial` also to aid with nPk and nCk type 
calculations but I am not sure about that one.
Cheers,
Kelvin
Kelvin R. Lawrence

    On Thursday, December 10, 2020, 11:53:53 PM CST, js guo 
<[email protected]> wrote:  

 Thanks for the reply. It is a good idea to provide reducing operations through 
the math() step. But as I understand it, we would still need different reducing 
steps, or at least different seed suppliers and reducing operators, in the back-end.

gremlin> g.V().values('age').fold().math(local, "stdev(_)")
==>0.816
gremlin> g.inject([1,2,3]).math(local, "product(_)")
==>6

One advantage of a reducing step is that we do not need to hold the whole 
collection of numbers. Take the standard deviation calculation for example: 
Kelvin’s solution requires manipulating arrays of numbers. With a reducing step, 
we can accumulate the sum, the sum of squares and the count during the traversal 
and get the final result as sqrt(E(X^2) - E(X)^2). The latter performs better 
and potentially needs less memory.
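
As a rough sketch of the accumulation described above (plain Python, not 
TinkerPop code; the class name RunningStdev and its methods are hypothetical), 
a reducer only needs to keep the count, the sum and the sum of squares:

```python
import math


class RunningStdev:
    """Hypothetical accumulator: keeps count, sum and sum of squares only."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # Fold one value into the running totals; the value itself is not stored.
        self.count += 1
        self.total += x
        self.total_sq += x * x
        return self

    def merge(self, other):
        # Combine two partial results, as a reducing operator would.
        self.count += other.count
        self.total += other.total
        self.total_sq += other.total_sq
        return self

    def result(self):
        # Population standard deviation: sqrt(E(X^2) - E(X)^2).
        if self.count == 0:
            raise ValueError("no values were accumulated")
        mean = self.total / self.count
        variance = self.total_sq / self.count - mean * mean
        # Clamp at zero to absorb tiny negative values caused by rounding.
        return math.sqrt(max(variance, 0.0))


acc = RunningStdev()
for x in [1, 2, 3]:
    acc.add(x)
print(round(acc.result(), 3))  # 0.816
```

Note that the sum-of-squares form can lose precision when the values are large 
relative to their spread; Welford's online algorithm is a common alternative 
with the same constant memory footprint.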

Maybe for the math() step, when its scope is passed as global, we can replace it 
with a reducing step internally. The main change is in how users write queries, 
with little change in the underlying implementation. This way, we can consolidate 
math functions into one single step, which I think is the right way to go 
considering that more and more analytical functions might need to be 
supported. BTW, users would still need to remember which functions are supported 
by math(). 

gremlin> g.V().values('age').fold().math(local, "mean(_)")  // default local scope, accepts array
==>30.75
gremlin> g.V().values('age').math(global, "mean(_)")  // internal execution with MeanGlobalStep
==>30.75

A further thought, performance-wise. In the ReducingBarrierStep implementation, 
“projectTraverser” is used to project the current traverser into a single value 
and a “BinaryOperator” is used to reduce multiple single values into one. For number 
manipulation, this process involves a lot of boxing and unboxing, and also 
object creation (e.g. creating MeanNumber for MeanGlobalStep 
https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59).

I have wondered whether we can optimize the reducing framework for number-related 
steps, for example by storing intermediate values (for MeanGlobalStep, the 
“count” and the value “sum”) as step instance variables, so that the reducing 
operation happens directly in the traverser projection?
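
To make the contrast concrete, here is a small Python sketch (not the TinkerPop 
implementation; project, combine and MeanAccumulator are hypothetical names, and 
the sample values are chosen only so that the mean is 30.75). The first form 
mirrors the project-then-reduce shape and allocates a new pair for every value; 
the second keeps the count and sum as instance state and updates them in place.

```python
from functools import reduce


def project(value):
    # Project one incoming value into a (sum, count) pair.
    return (float(value), 1)


def combine(a, b):
    # Binary operator: merge two pairs into a new pair (one allocation per call).
    return (a[0] + b[0], a[1] + b[1])


def mean_by_reduce(values):
    total, count = reduce(combine, map(project, values))
    return total / count


class MeanAccumulator:
    """Hypothetical step-like object holding count and sum as instance state."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def accept(self, value):
        # Reduce directly while consuming the value; no intermediate pair is created.
        self.count += 1
        self.total += value

    def result(self):
        return self.total / self.count


ages = [29, 27, 32, 35]
acc = MeanAccumulator()
for age in ages:
    acc.accept(age)
print(mean_by_reduce(ages), acc.result())  # 30.75 30.75
```

One likely trade-off of the in-place form: state held on the step has to be 
reset between executions, and partial results computed in parallel would still 
need an explicit merge.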

On 2020/12/09 12:24:04, Stephen Mallette <[email protected]> wrote: 
> Thanks for posting. In the math department, I think that these two steps
> are asked for commonly and I think we have reached a point where the things
> folks are doing with Gremlin are requiring steps of greater specificity so
> this conversation is definitely expected. We currently have two sorts of
> steps for operating on numbers: reducing steps like sum() and then math()
> step for expressions. It's interesting what you can accomplish with those
> two steps - note here how Kelvin manages standard deviation without lambdas:
> 
> g.V().hasLabel('airport').
>      values('runways').fold().as('runways').
>      mean(local).as('mean').
>      select('runways').unfold().
>      math('(_-mean)^2').mean().math('sqrt(_)')
> 
> https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone
> 
> In any case, we can see that there is a fair bit of indirection there to do
> the work of a simple stdev() step. I've often wondered if math() could
> behave both in the way it does now and as a form of reducing step. In that
> way we could quietly add new math functions without forming new steps, as I
> can't help imagining that the addition of stdev() and percentile() will then
> follow with: variance(), covariance(), confidence() and so on.  Kelvin
> recently asked me about mult() for use cases that he sees from time to time.
> 
> As it stands our math expression library exp4j:
> 
> https://www.objecthunter.net/exp4j/
> 
> is good at extensibility but isn't really formed well out of the box to
> handle reducing operations because its architecture forces you to specify
> the number of arguments it will take up front and those arguments must be
> double:
> 
> https://www.objecthunter.net/exp4j/#Custom_functions
> 
> So, that would be an issue to contend with, but technical issues aside and
> focusing instead on the user angle, would math() that worked as follows be
> a good path?
> 
> gremlin> g.V().values('ages').fold().math(local, "stdev(_)")
> ==>0.816
> gremlin> g.inject([1,2,3]).math(local, "product(_)")
> ==>6
> 
> And then, what distinction would there be between a math() step and first
> class "math steps" like sum(), min(), max(), and mean()? In other words,
> why would those exist if math() could already do it all? What makes a math
> operation "common" enough to beget its own first class representation?
> 
> Just to be clear, I'm not saying we shouldn't add stdev()/percentile() - I
> just want to consider all the design possibilities and talk them through.
> Thanks again for bringing up this conversation. I will link this thread to
> your JIRA for reference.
> 
> 
> On Wed, Dec 9, 2020 at 6:40 AM js guo <[email protected]> wrote:
> 
> > Hi team,
> >
> > We are using TinkerPop Gremlin in our risk detection cases. Some analytical
> > calculations are used frequently, yet there are no corresponding steps at
> > hand.
> >
> > I am thinking that some general analytical steps can be added in Gremlin.
> > e.g. steps to calculate standard deviation and percentile. The example
> > usage might be as follows.
> > --------------------------------
> > gremlin> g.V().values('ages')
> > ==>1
> > ==>2
> > ==>3
> > gremlin> g.V().values('ages').stdev()
> > ==>0.816
> > gremlin> g.V().values('ages').fold().stdev(Scope.local)
> > ==>0.816
> >
> > gremlin> g.V().values('ages').percentile(50)
> > ==>2
> > // one percentile, return single value
> > gremlin> g.V().values('ages').percentile(0, 100)
> > ==>[0: 1, 100: 3]
> > // multiple percentiles, return a map
> > --------------------------------
> >
> > Sorry for not emailing earlier, I have created a JIRA ticket for this
> > https://issues.apache.org/jira/browse/TINKERPOP-2487.
> >
> > As the new steps are already used in our cases, we are glad to offer the
> > implementation for review, if you think it is good to add the two steps.
> >
> > Regards,
> > Junshi
> >
>
