Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

js guo Tue, 15 Dec 2020 04:13:26 -0800

Thanks. I think nPk and nCk calculations are different from the reducing 
barrier step candidates.


For nPk and nCk type calculations, as long as `factorial` is supported by 
`math` step, I think we can manage it by using `factorial`, *` and `/`. But if 
`factorial` is supported in `math` step, I think it would not take much effort 
to support nPk and nCk as well.

Regards,
Junshi

On 2020/12/14 20:03:01, Kelvin Lawrence <[email protected]> wrote: 
> I quite like the idea of adding some additional reducing barrier steps that 
> make it easy to do simple statistical analysis on streams of numbers. Likely 
> candidates are `stddev`, `variance` and `product`. It's not easy today to 
> calculate the product of a stream in Gremlin using the `math` step. I 
> considered suggesting `factorial` also to aid with nPk and nCk type 
> calculations but I am not sure about that one.
> Cheers,Kelvin
> Kelvin R. Lawrence  
> 
>     On Thursday, December 10, 2020, 11:53:53 PM CST, js guo 
> <[email protected]> wrote:  
>  
>  Thanks for the reply. It is a good idea to provide reducing operations 
> through math() step. But from my understanding, we still need different 
> reducing steps or at least different seed suppliers and reducing operators in 
> the back-end.
> 
> gremlin> g.V().values('age').fold().math(local, "stdev(_)")
> ==>0.816
> gremlin> g.inject([1,2,3]).math(local, "product(_)")
> ==>6
> 
> One of the advantage of a reducing step is that we do not need to hold the 
> whole collection of numbers. Take standard deviation calculation for example, 
> Kelvin’s solution requires manipulation of number arrays. With a reducing 
> step, we can accumulate value sum, square sum and count during the traversal 
> and get a final result with sqrt((E(X)^2 - E(X^2)). The latter has a better 
> performance together with potentially lower memory requirement.
> 
> Maybe for math() step, when passing its scope as global, we can replace it 
> with a reducing step internally. The main change is how users write queries 
> with little change in underlying implementation. This way, we can align math 
> functions into one single step, which I think is the right way to go 
> considering that there might be more and more analytical functions to be 
> supported. BTW, users still need to remember what steps are supported by 
> math(). 
> 
> gremlin> g.V().values('age').fold().math(local, “mean(_)”)  // default local 
> scope, accepts array
> ==>30.75
> gremlin> g.V().values('age').math(global, “mean(_)”)  // internal execution 
> with MeanGlobalStep
> ==>30.75
> 
> A further thinking performance wise. In ReducingBarrierStep implementation, 
> “projectTraverser” is used to project current traversal into single value and 
> a “BinaryOperator” is used to reduce multiple single-values into one. For 
> number manipulation, this process involves a lot of boxing and unboxing, and 
> also object creation (e.g. creating MeanNumber for MeanGlobalStep 
> https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59).
>  
> 
> I have wondered if we can optimize the reducing framework for number related 
> steps, like store intermediate values (for MeanGlobalStep, it is the “count” 
> and value “sum”) as step instance variables, and the reducing operation 
> happens directly in traverser projection?
> 
> On 2020/12/09 12:24:04, Stephen Mallette <[email protected]> wrote: 
> > Thanks for posting. In the math department, I think that these two steps
> > are asked for commonly and I think we have reached a point where the things
> > folks are doing with Gremlin are requiring steps of greater specificity so
> > this conversation is definitely expected. We currently have two sorts of
> > steps for operating on numbers: reducing steps like sum() and then math()
> > step for expressions. It's interesting what you can accomplish with those
> > two steps - note here how Kelvin manages standard deviation without lambdas:
> > 
> > g.V().hasLabel('airport').
> >      values('runways').fold().as('runways').
> >      mean(local).as('mean').
> >      select('runways').unfold().
> >      math('(_-mean)^2').mean().math('sqrt(_)')
> > 
> > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone
> > 
> > In any case, we can see that there is a fair bit of indirection there to do
> > the work of a simple stdev() step. I've often wondered if math() could
> > behave both in the way it does now and as a form of reducing step. In that
> > way we could quietly add new math functions without forming new steps, as I
> > can't help imaging that the addition of stdev() and percentile() will then
> > follow with: variance(), covariance(), confidence() and so on.  Kelvin
> > recently asked me about mult() for use cases that he sees from time to time.
> > 
> > As it stands our math expression library exp4j:
> > 
> > https://www.objecthunter.net/exp4j/
> > 
> > is good at extensibility but isn't' really formed well out of the box to
> > handle reducing operations because its architecture forces you to specify
> > the number of arguments it will take up front and those arguments must be
> > double:
> > 
> > https://www.objecthunter.net/exp4j/#Custom_functions
> > 
> > So, that would be an issue to contend with, but technical issues aside and
> > focusing instead on the user angle, would math() that worked as follows be
> > a good path?
> > 
> > gremlin> g.V().values('ages').fold().math(local, "stdev(_)")
> > ==>0.816
> > gremlin> g.inject([1,2,3]).math(local, "product(_)")
> > ==>6
> > 
> > And then, what distinction would there be between a math() step and first
> > class "math steps" like sum(), min(), max(), and mean()? in other words,
> > why would those exist if math() could already do it all? What makes a math
> > operation "common" enough to beget its own first class representation?
> > 
> > Just to be clear, I'm not saying we shouldn't add stdev()/percentile() - I
> > just want to consider all the design possibilities and talk them through.
> > Thanks again for bringing up this conversation. I will link this thread to
> > your JIRA for reference.
> > 
> > 
> > On Wed, Dec 9, 2020 at 6:40 AM js guo <[email protected]> wrote:
> > 
> > > Hi team,
> > >
> > > We are using tinkerpop Gremlin in our risk detection cases. Some 
> > > analytical
> > > calculations are used frequently, yet there is no corresponding steps in
> > > hand.
> > >
> > > I am thinking that some general analytical steps can be added in Gremlin.
> > > e.g. steps to calculate standard deviation and percentile. The example
> > > usage might be as follows.
> > > --------------------------------
> > > gremlin> g.V().values('ages')
> > > ==>1
> > > ==>2
> > > ==>3
> > > gremlin> g.V().values('ages').stdev()
> > > ==>0.816
> > > gremlin> g.V().values('ages').fold().stdev(Scope.local)
> > > ==>0.816
> > >
> > > gremlin> g.V().values('ages').percentile(50)
> > > ==>2
> > > // one percentile, return single value
> > > gremlin> g.V().values('ages').percentile(0, 100)
> > > ==>[0: 1, 100: 3]
> > > // multiple percentiles, return a map
> > > --------------------------------
> > >
> > > Sorry for not emailing earlier, I have created a JIRA ticket for this
> > > https://issues.apache.org/jira/browse/TINKERPOP-2487.
> > >
> > > As new steps are already used in our cases, we are glad to offer the
> > > implementation for review, if you think it good to add the two steps.
> > >
> > > Regards,
> > > Junshi
> > >
> > 
>

Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Reply via email to