Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Stephen Mallette Wed, 16 Dec 2020 10:36:42 -0800

Some responses inline:

On Fri, Dec 11, 2020 at 12:53 AM js guo <jsguo8...@gmail.com> wrote:


> Thanks for the reply. It is a good idea to provide reducing operations
> through math() step. But from my understanding, we still need different
> reducing steps or at least different seed suppliers and reducing operators
> in the back-end.
>
> gremlin> g.V().values('age').fold().math(local, "stdev(_)")
> ==>0.816
> gremlin> g.inject([1,2,3]).math(local, "product(_)")
> ==>6
>
> One of the advantage of a reducing step is that we do not need to hold the
> whole collection of numbers. Take standard deviation calculation for
> example, Kelvin’s solution requires manipulation of number arrays. With a
> reducing step, we can accumulate value sum, square sum and count during the
> traversal and get a final result with sqrt((E(X)^2 - E(X^2)). The latter
> has a better performance together with potentially lower memory requirement.
>

true - that is a good argument.


> Maybe for math() step, when passing its scope as global, we can replace it
> with a reducing step internally. The main change is how users write queries
> with little change in underlying implementation. This way, we can align
> math functions into one single step, which I think is the right way to go
> considering that there might be more and more analytical functions to be
> supported. BTW, users still need to remember what steps are supported by
> math().
>

That could be what's nagging at me as we have a math() step but then we
have steps that do math and i'm trying to understand why we sometimes sum()
and sometimes math("x+y")...unifying these concepts in some way might make
sense....i just want to be sure we consider the ramifications before we
charge ahead with existing patterns.


> gremlin> g.V().values('age').fold().math(local, “mean(_)”)   // default
> local scope, accepts array
> ==>30.75
> gremlin> g.V().values('age').math(global, “mean(_)”)  // internal
> execution with MeanGlobalStep
> ==>30.75
>
> A further thinking performance wise. In ReducingBarrierStep
> implementation, “projectTraverser” is used to project current traversal
> into single value and a “BinaryOperator” is used to reduce multiple
> single-values into one. For number manipulation, this process involves a
> lot of boxing and unboxing, and also object creation (e.g. creating
> MeanNumber for MeanGlobalStep
> https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59).
>
>
> I have wondered if we can optimize the reducing framework for number
> related steps, like store intermediate values (for MeanGlobalStep, it is
> the “count” and value “sum”) as step instance variables, and the reducing
> operation happens directly in traverser projection?
>

That was a convenient way to code that but I see what you're saying. Some
refactoring there could be worthwhile. Happy to discuss that further if you
feel you'd like to contribute something in that area. Perhaps start another
DISCUSS thread with some more details?

As for your current PR, I still haven't had a chance to put the thought
I've wanted into this change. I will look to do that before the end of the
week. At this point, however, I don't see anyone with any specific
objections to the idea. As a logistical thought, I do think that should it
go forward as is, it should be a change targeted at 3.5.0 though rather
than 3.4.10.


>
> On 2020/12/09 12:24:04, Stephen Mallette <spmalle...@gmail.com> wrote:
> > Thanks for posting. In the math department, I think that these two steps
> > are asked for commonly and I think we have reached a point where the
> things
> > folks are doing with Gremlin are requiring steps of greater specificity
> so
> > this conversation is definitely expected. We currently have two sorts of
> > steps for operating on numbers: reducing steps like sum() and then math()
> > step for expressions. It's interesting what you can accomplish with those
> > two steps - note here how Kelvin manages standard deviation without
> lambdas:
> >
> > g.V().hasLabel('airport').
> >       values('runways').fold().as('runways').
> >       mean(local).as('mean').
> >       select('runways').unfold().
> >       math('(_-mean)^2').mean().math('sqrt(_)')
> >
> > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone
> >
> > In any case, we can see that there is a fair bit of indirection there to
> do
> > the work of a simple stdev() step. I've often wondered if math() could
> > behave both in the way it does now and as a form of reducing step. In
> that
> > way we could quietly add new math functions without forming new steps,
> as I
> > can't help imaging that the addition of stdev() and percentile() will
> then
> > follow with: variance(), covariance(), confidence() and so on.  Kelvin
> > recently asked me about mult() for use cases that he sees from time to
> time.
> >
> > As it stands our math expression library exp4j:
> >
> > https://www.objecthunter.net/exp4j/
> >
> > is good at extensibility but isn't' really formed well out of the box to
> > handle reducing operations because its architecture forces you to specify
> > the number of arguments it will take up front and those arguments must be
> > double:
> >
> > https://www.objecthunter.net/exp4j/#Custom_functions
> >
> > So, that would be an issue to contend with, but technical issues aside
> and
> > focusing instead on the user angle, would math() that worked as follows
> be
> > a good path?
> >
> > gremlin> g.V().values('ages').fold().math(local, "stdev(_)")
> > ==>0.816
> > gremlin> g.inject([1,2,3]).math(local, "product(_)")
> > ==>6
> >
> > And then, what distinction would there be between a math() step and first
> > class "math steps" like sum(), min(), max(), and mean()? in other words,
> > why would those exist if math() could already do it all? What makes a
> math
> > operation "common" enough to beget its own first class representation?
> >
> > Just to be clear, I'm not saying we shouldn't add stdev()/percentile() -
> I
> > just want to consider all the design possibilities and talk them through.
> > Thanks again for bringing up this conversation. I will link this thread
> to
> > your JIRA for reference.
> >
> >
> > On Wed, Dec 9, 2020 at 6:40 AM js guo <jsguo8...@gmail.com> wrote:
> >
> > > Hi team,
> > >
> > > We are using tinkerpop Gremlin in our risk detection cases. Some
> analytical
> > > calculations are used frequently, yet there is no corresponding steps
> in
> > > hand.
> > >
> > > I am thinking that some general analytical steps can be added in
> Gremlin.
> > > e.g. steps to calculate standard deviation and percentile. The example
> > > usage might be as follows.
> > > --------------------------------
> > > gremlin> g.V().values('ages')
> > > ==>1
> > > ==>2
> > > ==>3
> > > gremlin> g.V().values('ages').stdev()
> > > ==>0.816
> > > gremlin> g.V().values('ages').fold().stdev(Scope.local)
> > > ==>0.816
> > >
> > > gremlin> g.V().values('ages').percentile(50)
> > > ==>2
> > > // one percentile, return single value
> > > gremlin> g.V().values('ages').percentile(0, 100)
> > > ==>[0: 1, 100: 3]
> > > // multiple percentiles, return a map
> > > --------------------------------
> > >
> > > Sorry for not emailing earlier, I have created a JIRA ticket for this
> > > https://issues.apache.org/jira/browse/TINKERPOP-2487.
> > >
> > > As new steps are already used in our cases, we are glad to offer the
> > > implementation for review, if you think it good to add the two steps.
> > >
> > > Regards,
> > > Junshi
> > >
> >
>

Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Reply via email to