Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Stephen Mallette Wed, 06 Jan 2021 07:05:44 -0800

As there hasn't been any real objection to this direction I've started code
review on the PR: https://github.com/apache/tinkerpop/pull/1375


On Thu, Dec 24, 2020 at 1:25 AM js guo <[email protected]> wrote:

> OK. I have updated the PR to target master branch.
> If you have time to review my code, please note the `PercentileGlobalStep`
> implementation which holds intermediate results as instance variables to
> avoid extra object creation. If this works, I think we can apply similar
> behavior to `MaxGlobalStep`, `MinGlobalStep` and `MeanGlobalStep` by some
> refactoring.
>   List<S> buffer;
>   private final int[] percentiles;
>
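For readers following along, here is a minimal sketch of a step-like object that keeps a value buffer and the requested percentiles as instance state, as the two fields above suggest. It is not the code from the PR: the class name `PercentileBuffer`, its methods, and the choice of the nearest-rank method are all assumptions made for illustration. Nearest-rank happens to reproduce the sample results quoted in this thread for the values 1, 2, 3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is NOT the PercentileGlobalStep code from the PR.
final class PercentileBuffer {
    private final List<Double> buffer = new ArrayList<>();
    private final int[] percentiles;

    PercentileBuffer(final int... percentiles) {
        for (final int p : percentiles) {
            if (p < 0 || p > 100) {
                throw new IllegalArgumentException("percentile must be in [0, 100]: " + p);
            }
        }
        this.percentiles = percentiles.clone();
    }

    // Called once per incoming value; values are only buffered here.
    void accept(final double value) {
        buffer.add(value);
    }

    // Maps each requested percentile to its nearest-rank value (assumed method).
    Map<Integer, Double> result() {
        if (buffer.isEmpty()) {
            throw new IllegalStateException("no values buffered");
        }
        final List<Double> sorted = new ArrayList<>(buffer);
        Collections.sort(sorted);
        final Map<Integer, Double> out = new LinkedHashMap<>();
        for (final int p : percentiles) {
            // rank = ceil(p / 100 * n), clamped to 1 so that p = 0 yields the minimum
            final int rank = Math.max(1, (int) Math.ceil(p / 100.0 * sorted.size()));
            out.put(p, sorted.get(rank - 1));
        }
        return out;
    }
}
```

Unlike a mean or standard deviation, a percentile computed this way still has to hold every value until the end, so the buffer grows with the input.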
> Another thing is having the `math()` step support a full listing of
> possible math operations. Below are my assumptions about usage, in which
> the `math` step is always a local map step and accepts a list of numbers
> as input. This of course needs further discussion.
>
> gremlin> g.inject([1,2,3]).math('max _')
> ==>3
> gremlin> g.inject([1,2,3]).math('min _')
> ==>1
> gremlin> g.inject([1,2,3]).math('mean _')
> ==>2
> gremlin> g.inject([1,2,3]).math('stdev _')
> ==>0.816
> gremlin> g.inject([1,2,3]).math('percentile 50')
> ==>2
> gremlin> g.inject([1,2,3]).math('product _')
> ==>6
>
> On 2020/12/21 12:36:22, Stephen Mallette <[email protected]> wrote:
> > After some more thought on this, I don't think we need to do a scoped
> > math() step, but I think an improvement that allows both individual
> > numbers as well as arrays of numbers as variables would suffice. Then
> > math() could contain a full listing of all the possible math operations
> > that can be executed and from that set we could choose, for performance
> > or usability reasons, to promote that operation to a first-class
> > reducing step. That feels like a reasonably consistent story to guide
> > our choices by, while expanding the quality of the math() step
> > considerably.
> >
> >
> >
> > On Fri, Dec 18, 2020 at 2:29 AM js guo <[email protected]> wrote:
> >
> > > Thanks for the reply. We definitely can have a separate DISCUSS thread
> > > for details of refactoring the reducing of number streams.
> > >
> > > As for my PR, I agree that new steps are better targeted at 3.5.0
> > > than at a minor fix version. I will later apply the changes to the
> > > master branch and raise a new PR.
> > >
> > > Regards,
> > > Junshi
> > >
> > > On 2020/12/16 18:36:21, Stephen Mallette <[email protected]> wrote:
> > > > Some responses inline:
> > > >
> > > > On Fri, Dec 11, 2020 at 12:53 AM js guo <[email protected]> wrote:
> > > >
> > > > > > Thanks for the reply. It is a good idea to provide reducing
> > > > > > operations through the math() step. But from my understanding,
> > > > > > we still need different reducing steps, or at least different
> > > > > > seed suppliers and reducing operators, in the back-end.
> > > > >
> > > > > gremlin> g.V().values('age').fold().math(local, "stdev(_)")
> > > > > ==>0.816
> > > > > gremlin> g.inject([1,2,3]).math(local, "product(_)")
> > > > > ==>6
> > > > >
> > > > > > One advantage of a reducing step is that we do not need to hold
> > > > > > the whole collection of numbers. Take standard deviation
> > > > > > calculation for example: Kelvin’s solution requires manipulation
> > > > > > of number arrays. With a reducing step, we can accumulate the
> > > > > > value sum, square sum and count during the traversal and get a
> > > > > > final result with sqrt(E(X^2) - E(X)^2). The latter has better
> > > > > > performance together with a potentially lower memory requirement.
> > > > >
> > > >
> > > > true - that is a good argument.
> > > >
> > > >
> > > > > > Maybe for the math() step, when passing its scope as global, we
> > > > > > can replace it with a reducing step internally. The main change
> > > > > > is how users write queries, with little change in the underlying
> > > > > > implementation. This way, we can align math functions into one
> > > > > > single step, which I think is the right way to go considering
> > > > > > that there might be more and more analytical functions to be
> > > > > > supported. BTW, users still need to remember what steps are
> > > > > > supported by math().
> > > > >
> > > >
> > > > > That could be what's nagging at me, as we have a math() step but
> > > > > then we have steps that do math, and I'm trying to understand why
> > > > > we sometimes sum() and sometimes math("x+y")...unifying these
> > > > > concepts in some way might make sense....I just want to be sure we
> > > > > consider the ramifications before we charge ahead with existing
> > > > > patterns.
> > > >
> > > >
> > > > > > gremlin> g.V().values('age').fold().math(local, "mean(_)")   // default local scope, accepts array
> > > > > > ==>30.75
> > > > > > gremlin> g.V().values('age').math(global, "mean(_)")  // internal execution with MeanGlobalStep
> > > > > > ==>30.75
> > > > >
> > > > > > A further thought, performance-wise: in the ReducingBarrierStep
> > > > > > implementation, “projectTraverser” is used to project the current
> > > > > > traverser into a single value and a “BinaryOperator” is used to
> > > > > > reduce multiple single values into one. For number manipulation,
> > > > > > this process involves a lot of boxing and unboxing, and also
> > > > > > object creation (e.g. creating MeanNumber for MeanGlobalStep,
> > > > > > https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59
> > > > > > ).
> > > > >
> > > > >
> > > > > > I have wondered if we can optimize the reducing framework for
> > > > > > number-related steps, such as storing intermediate values (for
> > > > > > MeanGlobalStep, the “count” and value “sum”) as step instance
> > > > > > variables, so that the reducing operation happens directly in
> > > > > > traverser projection.
> > > > >
> > > >
> > > > > That was a convenient way to code that but I see what you're
> > > > > saying. Some refactoring there could be worthwhile. Happy to
> > > > > discuss that further if you feel you'd like to contribute something
> > > > > in that area. Perhaps start another DISCUSS thread with some more
> > > > > details?
> > > >
> > > > > As for your current PR, I still haven't had a chance to put the
> > > > > thought I've wanted into this change. I will look to do that before
> > > > > the end of the week. At this point, however, I don't see anyone
> > > > > with any specific objections to the idea. As a logistical thought,
> > > > > I do think that should it go forward as is, it should be a change
> > > > > targeted at 3.5.0 rather than 3.4.10.
> > > >
> > > >
> > > > >
> > > > > > On 2020/12/09 12:24:04, Stephen Mallette <[email protected]> wrote:
> > > > > > > Thanks for posting. In the math department, I think that these
> > > > > > > two steps are asked for commonly and I think we have reached a
> > > > > > > point where the things folks are doing with Gremlin are
> > > > > > > requiring steps of greater specificity, so this conversation is
> > > > > > > definitely expected. We currently have two sorts of steps for
> > > > > > > operating on numbers: reducing steps like sum() and then the
> > > > > > > math() step for expressions. It's interesting what you can
> > > > > > > accomplish with those two steps - note here how Kelvin manages
> > > > > > > standard deviation without lambdas:
> > > > > >
> > > > > > g.V().hasLabel('airport').
> > > > > >       values('runways').fold().as('runways').
> > > > > >       mean(local).as('mean').
> > > > > >       select('runways').unfold().
> > > > > >       math('(_-mean)^2').mean().math('sqrt(_)')
> > > > > >
> > > > > >
> > > > > > > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone
> > > > > >
> > > > > > > In any case, we can see that there is a fair bit of indirection
> > > > > > > there to do the work of a simple stdev() step. I've often
> > > > > > > wondered if math() could behave both in the way it does now and
> > > > > > > as a form of reducing step. In that way we could quietly add
> > > > > > > new math functions without forming new steps, as I can't help
> > > > > > > imagining that the addition of stdev() and percentile() will
> > > > > > > then be followed by: variance(), covariance(), confidence() and
> > > > > > > so on. Kelvin recently asked me about mult() for use cases that
> > > > > > > he sees from time to time.
> > > > > >
> > > > > > > As it stands, our math expression library exp4j:
> > > > > > >
> > > > > > > https://www.objecthunter.net/exp4j/
> > > > > > >
> > > > > > > is good at extensibility but isn't really formed well out of the
> > > > > > > box to handle reducing operations because its architecture
> > > > > > > forces you to specify the number of arguments it will take up
> > > > > > > front and those arguments must be double:
> > > > > >
> > > > > > https://www.objecthunter.net/exp4j/#Custom_functions
> > > > > >
> > > > > > > So, that would be an issue to contend with, but technical
> > > > > > > issues aside and focusing instead on the user angle, would
> > > > > > > math() that worked as follows be a good path?
> > > > > >
> > > > > > gremlin> g.V().values('ages').fold().math(local, "stdev(_)")
> > > > > > ==>0.816
> > > > > > gremlin> g.inject([1,2,3]).math(local, "product(_)")
> > > > > > ==>6
> > > > > >
> > > > > > > And then, what distinction would there be between a math() step
> > > > > > > and first class "math steps" like sum(), min(), max(), and
> > > > > > > mean()? In other words, why would those exist if math() could
> > > > > > > already do it all? What makes a math operation "common" enough
> > > > > > > to beget its own first class representation?
> > > > > >
> > > > > > > Just to be clear, I'm not saying we shouldn't add
> > > > > > > stdev()/percentile() - I just want to consider all the design
> > > > > > > possibilities and talk them through. Thanks again for bringing
> > > > > > > up this conversation. I will link this thread to your JIRA for
> > > > > > > reference.
> > > > > >
> > > > > >
> > > > > > > On Wed, Dec 9, 2020 at 6:40 AM js guo <[email protected]> wrote:
> > > > > >
> > > > > > > Hi team,
> > > > > > >
> > > > > > > > We are using TinkerPop Gremlin in our risk detection cases.
> > > > > > > > Some analytical calculations are used frequently, yet there
> > > > > > > > are no corresponding steps at hand.
> > > > > > > >
> > > > > > > > I am thinking that some general analytical steps can be added
> > > > > > > > to Gremlin, e.g. steps to calculate standard deviation and
> > > > > > > > percentile. Example usage might be as follows.
> > > > > > > --------------------------------
> > > > > > > gremlin> g.V().values('ages')
> > > > > > > ==>1
> > > > > > > ==>2
> > > > > > > ==>3
> > > > > > > gremlin> g.V().values('ages').stdev()
> > > > > > > ==>0.816
> > > > > > > gremlin> g.V().values('ages').fold().stdev(Scope.local)
> > > > > > > ==>0.816
> > > > > > >
> > > > > > > gremlin> g.V().values('ages').percentile(50)
> > > > > > > ==>2
> > > > > > > // one percentile, return single value
> > > > > > > gremlin> g.V().values('ages').percentile(0, 100)
> > > > > > > ==>[0: 1, 100: 3]
> > > > > > > // multiple percentiles, return a map
> > > > > > > --------------------------------
> > > > > > >
> > > > > > > > Sorry for not emailing earlier. I have created a JIRA ticket
> > > > > > > > for this: https://issues.apache.org/jira/browse/TINKERPOP-2487.
> > > > > > > >
> > > > > > > > As the new steps are already used in our cases, we are glad to
> > > > > > > > offer the implementation for review, if you think it good to
> > > > > > > > add the two steps.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Junshi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
