As there hasn't been any real objection to this direction I've started code review on the PR: https://github.com/apache/tinkerpop/pull/1375
On Thu, Dec 24, 2020 at 1:25 AM js guo <jsguo8...@gmail.com> wrote: > OK. I have updated the PR to target master branch. > If you have time to review my code, please note the `PercentileGlobalStep` > implementation which holds intermediate results as instance variables to > avoid extra object creation. If this works, I think we can apply similar > behavior to `MaxGlobalStep`, `MinGlobalStep` and `MeanGlobalStep` by some > refactoring. > List<S> buffer; > private final int[] percentiles; > > Another thing is about the `math()` step to support full listing of > possible math operations. Below are my assumptions of usage, in which > `math` step is always a local map step and accepts a list of numbers as > input. This of course needs further discussion. > > gremlin> g.inject([1,2,3]).math('max _') > ==>3 > gremlin> g.inject([1,2,3]).math('min _') > ==>1 > gremlin> g.inject([1,2,3]).math('mean _') > ==>2 > gremlin> g.inject([1,2,3]).math('stdev _') > ==>0.816 > gremlin> g.inject([1,2,3]).math('percentile 50') > ==>2 > gremlin> g.inject([1,2,3]).math('product _') > ==>6 > > On 2020/12/21 12:36:22, Stephen Mallette <spmalle...@gmail.com> wrote: > > After some more thought on this, I don't think we need to do a scoped > > math() step, but I think an improvement that allows both individual > numbers > > as well as arrays of numbers as variables would suffice. Then math() > could > > contain a full listing of all the possible math operations that can be > > executed and from that set we could choose, for performance or usability > > reasons, to promote that operation to a first-class reducing step. That > > feels like a reasonably consistent story to guide our choices by, while > > expanding the quality of the math() step considerably. > > > > > > > > On Fri, Dec 18, 2020 at 2:29 AM js guo <jsguo8...@gmail.com> wrote: > > > > > Thanks for the reply. We definitely can have a separate DISCUSS thread > for > > > details of refactoring reducing of number streams. > > > > > > As for my PR, agree that new steps are better targeted in 3.5.0 > versions > > > instead of minor fix versions. I will later apply the changes to master > > > branch and raise a new PR. > > > > > > Regards, > > > Junshi > > > > > > On 2020/12/16 18:36:21, Stephen Mallette <spmalle...@gmail.com> wrote: > > > > Some responses inline: > > > > > > > > On Fri, Dec 11, 2020 at 12:53 AM js guo <jsguo8...@gmail.com> wrote: > > > > > > > > > Thanks for the reply. It is a good idea to provide reducing > operations > > > > > through math() step. But from my understanding, we still need > different > > > > > reducing steps or at least different seed suppliers and reducing > > > operators > > > > > in the back-end. > > > > > > > > > > gremlin> g.V().values('age').fold().math(local, "stdev(_)") > > > > > ==>0.816 > > > > > gremlin> g.inject([1,2,3]).math(local, "product(_)") > > > > > ==>6 > > > > > > > > > > One of the advantage of a reducing step is that we do not need to > hold > > > the > > > > > whole collection of numbers. Take standard deviation calculation > for > > > > > example, Kelvin’s solution requires manipulation of number arrays. > > > With a > > > > > reducing step, we can accumulate value sum, square sum and count > > > during the > > > > > traversal and get a final result with sqrt((E(X)^2 - E(X^2)). The > > > latter > > > > > has a better performance together with potentially lower memory > > > requirement. > > > > > > > > > > > > > true - that is a good argument. > > > > > > > > > > > > > Maybe for math() step, when passing its scope as global, we can > > > replace it > > > > > with a reducing step internally. The main change is how users write > > > queries > > > > > with little change in underlying implementation. This way, we can > align > > > > > math functions into one single step, which I think is the right > way to > > > go > > > > > considering that there might be more and more analytical functions > to > > > be > > > > > supported. BTW, users still need to remember what steps are > supported > > > by > > > > > math(). > > > > > > > > > > > > > That could be what's nagging at me as we have a math() step but then > we > > > > have steps that do math and i'm trying to understand why we sometimes > > > sum() > > > > and sometimes math("x+y")...unifying these concepts in some way might > > > make > > > > sense....i just want to be sure we consider the ramifications before > we > > > > charge ahead with existing patterns. > > > > > > > > > > > > > gremlin> g.V().values('age').fold().math(local, “mean(_)”) // > default > > > > > local scope, accepts array > > > > > ==>30.75 > > > > > gremlin> g.V().values('age').math(global, “mean(_)”) // internal > > > > > execution with MeanGlobalStep > > > > > ==>30.75 > > > > > > > > > > A further thinking performance wise. In ReducingBarrierStep > > > > > implementation, “projectTraverser” is used to project current > traversal > > > > > into single value and a “BinaryOperator” is used to reduce multiple > > > > > single-values into one. For number manipulation, this process > involves > > > a > > > > > lot of boxing and unboxing, and also object creation (e.g. creating > > > > > MeanNumber for MeanGlobalStep > > > > > > > > > https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59 > > > ). > > > > > > > > > > > > > > > I have wondered if we can optimize the reducing framework for > number > > > > > related steps, like store intermediate values (for MeanGlobalStep, > it > > > is > > > > > the “count” and value “sum”) as step instance variables, and the > > > reducing > > > > > operation happens directly in traverser projection? > > > > > > > > > > > > > That was a convenient way to code that but I see what you're saying. > Some > > > > refactoring there could be worthwhile. Happy to discuss that further > if > > > you > > > > feel you'd like to contribute something in that area. Perhaps start > > > another > > > > DISCUSS thread with some more details? > > > > > > > > As for your current PR, I still haven't had a chance to put the > thought > > > > I've wanted into this change. I will look to do that before the end > of > > > the > > > > week. At this point, however, I don't see anyone with any specific > > > > objections to the idea. As a logistical thought, I do think that > should > > > it > > > > go forward as is, it should be a change targeted at 3.5.0 though > rather > > > > than 3.4.10. > > > > > > > > > > > > > > > > > > On 2020/12/09 12:24:04, Stephen Mallette <spmalle...@gmail.com> > wrote: > > > > > > Thanks for posting. In the math department, I think that these > two > > > steps > > > > > > are asked for commonly and I think we have reached a point where > the > > > > > things > > > > > > folks are doing with Gremlin are requiring steps of greater > > > specificity > > > > > so > > > > > > this conversation is definitely expected. We currently have two > > > sorts of > > > > > > steps for operating on numbers: reducing steps like sum() and > then > > > math() > > > > > > step for expressions. It's interesting what you can accomplish > with > > > those > > > > > > two steps - note here how Kelvin manages standard deviation > without > > > > > lambdas: > > > > > > > > > > > > g.V().hasLabel('airport'). > > > > > > values('runways').fold().as('runways'). > > > > > > mean(local).as('mean'). > > > > > > select('runways').unfold(). > > > > > > math('(_-mean)^2').mean().math('sqrt(_)') > > > > > > > > > > > > > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone > > > > > > > > > > > > In any case, we can see that there is a fair bit of indirection > > > there to > > > > > do > > > > > > the work of a simple stdev() step. I've often wondered if math() > > > could > > > > > > behave both in the way it does now and as a form of reducing > step. In > > > > > that > > > > > > way we could quietly add new math functions without forming new > > > steps, > > > > > as I > > > > > > can't help imaging that the addition of stdev() and percentile() > will > > > > > then > > > > > > follow with: variance(), covariance(), confidence() and so on. > > > Kelvin > > > > > > recently asked me about mult() for use cases that he sees from > time > > > to > > > > > time. > > > > > > > > > > > > As it stands our math expression library exp4j: > > > > > > > > > > > > https://www.objecthunter.net/exp4j/ > > > > > > > > > > > > is good at extensibility but isn't' really formed well out of the > > > box to > > > > > > handle reducing operations because its architecture forces you to > > > specify > > > > > > the number of arguments it will take up front and those arguments > > > must be > > > > > > double: > > > > > > > > > > > > https://www.objecthunter.net/exp4j/#Custom_functions > > > > > > > > > > > > So, that would be an issue to contend with, but technical issues > > > aside > > > > > and > > > > > > focusing instead on the user angle, would math() that worked as > > > follows > > > > > be > > > > > > a good path? > > > > > > > > > > > > gremlin> g.V().values('ages').fold().math(local, "stdev(_)") > > > > > > ==>0.816 > > > > > > gremlin> g.inject([1,2,3]).math(local, "product(_)") > > > > > > ==>6 > > > > > > > > > > > > And then, what distinction would there be between a math() step > and > > > first > > > > > > class "math steps" like sum(), min(), max(), and mean()? in other > > > words, > > > > > > why would those exist if math() could already do it all? What > makes a > > > > > math > > > > > > operation "common" enough to beget its own first class > > > representation? > > > > > > > > > > > > Just to be clear, I'm not saying we shouldn't add > > > stdev()/percentile() - > > > > > I > > > > > > just want to consider all the design possibilities and talk them > > > through. > > > > > > Thanks again for bringing up this conversation. I will link this > > > thread > > > > > to > > > > > > your JIRA for reference. > > > > > > > > > > > > > > > > > > On Wed, Dec 9, 2020 at 6:40 AM js guo <jsguo8...@gmail.com> > wrote: > > > > > > > > > > > > > Hi team, > > > > > > > > > > > > > > We are using tinkerpop Gremlin in our risk detection cases. > Some > > > > > analytical > > > > > > > calculations are used frequently, yet there is no corresponding > > > steps > > > > > in > > > > > > > hand. > > > > > > > > > > > > > > I am thinking that some general analytical steps can be added > in > > > > > Gremlin. > > > > > > > e.g. steps to calculate standard deviation and percentile. The > > > example > > > > > > > usage might be as follows. > > > > > > > -------------------------------- > > > > > > > gremlin> g.V().values('ages') > > > > > > > ==>1 > > > > > > > ==>2 > > > > > > > ==>3 > > > > > > > gremlin> g.V().values('ages').stdev() > > > > > > > ==>0.816 > > > > > > > gremlin> g.V().values('ages').fold().stdev(Scope.local) > > > > > > > ==>0.816 > > > > > > > > > > > > > > gremlin> g.V().values('ages').percentile(50) > > > > > > > ==>2 > > > > > > > // one percentile, return single value > > > > > > > gremlin> g.V().values('ages').percentile(0, 100) > > > > > > > ==>[0: 1, 100: 3] > > > > > > > // multiple percentiles, return a map > > > > > > > -------------------------------- > > > > > > > > > > > > > > Sorry for not emailing earlier, I have created a JIRA ticket > for > > > this > > > > > > > https://issues.apache.org/jira/browse/TINKERPOP-2487. > > > > > > > > > > > > > > As new steps are already used in our cases, we are glad to > offer > > > the > > > > > > > implementation for review, if you think it good to add the two > > > steps. > > > > > > > > > > > > > > Regards, > > > > > > > Junshi > > > > > > > > > > > > > > > > > > > > > > > > > > > >