Thanks. I think nPk and nCk calculations are different from the reducing barrier step candidates.
For nPk and nCk type calculations, as long as `factorial` is supported by `math` step, I think we can manage it by using `factorial`, *` and `/`. But if `factorial` is supported in `math` step, I think it would not take much effort to support nPk and nCk as well. Regards, Junshi On 2020/12/14 20:03:01, Kelvin Lawrence <gfx...@yahoo.com.INVALID> wrote: > I quite like the idea of adding some additional reducing barrier steps that > make it easy to do simple statistical analysis on streams of numbers. Likely > candidates are `stddev`, `variance` and `product`. It's not easy today to > calculate the product of a stream in Gremlin using the `math` step. I > considered suggesting `factorial` also to aid with nPk and nCk type > calculations but I am not sure about that one. > Cheers,Kelvin > Kelvin R. Lawrence > > On Thursday, December 10, 2020, 11:53:53 PM CST, js guo > <jsguo8...@gmail.com> wrote: > > Thanks for the reply. It is a good idea to provide reducing operations > through math() step. But from my understanding, we still need different > reducing steps or at least different seed suppliers and reducing operators in > the back-end. > > gremlin> g.V().values('age').fold().math(local, "stdev(_)") > ==>0.816 > gremlin> g.inject([1,2,3]).math(local, "product(_)") > ==>6 > > One of the advantage of a reducing step is that we do not need to hold the > whole collection of numbers. Take standard deviation calculation for example, > Kelvin’s solution requires manipulation of number arrays. With a reducing > step, we can accumulate value sum, square sum and count during the traversal > and get a final result with sqrt((E(X)^2 - E(X^2)). The latter has a better > performance together with potentially lower memory requirement. > > Maybe for math() step, when passing its scope as global, we can replace it > with a reducing step internally. The main change is how users write queries > with little change in underlying implementation. This way, we can align math > functions into one single step, which I think is the right way to go > considering that there might be more and more analytical functions to be > supported. BTW, users still need to remember what steps are supported by > math(). > > gremlin> g.V().values('age').fold().math(local, “mean(_)”) // default local > scope, accepts array > ==>30.75 > gremlin> g.V().values('age').math(global, “mean(_)”) // internal execution > with MeanGlobalStep > ==>30.75 > > A further thinking performance wise. In ReducingBarrierStep implementation, > “projectTraverser” is used to project current traversal into single value and > a “BinaryOperator” is used to reduce multiple single-values into one. For > number manipulation, this process involves a lot of boxing and unboxing, and > also object creation (e.g. creating MeanNumber for MeanGlobalStep > https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59). > > > I have wondered if we can optimize the reducing framework for number related > steps, like store intermediate values (for MeanGlobalStep, it is the “count” > and value “sum”) as step instance variables, and the reducing operation > happens directly in traverser projection? > > On 2020/12/09 12:24:04, Stephen Mallette <spmalle...@gmail.com> wrote: > > Thanks for posting. In the math department, I think that these two steps > > are asked for commonly and I think we have reached a point where the things > > folks are doing with Gremlin are requiring steps of greater specificity so > > this conversation is definitely expected. We currently have two sorts of > > steps for operating on numbers: reducing steps like sum() and then math() > > step for expressions. It's interesting what you can accomplish with those > > two steps - note here how Kelvin manages standard deviation without lambdas: > > > > g.V().hasLabel('airport'). > > values('runways').fold().as('runways'). > > mean(local).as('mean'). > > select('runways').unfold(). > > math('(_-mean)^2').mean().math('sqrt(_)') > > > > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone > > > > In any case, we can see that there is a fair bit of indirection there to do > > the work of a simple stdev() step. I've often wondered if math() could > > behave both in the way it does now and as a form of reducing step. In that > > way we could quietly add new math functions without forming new steps, as I > > can't help imaging that the addition of stdev() and percentile() will then > > follow with: variance(), covariance(), confidence() and so on. Kelvin > > recently asked me about mult() for use cases that he sees from time to time. > > > > As it stands our math expression library exp4j: > > > > https://www.objecthunter.net/exp4j/ > > > > is good at extensibility but isn't' really formed well out of the box to > > handle reducing operations because its architecture forces you to specify > > the number of arguments it will take up front and those arguments must be > > double: > > > > https://www.objecthunter.net/exp4j/#Custom_functions > > > > So, that would be an issue to contend with, but technical issues aside and > > focusing instead on the user angle, would math() that worked as follows be > > a good path? > > > > gremlin> g.V().values('ages').fold().math(local, "stdev(_)") > > ==>0.816 > > gremlin> g.inject([1,2,3]).math(local, "product(_)") > > ==>6 > > > > And then, what distinction would there be between a math() step and first > > class "math steps" like sum(), min(), max(), and mean()? in other words, > > why would those exist if math() could already do it all? What makes a math > > operation "common" enough to beget its own first class representation? > > > > Just to be clear, I'm not saying we shouldn't add stdev()/percentile() - I > > just want to consider all the design possibilities and talk them through. > > Thanks again for bringing up this conversation. I will link this thread to > > your JIRA for reference. > > > > > > On Wed, Dec 9, 2020 at 6:40 AM js guo <jsguo8...@gmail.com> wrote: > > > > > Hi team, > > > > > > We are using tinkerpop Gremlin in our risk detection cases. Some > > > analytical > > > calculations are used frequently, yet there is no corresponding steps in > > > hand. > > > > > > I am thinking that some general analytical steps can be added in Gremlin. > > > e.g. steps to calculate standard deviation and percentile. The example > > > usage might be as follows. > > > -------------------------------- > > > gremlin> g.V().values('ages') > > > ==>1 > > > ==>2 > > > ==>3 > > > gremlin> g.V().values('ages').stdev() > > > ==>0.816 > > > gremlin> g.V().values('ages').fold().stdev(Scope.local) > > > ==>0.816 > > > > > > gremlin> g.V().values('ages').percentile(50) > > > ==>2 > > > // one percentile, return single value > > > gremlin> g.V().values('ages').percentile(0, 100) > > > ==>[0: 1, 100: 3] > > > // multiple percentiles, return a map > > > -------------------------------- > > > > > > Sorry for not emailing earlier, I have created a JIRA ticket for this > > > https://issues.apache.org/jira/browse/TINKERPOP-2487. > > > > > > As new steps are already used in our cases, we are glad to offer the > > > implementation for review, if you think it good to add the two steps. > > > > > > Regards, > > > Junshi > > > > > >