I quite like the idea of adding some additional reducing barrier steps that make it easy to do simple statistical analysis on streams of numbers. Likely candidates are `stddev`, `variance` and `product`. It's not easy today to calculate the product of a stream in Gremlin using the `math` step. I considered suggesting `factorial` also to aid with nPk and nCk type calculations but I am not sure about that one. Cheers,Kelvin Kelvin R. Lawrence
On Thursday, December 10, 2020, 11:53:53 PM CST, js guo <jsguo8...@gmail.com> wrote: Thanks for the reply. It is a good idea to provide reducing operations through math() step. But from my understanding, we still need different reducing steps or at least different seed suppliers and reducing operators in the back-end. gremlin> g.V().values('age').fold().math(local, "stdev(_)") ==>0.816 gremlin> g.inject([1,2,3]).math(local, "product(_)") ==>6 One of the advantage of a reducing step is that we do not need to hold the whole collection of numbers. Take standard deviation calculation for example, Kelvin’s solution requires manipulation of number arrays. With a reducing step, we can accumulate value sum, square sum and count during the traversal and get a final result with sqrt((E(X)^2 - E(X^2)). The latter has a better performance together with potentially lower memory requirement. Maybe for math() step, when passing its scope as global, we can replace it with a reducing step internally. The main change is how users write queries with little change in underlying implementation. This way, we can align math functions into one single step, which I think is the right way to go considering that there might be more and more analytical functions to be supported. BTW, users still need to remember what steps are supported by math(). gremlin> g.V().values('age').fold().math(local, “mean(_)”) // default local scope, accepts array ==>30.75 gremlin> g.V().values('age').math(global, “mean(_)”) // internal execution with MeanGlobalStep ==>30.75 A further thinking performance wise. In ReducingBarrierStep implementation, “projectTraverser” is used to project current traversal into single value and a “BinaryOperator” is used to reduce multiple single-values into one. For number manipulation, this process involves a lot of boxing and unboxing, and also object creation (e.g. creating MeanNumber for MeanGlobalStep https://github.com/apache/tinkerpop/blob/3.4-dev/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/map/MeanGlobalStep.java#L59). I have wondered if we can optimize the reducing framework for number related steps, like store intermediate values (for MeanGlobalStep, it is the “count” and value “sum”) as step instance variables, and the reducing operation happens directly in traverser projection? On 2020/12/09 12:24:04, Stephen Mallette <spmalle...@gmail.com> wrote: > Thanks for posting. In the math department, I think that these two steps > are asked for commonly and I think we have reached a point where the things > folks are doing with Gremlin are requiring steps of greater specificity so > this conversation is definitely expected. We currently have two sorts of > steps for operating on numbers: reducing steps like sum() and then math() > step for expressions. It's interesting what you can accomplish with those > two steps - note here how Kelvin manages standard deviation without lambdas: > > g.V().hasLabel('airport'). > values('runways').fold().as('runways'). > mean(local).as('mean'). > select('runways').unfold(). > math('(_-mean)^2').mean().math('sqrt(_)') > > https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone > > In any case, we can see that there is a fair bit of indirection there to do > the work of a simple stdev() step. I've often wondered if math() could > behave both in the way it does now and as a form of reducing step. In that > way we could quietly add new math functions without forming new steps, as I > can't help imaging that the addition of stdev() and percentile() will then > follow with: variance(), covariance(), confidence() and so on. Kelvin > recently asked me about mult() for use cases that he sees from time to time. > > As it stands our math expression library exp4j: > > https://www.objecthunter.net/exp4j/ > > is good at extensibility but isn't' really formed well out of the box to > handle reducing operations because its architecture forces you to specify > the number of arguments it will take up front and those arguments must be > double: > > https://www.objecthunter.net/exp4j/#Custom_functions > > So, that would be an issue to contend with, but technical issues aside and > focusing instead on the user angle, would math() that worked as follows be > a good path? > > gremlin> g.V().values('ages').fold().math(local, "stdev(_)") > ==>0.816 > gremlin> g.inject([1,2,3]).math(local, "product(_)") > ==>6 > > And then, what distinction would there be between a math() step and first > class "math steps" like sum(), min(), max(), and mean()? in other words, > why would those exist if math() could already do it all? What makes a math > operation "common" enough to beget its own first class representation? > > Just to be clear, I'm not saying we shouldn't add stdev()/percentile() - I > just want to consider all the design possibilities and talk them through. > Thanks again for bringing up this conversation. I will link this thread to > your JIRA for reference. > > > On Wed, Dec 9, 2020 at 6:40 AM js guo <jsguo8...@gmail.com> wrote: > > > Hi team, > > > > We are using tinkerpop Gremlin in our risk detection cases. Some analytical > > calculations are used frequently, yet there is no corresponding steps in > > hand. > > > > I am thinking that some general analytical steps can be added in Gremlin. > > e.g. steps to calculate standard deviation and percentile. The example > > usage might be as follows. > > -------------------------------- > > gremlin> g.V().values('ages') > > ==>1 > > ==>2 > > ==>3 > > gremlin> g.V().values('ages').stdev() > > ==>0.816 > > gremlin> g.V().values('ages').fold().stdev(Scope.local) > > ==>0.816 > > > > gremlin> g.V().values('ages').percentile(50) > > ==>2 > > // one percentile, return single value > > gremlin> g.V().values('ages').percentile(0, 100) > > ==>[0: 1, 100: 3] > > // multiple percentiles, return a map > > -------------------------------- > > > > Sorry for not emailing earlier, I have created a JIRA ticket for this > > https://issues.apache.org/jira/browse/TINKERPOP-2487. > > > > As new steps are already used in our cases, we are glad to offer the > > implementation for review, if you think it good to add the two steps. > > > > Regards, > > Junshi > > >