Re: [New Step Discussion] Add Steps to Support Basic Distribution Analysis (e.g. Standard Deviation and Percentile)

Stephen Mallette Wed, 09 Dec 2020 04:24:39 -0800

Thanks for posting. On the math side, these two steps are commonly asked
for, and I think we have reached a point where the things folks are doing
with Gremlin require steps of greater specificity, so this conversation is
definitely expected. We currently have two sorts of steps for operating on
numbers: reducing steps like sum(), and the math() step for expressions.
It's interesting what you can accomplish with those two - note here how
Kelvin manages standard deviation without lambdas:

g.V().hasLabel('airport').
      values('runways').fold().as('runways').
      mean(local).as('mean').
      select('runways').unfold().
      math('(_-mean)^2').mean().math('sqrt(_)')

https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#stddevone
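
For anyone following along, that traversal computes the population standard
deviation (it divides by N, not N-1). Below is a minimal Python sketch of the
same arithmetic, purely as an illustration. The function name is made up, and
the 1, 2, 3 input is the sample 'ages' data from the original post, not the
airport data.

```python
import math

def population_stdev(values):
    """Mirror the traversal: mean, squared deviations, mean again, sqrt."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)               # mean(local).as('mean')
    squared = [(v - mean) ** 2 for v in values]    # math('(_-mean)^2')
    return math.sqrt(sum(squared) / len(squared))  # mean().math('sqrt(_)')

print(round(population_stdev([1, 2, 3]), 3))  # 0.816
```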

In any case, we can see that there is a fair bit of indirection there to do
the work of a simple stdev() step. I've often wondered if math() could
behave both in the way it does now and as a form of reducing step. In that
way we could quietly add new math functions without forming new steps, as I
can't help imagining that the addition of stdev() and percentile() will
then be followed by: variance(), covariance(), confidence() and so on.
Kelvin recently asked me about mult() for use cases that he sees from time
to time.

As it stands our math expression library exp4j:

https://www.objecthunter.net/exp4j/

is good at extensibility but isn't really well suited out of the box to
handle reducing operations because its architecture forces you to specify
the number of arguments it will take up front and those arguments must be
double:

https://www.objecthunter.net/exp4j/#Custom_functions

So, that would be an issue to contend with, but technical issues aside and
focusing instead on the user angle, would math() that worked as follows be
a good path?

gremlin> g.V().values('ages').fold().math(local, "stdev(_)")
==>0.816
gremlin> g.inject([1,2,3]).math(local, "product(_)")
==>6
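
To make the idea concrete, here is a rough Python sketch of the dispatch such
a reducing form would need: a name-to-function table whose functions take the
whole list rather than a fixed number of doubles. Everything in it (REDUCERS,
math_local, the tiny "name(_)" parser) is hypothetical and is not how exp4j
or TinkerPop work today. It also assumes stdev means the population standard
deviation, which is what the 0.816 result implies.

```python
import math
import re
import statistics

# Hypothetical registry: reducer name -> function over a whole list of numbers.
REDUCERS = {
    "stdev": statistics.pstdev,  # population standard deviation
    "product": math.prod,        # requires Python 3.8+
    "mean": statistics.fmean,
}

# Only the simplest form is recognised here: a single call such as "stdev(_)".
_CALL = re.compile(r"^\s*(\w+)\(\s*_\s*\)\s*$")

def math_local(values, expression):
    """Apply the named reducer to the locally scoped list of values."""
    match = _CALL.match(expression)
    if match is None or match.group(1) not in REDUCERS:
        raise ValueError(f"unsupported reducing expression: {expression!r}")
    return REDUCERS[match.group(1)](values)

print(round(math_local([1, 2, 3], "stdev(_)"), 3))  # 0.816
print(math_local([1, 2, 3], "product(_)"))          # 6
```

Adding variance() or similar would then be one more table entry rather than a
new step, which is the appeal of this path.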

And then, what distinction would there be between a math() step and first
class "math steps" like sum(), min(), max(), and mean()? In other words,
why would those exist if math() could already do it all? What makes a math
operation "common" enough to beget its own first class representation?

Just to be clear, I'm not saying we shouldn't add stdev()/percentile() - I
just want to consider all the design possibilities and talk them through.
Thanks again for bringing up this conversation. I will link this thread to
your JIRA for reference.

On Wed, Dec 9, 2020 at 6:40 AM js guo <jsguo8...@gmail.com> wrote:

> Hi team,
>
> We are using TinkerPop Gremlin in our risk detection cases. Some analytical
> calculations are used frequently, yet there are no corresponding steps at
> hand.
>
> I am thinking that some general analytical steps can be added in Gremlin.
> e.g. steps to calculate standard deviation and percentile. The example
> usage might be as follows.
> --------------------------------
> gremlin> g.V().values('ages')
> ==>1
> ==>2
> ==>3
> gremlin> g.V().values('ages').stdev()
> ==>0.816
> gremlin> g.V().values('ages').fold().stdev(Scope.local)
> ==>0.816
>
> gremlin> g.V().values('ages').percentile(50)
> ==>2
> // one percentile, return single value
> gremlin> g.V().values('ages').percentile(0, 100)
> ==>[0: 1, 100: 3]
> // multiple percentiles, return a map
> --------------------------------
>
> Sorry for not emailing earlier, I have created a JIRA ticket for this
> https://issues.apache.org/jira/browse/TINKERPOP-2487.
>
> As the new steps are already used in our cases, we are glad to offer the
> implementation for review, if you think it is good to add the two steps.
>
> Regards,
> Junshi
>
