[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

Hoss Man (JIRA) Thu, 07 Dec 2017 11:13:19 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282354#comment-16282354
 ]


Hoss Man commented on SOLR-11725:
---------------------------------



bq. This does bring up the question of what to do when N=1 (or N=0 for that 
matter).

I omitted them from my original description for brevity, to focus on the bigger 
picture of the equations, but for the record the full implementation of stddev in 
each of the two classes mentioned is...

* {{StddevAgg.java}}: {code}
double val = count == 0 ? 0.0d : Math.sqrt((sumSq/count)-Math.pow(sum/count, 2));
return val;
{code}
* {{StatsValuesFactory.java}}: {code}
if (count <= 1.0D) {
  return 0.0D;
}

return Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 1.0D)));
{code}
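
To make the difference between those two snippets concrete, here is a minimal sketch that wraps each expression in a standalone method and runs both over the same values. The class name, method names, and sample data are made up for illustration and are not taken from the Solr source; only the two arithmetic expressions come from the snippets above.

```java
public class StddevFormulas {
    // Same expression as the StddevAgg.java snippet ("uncorrected" form, divides by N).
    static double uncorrected(double sum, double sumSq, long count) {
        return count == 0 ? 0.0d : Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    // Same expression as the StatsValuesFactory.java snippet ("corrected" form, divides by N-1).
    static double corrected(double sum, double sumSq, long count) {
        if (count <= 1) {
            return 0.0d;
        }
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0d)));
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical sample data
        double sum = 0, sumSq = 0;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
        }
        System.out.println("uncorrected: " + uncorrected(sum, sumSq, values.length));
        System.out.println("corrected:   " + corrected(sum, sumSq, values.length));
    }
}
```

For these eight values the uncorrected form gives 2.0 and the corrected form gives roughly 2.14, so with a small N the gap is easy to see.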


bq. When N=0, the current code produces 0, but I don't think that's the best 
choice. ...

Agreed, it should really be 'null' (or 'NaN')

(i'm not sure why {{StatsValuesFactory.java}} currently returns {{0.0D}} when 
{{count==0}} ... other {{StatsValuesFactory.java}} stats like min/max correctly 
return 'null' ... it's weird)

bq. ...In general we've been moving toward omitting undefined functions. Stats 
like min() and max() already do this.

Whoa... really? ... that seems like it would make the client parsing really 
hard...

You're saying users can't expect that every "facet" key they specify in the 
request will be included in the response? (in the event it's 'null' or 'NaN' or 
whatever makes sense given its data type)  Why???

bq. I'd be tempted to treat N=0 and N=1 as undefined

As I said, for N=0 I agree with you that the result should be 
"undefined/null/NaN" (and if that means that it's excluded from the response to 
be consistent with the existing behavior in {{json.facet}} then so be it) ... 
but i'm a big "-1" (vote, i mean, not math) on treating stddev(N=1) as 
"undefined" ... that makes no sense to me.  

For a singleton set, the stddev() should *absolutely* be "0" -- all of the 
value(s) in the set are identical, the amount of deviation between the value(s) 
in the set is "none".  For the purpose of comparing the "consistency" of this 
set to any other sets, you know that this set is as consistent as it can 
possibly be.

Why should {{stddev(\[42])}} be any different than 
{{stddev(\[42,42,42,42,42,....])}} ????
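
For reference, this is roughly what the "Corrected sample stddev" arithmetic does at N=1 when nothing guards it. The sketch below is a hypothetical helper, not Solr code: it evaluates the same expression used in {{StatsValuesFactory.java}} but without the {{count <= 1.0D}} check. For a single value the numerator and the denominator are both zero, so Java's floating point division yields NaN, while a set of identical values with N > 1 yields 0.

```java
public class SingletonStddev {
    // Corrected sample stddev expression with NO small-count guard (hypothetical helper).
    static double correctedUnguarded(double sum, double sumSq, double count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0d)));
    }

    public static void main(String[] args) {
        // [42]: sum = 42, sumSq = 1764, count = 1, so the division is 0.0 / 0.0
        System.out.println(correctedUnguarded(42, 1764, 1));
        // [42, 42, 42]: sum = 126, sumSq = 5292, count = 3, so the division is 0.0 / 6.0
        System.out.println(correctedUnguarded(126, 5292, 3));
    }
}
```

So whichever answer is chosen for N=1 ("0" or "undefined"), it has to come from an explicit check like the existing guard; the formula alone won't produce a usable number there.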

bq. Oh, and whatever treatment we give stddev(), we should presumably give to 
variance()?

I would assume so, but first i'd have to go refresh my memory on how exactly 
variance differs from stddev :)




> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11725
>                 URL: https://issues.apache.org/jira/browse/SOLR-11725
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking, the two calculations don't tend to differ 
> significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

