The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is
the correct one.

You can actually work that one out yourself


BTW Hive also gives a wrong value. This is what I reported back in April
about Hive giving incorrect value

Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
STDDEV_SAMP.

The following query was used

SELECT

SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
AS MYSTDDEV,
        STDDEV(amount_sold) AS STDDEV,
        STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
        STDDEV_POP(amount_sold) AS STDDEV_POP
from    sales;

The following is from running the above query on Hive where STDDEV -->
 STDDEV_POP which is incorrect

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
260.72704617040444
|
+--------------------+---------------------+--------------------+---------------------+--+

The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
correct

+--------------------+---------------------+--------------------+---------------------+--+
|      mystddev      |       stddev        |    stddev_samp     |
stddev_pop      |
+--------------------+---------------------+--------------------+---------------------+--+
| 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
260.72704617042166  |
+--------------------+---------------------+--------------------+---------------------+--+

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:

> No, because these are different values defined differently. If you
> have 1 data point, the sample stdev is undefined while population
> stdev is defined. Refer to their definition.
>
> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mungeol....@gmail.com> wrote:
> > Hello,
> >
> > As I mentioned at the title, stddev_samp function gives a NaN while
> > stddev_pop gives a numeric value on the same data.
> > The stddev_samp function will give a numeric value, if I cast it to
> decimal.
> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
> > Is it a bug?
> >
> > Thanks
> >
> > - mungeol
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

Reply via email to