I know stddev_samp and stddev_pop gives different values, because they
have different definition. What I want to know is why stddev_samp
gives "NaN", and not a numeric value.

On Thu, Jul 7, 2016 at 5:39 PM, Sean Owen <so...@cloudera.com> wrote:
> I don't think that's relevant here. The question is why would samp
> give a different result to pop, not the result of "stddev". Neither
> one is a 'correct' definition of standard deviation in the abstract;
> one or the other is correct depending on what standard deviation you
> are trying to measure.
>
> On Thu, Jul 7, 2016 at 9:37 AM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
>> The correct STDDEV function used is STDDEV_SAMP not STDDEV_POP.  That is the
>> correct one.
>>
>> You can actually work that one out yourself
>>
>>
>> BTW Hive also gives a wrong value. This is what I reported back in April
>> about Hive giving incorrect value
>>
>> Both Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP. Also I
>> did tests with Spark 1.6 as well.  Spark correctly points STTDEV to
>> STDDEV_SAMP.
>>
>> The following query was used
>>
>> SELECT
>>
>> SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))
>> AS MYSTDDEV,
>>         STDDEV(amount_sold) AS STDDEV,
>>         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
>>         STDDEV_POP(amount_sold) AS STDDEV_POP
>> from    sales;
>>
>> The following is from running the above query on Hive where STDDEV -->
>> STDDEV_POP which is incorrect
>>
>>
>> +--------------------+---------------------+--------------------+---------------------+--+
>> |      mystddev      |       stddev        |    stddev_samp     |
>> stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+--+
>> | 260.7270919450411  | 260.72704617040444  | 260.7270722861465  |
>> 260.72704617040444  |
>> +--------------------+---------------------+--------------------+---------------------+--+
>>
>> The following is from Spark-sql where STDDEV -->  STDDEV_SAMP which is
>> correct
>>
>> +--------------------+---------------------+--------------------+---------------------+--+
>> |      mystddev      |       stddev        |    stddev_samp     |
>> stddev_pop      |
>> +--------------------+---------------------+--------------------+---------------------+--+
>> | 260.7270919450411  | 260.7270722861637   | 260.7270722861637  |
>> 260.72704617042166  |
>> +--------------------+---------------------+--------------------+---------------------+--+
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 7 July 2016 at 09:29, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> No, because these are different values defined differently. If you
>>> have 1 data point, the sample stdev is undefined while population
>>> stdev is defined. Refer to their definition.
>>>
>>> On Thu, Jul 7, 2016 at 9:23 AM, Mungeol Heo <mungeol....@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > As I mentioned at the title, stddev_samp function gives a NaN while
>>> > stddev_pop gives a numeric value on the same data.
>>> > The stddev_samp function will give a numeric value, if I cast it to
>>> > decimal.
>>> > E.g. cast(stddev_samp(column_name) as decimal(16,3))
>>> > Is it a bug?
>>> >
>>> > Thanks
>>> >
>>> > - mungeol
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to