RE: Standard deviation (STDDEV) function calculation in Hive

Mich Talebzadeh Wed, 01 Apr 2015 01:23:10 -0700

Hi Lefty,


The Hive aggregate functions as you provide just states:

 


DOUBLE

stddev_pop(col)

Returns the standard deviation of a numeric column in the group.


DOUBLE

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group.

 

There is no mention of STDDEV here. So this is what I did with the previous 
query runni9ng and displaying STDDEV, STDDEV_SAMP and STDDDEV_POP

 

0: jdbc:hive2://rhes564:10010/default> SELECT cust_id AS Customer_ID,

. . . . . . . . . . . . . . . . . . .> SUM(amount_sold) AS 
Total_customer_amount,

. . . . . . . . . . . . . . . . . . .> AVG(amount_sold) AS Average_order,

. . . . . . . . . . . . . . . . . . .> STDDEV(amount_sold) AS STDDEV,

. . . . . . . . . . . . . . . . . . .> STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,

. . . . . . . . . . . . . . . . . . .> STDDEV_POP(amount_sold) AS STDDEV_POP

. . . . . . . . . . . . . . . . . . .> FROM sales

. . . . . . . . . . . . . . . . . . .> GROUP BY cust_id

. . . . . . . . . . . . . . . . . . .> HAVING SUM(amount_sold) > 94000

. . . . . . . . . . . . . . . . . . .> AND AVG(amount_sold) < 
STDDEV_SAMP(amount_sold)

. . . . . . . . . . . . . . . . . . .> limit 3

. . . . . . . . . . . . . . . . . . .> ;

 

 

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| customer_id  | total_customer_amount  |    average_order    |       stddev    
   |    stddev_samp     |     stddev_pop     |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

| 429.0        | 94819.41000000006      | 410.4736363636366   | 
613.7057080691426  | 615.0384039014772  | 613.7057080691426  |

| 1743.0       | 94786.12999999993      | 398.2610504201678   | 
581.0439095219863  | 582.2684502048478  | 581.0439095219863  |

| 2994.0       | 94862.61000000006      | 417.89696035242315  | 
623.1607772763742  | 624.5379298449825  | 623.1607772763742  |

+--------------+------------------------+---------------------+--------------------+--------------------+--------------------+--+

3 rows selected (205.266 seconds)

 

So according to above what Hive calls STDDEV is a pointer to STDDEV_POP.

 

But Oracle and Sybase point STDDEV to STDDEV_SAMP not STDDEV_POP as shown in 
the results below from Oracle and Sybase respectively

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT AVERAGE_ORDER     STDDEV STDDEV_SAMP 
STDDEV_POP

----------- --------------------- ------------- ---------- ----------- 
----------

          429              94819.41    410.473636 615.038404  615.038404 
613.705708

         1743              94786.13     398.26105  582.26845   582.26845  
581.04391

         2994              94862.61     417.89696  624.53793   624.53793 
623.160777

 

 

 

Customer_ID                               Total_customer_amount                 
                  Average_order                       STDDEV                    
  STDDEV_SAMP                 STDDEV_POP

-------------------- ----------------------------------------- 
---------------------------------------------------- 
--------------------------- --------------------------- 
---------------------------

                  429                                  94819.41                 
                   410.4736363636363                  615.038404                
  615.038404                  613.705708

                 1743                                  94786.13                 
                   398.2610504201680                  582.268450                
  582.268450                  581.043910

                 2994                                  94862.61                 
                   417.8969603524229                  624.537930                
  624.537930                  623.160777

 

So may be the point goes beyond Hive documentation. The value provided by 
STDDEV in Hive does not appear to be industry standard

 

HTH

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and 
Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries 
or their employees, unless expressly so stated. It is the responsibility of the 
recipient to ensure that this email is virus free, therefore neither Peridale 
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Lefty Leverenz [mailto:leftylever...@gmail.com] 
Sent: 01 April 2015 07:14
To: user@hive.apache.org
Subject: Re: Standard deviation (STDDEV) function calculation in Hive

 

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is the 
same, I don't know.)  But the Hive documentation doesn't explain much:  
Built-in Aggregate Functions (UDAF) 
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)>
 .




-- Lefty

 

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

 

Basically, the standard deviation (STDDEV) is a measure that is used to 
quantify the amount of variation or dispersion of a set of data values. It is 
widely used in trading systems such as FX.

 

STDEDV in Hive is explained here 
<http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_stddev.html>
  as below and I quote:

 

The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard 
deviation and sample standard deviation, respectively, of the input values. 
(STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input 
rows matched by the query. The difference is that STDDEV_SAMP() is scaled by 
1/(N-1) while STDDEV_POP() is scaled by 1/N.

 

Fair enough to me the common STDDEV is what Hive calls STDDEV_SAMP and 
according to above it ought to be STDDEV_SAMP.

 

However, when I work these out (and also use straight forward calculation 
myself), it turns out that the alias seems to be to STDDVE_POP as opposed to 
STDDEV_SAMP!.

 

The following calculation shows this

 

SELECT

          rs.Customer_ID

        , rs.Total_customer_amount

       , rs.stddev

        , rs.sdddev_samp

        , rs.mystddev

FROM

(

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

        stddev(amount_sold) AS stddev,

        stddev_samp(amount_sold) AS sdddev_samp,

        CASE

        WHEN  COUNT (amount_sold) <= 1

           THEN  0

         ELSE  
SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1)-1))

         END AS mystddev

        FROM sales

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < stddev(amount_sold)

) rs

ORDER BY

          rs.Total_customer_amount DESC

;

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| rs.customer_id  | rs.total_customer_amount  |     rs.stddev      |   
rs.sdddev_samp   |    rs.mystddev     |

+-----------------+---------------------------+--------------------+--------------------+--------------------+--+

| 11407.0         | 103412.65999999995        | 622.221465710723   | 
623.4797510518939  | 623.4797510518938  |

| 10747.0         | 99578.08999999997         | 600.7615005975689  | 
601.9383117167412  | 601.9383117167412  |

 

OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are 
practically identical, whereas what is referred to as rs.stddev in Hive is not 
the one used in industry?  

 

To show I ran the same in Oracle and the below is the result.

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

With the Partitioning, OLAP, Data Mining and Real Application Testing options

 

CUSTOMER_ID TOTAL_CUSTOMER_AMOUNT                STDDEV              MYSTDDEV

----------- --------------------- --------------------- ---------------------

      11407             103412.66     623.4797510518940     623.4797510518940

      10747              99578.09     601.9383117167410     601.9383117167410

 

So sounds like for one reason or other what is called STDDEV in Hive and 
aliased to STDDEV_SAMP is incorrect?

 

Thanks,

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and 
Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries 
or their employees, unless expressly so stated. It is the responsibility of the 
recipient to ensure that this email is virus free, therefore neither Peridale 
Ltd, its subsidiaries nor their employees accept any responsibility.

RE: Standard deviation (STDDEV) function calculation in Hive

Reply via email to