I’m guessing ETL_ROW_ID should be unique but not necessarily contain only 
consecutive numbers?

From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
Sent: Tuesday, June 28, 2016 10:57 PM
To: Markovitz, Dudu <dmarkov...@paypal.com>
Cc: user@hive.apache.org
Subject: Re: Query Performance Issue : Group By and Distinct and load on reducer

Hi Dudu,

You are correct ...ROW_NUMBER() is main culprit.

ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution?



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu 
<dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote:
The row_number operation seems to be skewed.

Dudu

From: @Sanjiv Singh 
[mailto:sanjiv.is...@gmail.com<mailto:sanjiv.is...@gmail.com>]
Sent: Tuesday, June 28, 2016 8:54 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Query Performance Issue : Group By and Distinct and load on reducer

Hi All,

I am having performance issue with data skew of the distinct statement in 
Hive<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
 See below query with DISTINCT operator.
Original Query :

SELECT DISTINCT
                 SD.REGION
                                ,SD.HEADEND
                                ,SD.NETWORK
                                ,SD.RETAILUNITCODE
                                ,SD.LOGTIMEDATE
                                ,SD.SPOTKEY
                                ,SD.CRE_DT
                                ,CASE
                                                WHEN SD.LOGTIMEDATE IS NULL
                                                                THEN 'Y'
                                                ELSE 'N'
                                                END AS DROP_REASON
                                ,ROW_NUMBER() OVER (
                                                ORDER BY NULL
                                                ) AS ETL_ROW_ID
                FROM INTER_ETL AS SD;

Table INTER_ETL used for query is big enough.
From the logs , it seems that data skew for specific set of values , causing 
one of reducer have to do all the job. I tried to achieve the same through 
GROUP BY still having the same issue.  Help me to understand the issue and 
resolution.
Query with Distinct V2 :

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT DISTINCT dt.*
FROM (
        SELECT SD.REGION
                    ,SD.HEADEND
                    ,SD.NETWORK
                    ,SD.RETAILUNITCODE
                    ,SD.LOGTIMEDATE
                    ,SD.SPOTKEY
                    ,SD.CRE_DT
                    ,CASE
                                WHEN SD.LOGTIMEDATE IS NULL
                                            THEN 'Y'
                                ELSE 'N'
                                END AS DROP_REASON
                    ,ROW_NUMBER() OVER (
                                ORDER BY NULL
                                ) AS ETL_ROW_ID
        FROM INTER_ETL AS SD
        ) AS dt;

Logs:

INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418


Query With Group By:

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT REGION
                    ,HEADEND
                    ,NETWORK
                    ,RETAILUNITCODE
                    ,LOGTIMEDATE
                    ,SPOTKEY
                    ,CRE_DT
                    ,DROP_REASON
                    ,ETL_ROW_ID
FROM (
        SELECT SD.REGION
                    ,SD.HEADEND
                    ,SD.NETWORK
                    ,SD.RETAILUNITCODE
                    ,SD.LOGTIMEDATE
                    ,SD.SPOTKEY
                    ,SD.CRE_DT
                    ,CASE
                                WHEN SD.LOGTIMEDATE IS NULL
                                            THEN 'Y'
                                ELSE 'N'
                                END AS DROP_REASON
                    ,ROW_NUMBER() OVER (
                                ORDER BY NULL
                                ) AS ETL_ROW_ID
        FROM INTER_ETL AS SD
        ) AS dt
GROUP BY
         REGION
                    ,HEADEND
                    ,NETWORK
                    ,RETAILUNITCODE
                    ,LOGTIMEDATE
                    ,SPOTKEY
                    ,CRE_DT
                    ,DROP_REASON
                    ,ETL_ROW_ID;

Logs:

INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418

Table details :

Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                 DFS Output    
                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
| Found 15 items                                                                
                                                             |
| -rwxrwxrwx   3 Z56034 hdfs 2075244899 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2090030620 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2025516774 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1986848213 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0  |
| -rwxrwxrwx   3 Z56034 hdfs 2018883723 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1984690335 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1987494444 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1974674515 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1963720218 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1965892384 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1974272622 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1971948208 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1968141886 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0  |
| -rwxrwxrwx   3 Z56034 hdfs 1970930771 2016-06-28 10:24 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0  |
| -rwxrwxrwx   3 Z56034 hdfs  192820628 2016-06-28 10:23 
/apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0  |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
 'numFiles'='15',
 'numRows'='108363614',



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

Reply via email to