Hi Dudu,

Thanks for your help and the proactive response.
Today I have verified all the solutions you provided; they worked for me on the given table with 6 billion records. Before I conclude anything, I want to check whether there is any reference document/link available for these algorithms/approaches. It would be good if you could share it. Any help is really appreciated. Thanks much again.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Fri, Jul 1, 2016 at 2:06 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

> My pleasure.
>
> Just to make clear –
>
> The version with the non-consecutive values (1) is much more efficient
> than the version with the consecutive values (3), so if possible, go with
> (1).
>
> Dudu
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
> *Sent:* Friday, July 01, 2016 8:24 PM
> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
> *Cc:* user@hive.apache.org
> *Subject:* Re: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> Thanks, really appreciate it.
>
> I will try this and will respond with results.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Fri, Jul 1, 2016 at 6:50 AM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> 3.
>
> This is working code for consecutive values.
>
> MyColumn should be a column (or a list of columns) with a good uniform
> distribution.
> with group_rows
>   as
>   (
>     select  abs(hash(MyColumn))%10000  as group_id
>            ,count (*)                  as cnt
>
>     from    INTER_ETL
>
>     group by abs(hash(MyColumn))%10000
>   )
>
>   ,group_rows_accumulated
>   as
>   (
>     select  g1.group_id
>            ,sum (g2.cnt) - min (g1.cnt)  as accumulated_rows
>
>     from        group_rows as g1
>
>     cross join  group_rows as g2
>
>     where   g2.group_id <= g1.group_id
>
>     group by g1.group_id
>   )
>
> select  t.*
>        ,row_number () over (partition by a.group_id order by null) + a.accumulated_rows  as ETL_ROW_ID
>
> from    INTER_ETL as t
>
> join    group_rows_accumulated as a
>
> on      a.group_id = abs(hash(MyColumn))%10000
> ;
>
> *From:* Markovitz, Dudu [mailto:dmarkov...@paypal.com]
> *Sent:* Thursday, June 30, 2016 12:43 PM
> *To:* user@hive.apache.org; sanjiv.is...@gmail.com
> *Subject:* RE: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> 1.
>
> This works.
>
> I’ve recalled that the CAST is needed since FLOOR defaults to FLOAT.
> select  (cast (floor(r*1000000) as bigint) + 1) + 1000000L * (row_number () over (partition by (cast (floor(r*1000000) as bigint) + 1) order by null) - 1)  as ETL_ROW_ID
>
> from    (select *,rand() as r from INTER_ETL) as t
> ;
>
> Here is a test result from our dev system
>
> select  min (ETL_ROW_ID)    as min_ETL_ROW_ID
>        ,count (ETL_ROW_ID)  as count_ETL_ROW_ID
>        ,max (ETL_ROW_ID)    as max_ETL_ROW_ID
>
> from    (select (cast (floor(r*1000000) as bigint) + 1) + 1000000L * (row_number () over (partition by (cast (floor(r*1000000) as bigint) + 1) order by null) - 1)  as ETL_ROW_ID
>
>          from (select *,rand() as r from INTER_ETL) as t
>         ) as t
> ;
>
> min_ETL_ROW_ID    count_ETL_ROW_ID    max_ETL_ROW_ID
> 1                 39567412227         40529759537
>
> *From:* Markovitz, Dudu [mailto:dmarkov...@paypal.com <dmarkov...@paypal.com>]
> *Sent:* Wednesday, June 29, 2016 11:37 PM
> *To:* sanjiv.is...@gmail.com
> *Cc:* user@hive.apache.org
> *Subject:* RE: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> 1.
>
> This is strange.
>
> The negative numbers are due to overflow of the ‘int’ type, but for that
> reason exactly I’ve cast the expressions in my code to ‘bigint’.
>
> I’ve tested this code before sending it to you and it worked fine,
> returning results that are beyond the range of the ‘int’ type.
>
> Please try this:
>
> select  *
>        ,(floor(r*1000000) + 1) + (1000000L * (row_number () over (partition by (floor(r*1000000) + 1) order by null) - 1))  as ETL_ROW_ID
>
> from    (select *,rand() as r from INTER_ETL) as t
> ;
>
> 2.
>
> Great
>
> 3.
>
> Sorry, hadn’t had the time to test it (nor the change I’m going to suggest
> now… :) )
>
> Please check if the following code works and if so, replace the ‘a’
> subquery code with it.
> select  a1.group_id
>        ,sum (a2.cnt) - a1.cnt  as accum_rows
>
> from    (select  abs(hash(MyCol1,MyCol2))%1000  as group_id
>                 ,count (*)                      as cnt
>
>          from    INTER_ETL
>
>          group by abs(hash(MyCol1,MyCol2))%1000
>         ) as a1
>
> cross join
>         (select  abs(hash(MyCol1,MyCol2))%1000  as group_id
>                 ,count (*)                      as cnt
>
>          from    INTER_ETL
>
>          group by abs(hash(MyCol1,MyCol2))%1000
>         ) as a2
>
> where   a2.group_id <= a1.group_id
>
> group by a1.group_id
> ;
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com <sanjiv.is...@gmail.com>]
> *Sent:* Wednesday, June 29, 2016 10:55 PM
> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
> *Cc:* user@hive.apache.org
> *Subject:* Re: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> Hi Dudu,
>
> I tried the same on the same table, which has 6357592675 rows. See the
> results for all three.
>
> *I tried the 1st one; it's giving duplicate row numbers.*
>
> CREATE TEMPORARY TABLE INTER_ETL_T AS
> select *
> ,cast (floor(r*1000000) + 1 as bigint) + (1000000 * (row_number () over
> (partition by cast (floor(r*1000000) + 1 as bigint) order by null) - 1))
> as ROW_NUM
> from (select *,rand() as r from INTER_ETL) as t ;
>
> select ROW_NUM, count(*) from INTER_ETL_T group by ROW_NUM having count(*) > 1
> limit 10;
>
> +--------------+------+--+
> |   ROW_NUM    | _c1  |
> +--------------+------+--+
> | -2146932303  | 2    |
> | -2146924922  | 2    |
> | -2146922710  | 2    |
> | -2146901450  | 2    |
> | -2146897115  | 2    |
> | -2146874805  | 2    |
> | -2146869449  | 2    |
> | -2146865918  | 2    |
> | -2146864595  | 2    |
> | -2146857688  | 2    |
> +--------------+------+--+
>
> On the 2nd one, it is not giving any duplicates and was much faster than
> plain ROW_NUMBER(), at least.
> numRows=6357592675, totalSize=405516934422, rawDataSize=399159341747
>
> *And on the 3rd, for consecutive numbers, the query is not compatible with Hive.*
>
> CREATE TEMPORARY TABLE INTER_ETL_T AS
> select *
> ,a.accum_rows + row_number () over (partition by
> abs(hash(t.m_d_key,t.s_g_key))%10000 order by null) as ROW_NUM
> from INTER_ETL as t
> join (select abs(hash(m_d_key,s_g_key))%10000 as group_id
> ,sum (count (*)) over (order by m_d_key,s_g_key rows between unbounded
> preceding and 1 preceding) - count(*) as accum_rows
> from INTER_ETL
> group by abs(hash(m_d_key,s_g_key))%10000
> ) as a
> on a.group_id = abs(hash(t.m_d_key,t.s_g_key))%10000
> ;
>
> Error :
>
> Error: Error while compiling statement: FAILED: SemanticException End of a
> WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=40000)
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Jun 28, 2016 at 6:16 PM, @Sanjiv Singh <sanjiv.is...@gmail.com>
> wrote:
>
> thanks a lot.
>
> let me give it a try.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Jun 28, 2016 at 5:32 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> There’s a distributed algorithm for window functions that is based on the
> ORDER BY clause rather than the PARTITION BY clause.
>
> I doubt it is implemented in Hive, but it’s worth a shot.
> select  *
>        ,row_number () over (order by rand())  as ETL_ROW_ID
> from    INTER_ETL
> ;
>
> For unique, not consecutive values you can try this:
>
> select  *
>        ,cast (floor(r*1000000) + 1 as bigint) + (1000000 * (row_number () over (partition by cast (floor(r*1000000) + 1 as bigint) order by null) - 1))  as ETL_ROW_ID
>
> from    (select *,rand() as r from INTER_ETL) as t
> ;
>
> If your table has a column/combination of columns with a uniform
> distribution you can also do something like this:
>
> select  *
>        ,(abs(hash(MyCol1,MyCol2))%1000000 + 1) + (row_number () over (partition by (abs(hash(MyCol1,MyCol2))%1000000 + 1) order by null) - 1) * 1000000L  as ETL_ROW_ID
>
> from    INTER_ETL
> ;
>
> For consecutive values you can do something (ugly…) like this:
>
> select  *
>        ,a.accum_rows + row_number () over (partition by abs(hash(t.MyCol1,t.MyCol2))%10000 order by null)  as ETL_ROW_ID
>
> from    INTER_ETL as t
>
> join    (select  abs(hash(MyCol1,MyCol2))%10000  as group_id
>                 ,sum (count (*)) over (order by MyCol1,MyCol2 rows between unbounded preceding and 1 preceding) - count(*)  as accum_rows
>
>          from    INTER_ETL
>
>          group by abs(hash(MyCol1,MyCol2))%10000
>         ) as a
>
> on      a.group_id = abs(hash(t.MyCol1,t.MyCol2))%10000
> ;
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
> *Sent:* Tuesday, June 28, 2016 11:52 PM
> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
> *Cc:* user@hive.apache.org
> *Subject:* Re: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> ETL_ROW_ID is supposed to be a consecutive number. I need to check whether
> having only unique (non-consecutive) numbers would break any logic.
>
> Considering unique numbers for the ETL_ROW_ID column, what are the optimal
> options available?
>
> And what if it has to be consecutive numbers only?
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Jun 28, 2016 at 4:17 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> I’m guessing ETL_ROW_ID should be unique but not necessarily contain only
> consecutive numbers?
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
> *Sent:* Tuesday, June 28, 2016 10:57 PM
> *To:* Markovitz, Dudu <dmarkov...@paypal.com>
> *Cc:* user@hive.apache.org
> *Subject:* Re: Query Performance Issue : Group By and Distinct and load
> on reducer
>
> Hi Dudu,
>
> You are correct... ROW_NUMBER() is the main culprit.
>
> ROW_NUMBER() OVER is not fast enough with a large result set; is there any
> good solution?
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> The row_number operation seems to be skewed.
>
> Dudu
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is...@gmail.com]
> *Sent:* Tuesday, June 28, 2016 8:54 PM
> *To:* user@hive.apache.org
> *Subject:* Query Performance Issue : Group By and Distinct and load on
> reducer
>
> Hi All,
>
> I am having a performance issue with data skew in the DISTINCT statement in
> Hive
> <http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
> See the query below with the DISTINCT operator.
>
> *Original Query :*
>
> SELECT DISTINCT
>     SD.REGION
>     ,SD.HEADEND
>     ,SD.NETWORK
>     ,SD.RETAILUNITCODE
>     ,SD.LOGTIMEDATE
>     ,SD.SPOTKEY
>     ,SD.CRE_DT
>     ,CASE
>         WHEN SD.LOGTIMEDATE IS NULL THEN 'Y'
>         ELSE 'N'
>      END AS DROP_REASON
>     ,ROW_NUMBER() OVER (ORDER BY NULL) AS ETL_ROW_ID
> FROM INTER_ETL AS SD;
>
> The table *INTER_ETL* used in the query is quite big.
> From the logs, it seems that data skew on a specific set of values is
> causing one reducer to do all the work. I tried to achieve the same result
> through GROUP BY but am still having the same issue. Please help me
> understand the issue and its resolution.
> *Query with Distinct V2 :*
>
> CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT DISTINCT dt.*
> FROM (
>     SELECT SD.REGION
>         ,SD.HEADEND
>         ,SD.NETWORK
>         ,SD.RETAILUNITCODE
>         ,SD.LOGTIMEDATE
>         ,SD.SPOTKEY
>         ,SD.CRE_DT
>         ,CASE
>             WHEN SD.LOGTIMEDATE IS NULL THEN 'Y'
>             ELSE 'N'
>          END AS DROP_REASON
>         ,ROW_NUMBER() OVER (ORDER BY NULL) AS ETL_ROW_ID
>     FROM INTER_ETL AS SD
>     ) AS dt;
>
> Logs:
>
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 Reducer 2: 417(+1)/418 Reducer 3: 0(+56)/418
> INFO : Map 1: 107/107 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418
>
> *Query With Group By:*
>
> CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT REGION
>     ,HEADEND
>     ,NETWORK
>     ,RETAILUNITCODE
>     ,LOGTIMEDATE
>     ,SPOTKEY
>     ,CRE_DT
>     ,DROP_REASON
>     ,ETL_ROW_ID
> FROM (
>     SELECT SD.REGION
>         ,SD.HEADEND
>         ,SD.NETWORK
>         ,SD.RETAILUNITCODE
>         ,SD.LOGTIMEDATE
>         ,SD.SPOTKEY
>         ,SD.CRE_DT
>         ,CASE
>             WHEN SD.LOGTIMEDATE IS NULL THEN 'Y'
>             ELSE 'N'
>          END AS DROP_REASON
>         ,ROW_NUMBER() OVER (ORDER BY NULL) AS ETL_ROW_ID
>     FROM INTER_ETL AS SD
>     ) AS dt
> GROUP BY REGION
>     ,HEADEND
>     ,NETWORK
>     ,RETAILUNITCODE
>     ,LOGTIMEDATE
>     ,SPOTKEY
>     ,CRE_DT
>     ,DROP_REASON
>     ,ETL_ROW_ID;
>
> Logs:
>
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 Reducer 2: 417(+1)/418 Reducer 3: 0(+418)/418
> INFO : Map 1: 818/818 *Reducer 2: 417(+1)/418* Reducer 3: 0(+418)/418
>
> *Table details :*
>
> Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
>
> | DFS Output |
> | Found 15 items |
> | -rwxrwxrwx 3 Z56034 hdfs 2075244899 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 2090030620 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 2025516774 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1986848213 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 2018883723 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1984690335 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1987494444 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1974674515 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1963720218 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1965892384 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1974272622 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1971948208 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1968141886 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 1970930771 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0 |
> | -rwxrwxrwx 3 Z56034 hdfs 192820628  2016-06-28 10:23 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0 |
>
> 'numFiles'='15',
> 'numRows'='108363614',
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
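[Editor's note, not part of the original thread: the arithmetic behind the unique-but-not-consecutive scheme discussed above can be sketched outside Hive. The Python below is illustrative only — `STRIDE` and `assign_ids` are hypothetical names, and the random bucketing stands in for Hive's `rand()`/`floor` — but it shows why ids of the form `bucket + STRIDE * (rank - 1)` can never collide: the bucket is recoverable as `((id - 1) % STRIDE) + 1`, and the rank is unique within each bucket.]

```python
import random
from collections import defaultdict

STRIDE = 1_000_000  # analogue of the 1000000L constant in the HiveQL above

def assign_ids(n_rows):
    """Assign a unique id to each of n_rows rows.

    Each row lands in a random bucket in [1, STRIDE] (like
    floor(r*1000000)+1 over rand()); within a bucket, rows are ranked
    1, 2, 3, ... (the row_number() analogue), and the final id is
    bucket + STRIDE * (rank - 1).
    """
    rank_in_bucket = defaultdict(int)  # current rank per bucket
    ids = []
    for _ in range(n_rows):
        bucket = int(random.random() * STRIDE) + 1
        rank_in_bucket[bucket] += 1
        ids.append(bucket + STRIDE * (rank_in_bucket[bucket] - 1))
    return ids

ids = assign_ids(2_000_000)
assert len(set(ids)) == len(ids)  # unique, though not consecutive
assert min(ids) >= 1
```

A bucket that receives k rows contributes the ids bucket, bucket+STRIDE, ..., bucket+(k-1)*STRIDE, which is why the non-consecutive variant (1) parallelizes per bucket, while the consecutive variant (3) needs the extra accumulated-count join to close the gaps between buckets.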