[ 
https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122840#comment-16122840
 ] 

liyunzhang_intel commented on HIVE-17287:
-----------------------------------------

[~gopalv] or [~lirui]: after enabling "hive.optimize.ppd", the default partition 
/user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
is filtered out, so that part of the data is no longer loaded. But the group by 
is still skewed. I modified tpcds/query67.sql to write out the result of the 
join, to check whether the join output (before the group by) is itself skewed:
{code}
set hive.optimize.ppd=true;
set spark.app.name="query677.ppd.true";
drop table if exists result_677;
create table result_677 stored as TEXTFILE as
select i_category
                  ,i_class
                  ,i_brand
                  ,i_product_name
                  ,d_year
                  ,d_qoy
                  ,d_moy
                  ,s_store_id
                  ,store_sales.ss_sold_date_sk
                  ,store_sales.ss_item_sk
                  ,store_sales.ss_store_sk
            from store_sales
                ,date_dim
                ,store
                ,item
       where  store_sales.ss_sold_date_sk=date_dim.d_date_sk
          and store_sales.ss_item_sk=item.i_item_sk
          and store_sales.ss_store_sk = store.s_store_sk
          and d_month_seq between 1193 and 1193+11;
{code}

The result is:
{code}
hadoop fs -du -h hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/

105.5 M  hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000000_0
46.8 M   hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000001_0
4.0 M    hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000002_0
47.4 M   hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000003_0
215.1 M  hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000004_0
77.7 M   hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000005_0
0        hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000006_0
0        hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000007_0
0        hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000008_0
0        hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000009_0
0        hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_10.db/result_677/000010_0
{code}

The result of the join is skewed: the biggest output file is 215.1 M while the 
smallest ones are 0. Is there a way to make the output of the join even, so 
that the following group by is not skewed?
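
To put a number on the skew, here is a minimal sketch in Python. It is not part of the original test run; the helper names `parse_size` and `skew_ratio` are hypothetical, and the sizes are simply copied from the `hadoop fs -du -h` listing above. It compares the largest output file with the mean file size.

```python
# Sizes copied from the `hadoop fs -du -h` listing above
# (files 000000_0 through 000010_0 of result_677).
SIZES = ["105.5 M", "46.8 M", "4.0 M", "47.4 M", "215.1 M", "77.7 M",
         "0", "0", "0", "0", "0"]

# Binary multipliers for the unit suffixes printed by `du -h`.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a human-readable size such as '105.5 M' or '0' to bytes."""
    parts = text.split()
    value = float(parts[0])
    unit = parts[1] if len(parts) > 1 else ""
    return value * UNITS.get(unit, 1)


def skew_ratio(sizes):
    """Largest size divided by the mean size; 1.0 means perfectly even."""
    values = [parse_size(s) for s in sizes]
    mean = sum(values) / len(values)
    return max(values) / mean if mean else 0.0


if __name__ == "__main__":
    empty = sum(1 for s in SIZES if parse_size(s) == 0)
    print(f"max/mean = {skew_ratio(SIZES):.2f}")
    print(f"empty files = {empty} of {len(SIZES)}")
```

For the listing above this gives a max/mean ratio of roughly 4.8, with 5 of the 11 output files empty.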

> HoS can not deal with skewed data group by
> ------------------------------------------
>
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>
> In 
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
> the fact table {{store_sales}} is joined with the small tables {{date_dim}}, 
> {{item}} and {{store}}. After the join, a group by runs on the intermediate data.
> The {{store_sales}} data on 3TB tpcds is skewed: there are 1824 
> partitions. The biggest partition is 25.7G, while the others are about 715M.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the job to fail. Is there any way to 
> solve the group by problem for a skewed table? I tried enabling 
> {{hive.groupby.skewindata}} to first distribute the data more evenly and then 
> do the group by, but the job still hangs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
