[ https://issues.apache.org/jira/browse/HIVE-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-17182:
------------------------------------
    Description: 
On the TPC-DS 200g scale store_sales table,
run "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name              data_type               comment             
                 
ss_sold_time_sk         bigint                                      
ss_item_sk              bigint                                      
ss_customer_sk          bigint                                      
ss_cdemo_sk             bigint                                      
ss_hdemo_sk             bigint                                      
ss_addr_sk              bigint                                      
ss_store_sk             bigint                                      
ss_promo_sk             bigint                                      
ss_ticket_number        bigint                                      
ss_quantity             int                                         
ss_wholesale_cost       double                                      
ss_list_price           double                                      
ss_sales_price          double                                      
ss_ext_discount_amt     double                                      
ss_ext_sales_price      double                                      
ss_ext_wholesale_cost   double                                      
ss_ext_list_price       double                                      
ss_ext_tax              double                                      
ss_coupon_amt           double                                      
ss_net_paid             double                                      
ss_net_paid_inc_tax     double                                      
ss_net_profit           double                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
ss_sold_date_sk         bigint                                      
                 
# Detailed Table Information             
Database:               tpcds_bin_partitioned_parquet_200        
Owner:                  root                     
CreateTime:             Tue Jun 06 11:51:48 CST 2017     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                2023                
        numPartitions           1824                
        numRows                 575995635           
        rawDataSize             12671903970         
        totalSize               46465926745         
        transient_lastDdlTime   1496721108          
{code}
The rawDataSize is nearly 12G while the totalSize is nearly 46G.
The original data on HDFS:
{noformat}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{noformat} 
The Parquet files on HDFS:
{noformat}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{noformat}
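
As a cross-check on units (a minimal sketch, not part of the original report): the table parameters are byte counts, while `hadoop fs -du -h` is assumed here to print binary units (1 G = 2^30 bytes). Under that assumption, the 43.3 G reported for the Parquet directory and the totalSize of 46465926745 describe the same quantity. The helper name `to_gib` is illustrative only.

```python
# Byte counts copied from the "describe formatted store_sales" output.
RAW_DATA_SIZE = 12671903970
TOTAL_SIZE = 46465926745

def to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (2**30 bytes)."""
    return num_bytes / 2**30

def to_gb(num_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 10**9

for name, value in (("rawDataSize", RAW_DATA_SIZE), ("totalSize", TOTAL_SIZE)):
    print(f"{name}: {to_gb(value):.2f} GB (decimal), {to_gib(value):.2f} GiB (binary)")
```

Read this way, totalSize agrees with the on-disk Parquet size, so the open question is only what rawDataSize is measuring.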

It seems that the rawDataSize should be nearly 75G, but the "describe formatted
store_sales" command shows only 12G.
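
One observation that may help narrow this down (a sketch based only on the numbers above, not a confirmed root cause): rawDataSize divided by numRows comes out to a whole number that equals the count of non-partition columns. That would be consistent with rawDataSize counting one unit per column value rather than bytes, though the report itself does not establish this.

```python
# Values copied from the "describe formatted store_sales" output.
num_rows = 575995635
raw_data_size = 12671903970

# Non-partition columns listed in the same output.
columns = [
    "ss_sold_time_sk", "ss_item_sk", "ss_customer_sk", "ss_cdemo_sk",
    "ss_hdemo_sk", "ss_addr_sk", "ss_store_sk", "ss_promo_sk",
    "ss_ticket_number", "ss_quantity", "ss_wholesale_cost", "ss_list_price",
    "ss_sales_price", "ss_ext_discount_amt", "ss_ext_sales_price",
    "ss_ext_wholesale_cost", "ss_ext_list_price", "ss_ext_tax",
    "ss_coupon_amt", "ss_net_paid", "ss_net_paid_inc_tax", "ss_net_profit",
]

# Integer division keeps the check exact for values this large.
per_row, remainder = divmod(raw_data_size, num_rows)

print(f"rawDataSize / numRows = {per_row} (remainder {remainder})")
print(f"non-partition columns = {len(columns)}")
```

If the two printed numbers match with a zero remainder, the 12G figure is unlikely to be an estimate of uncompressed bytes, which would explain why it is far below both the 75.8 G source data and the 43.3 G Parquet files.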


  was:
On the TPC-DS 200g scale store_sales table,
run "describe formatted store_sales" to view the statistics:
{code}
hive> describe formatted store_sales;
OK
# col_name              data_type               comment             
                 
ss_sold_time_sk         bigint                                      
ss_item_sk              bigint                                      
ss_customer_sk          bigint                                      
ss_cdemo_sk             bigint                                      
ss_hdemo_sk             bigint                                      
ss_addr_sk              bigint                                      
ss_store_sk             bigint                                      
ss_promo_sk             bigint                                      
ss_ticket_number        bigint                                      
ss_quantity             int                                         
ss_wholesale_cost       double                                      
ss_list_price           double                                      
ss_sales_price          double                                      
ss_ext_discount_amt     double                                      
ss_ext_sales_price      double                                      
ss_ext_wholesale_cost   double                                      
ss_ext_list_price       double                                      
ss_ext_tax              double                                      
ss_coupon_amt           double                                      
ss_net_paid             double                                      
ss_net_paid_inc_tax     double                                      
ss_net_profit           double                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
ss_sold_date_sk         bigint                                      
                 
# Detailed Table Information             
Database:               tpcds_bin_partitioned_parquet_200        
Owner:                  root                     
CreateTime:             Tue Jun 06 11:51:48 CST 2017     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                2023                
        numPartitions           1824                
        numRows                 575995635           
        rawDataSize             12671903970         
        totalSize               46465926745         
        transient_lastDdlTime   1496721108          
{code}
The rawDataSize is nearly 12G while the totalSize is nearly 46G.
The original data on HDFS:
{format}
#hadoop fs -du -h /tmp/tpcds-generate/200/
75.8 G   /tmp/tpcds-generate/200/store_sales
{format} 
The Parquet files on HDFS:
{format}
# hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
{format}

It seems that the rawDataSize should be nearly 75G, but the "describe formatted
store_sales" command shows only 12G.



> Invalid statistics like "RAW DATA SIZE" info for parquet file
> -------------------------------------------------------------
>
>                 Key: HIVE-17182
>                 URL: https://issues.apache.org/jira/browse/HIVE-17182
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> On the TPC-DS 200g scale store_sales table,
> run "describe formatted store_sales" to view the statistics:
> {code}
> hive> describe formatted store_sales;
> OK
> # col_name                    data_type               comment             
>                
> ss_sold_time_sk       bigint                                      
> ss_item_sk            bigint                                      
> ss_customer_sk        bigint                                      
> ss_cdemo_sk           bigint                                      
> ss_hdemo_sk           bigint                                      
> ss_addr_sk            bigint                                      
> ss_store_sk           bigint                                      
> ss_promo_sk           bigint                                      
> ss_ticket_number      bigint                                      
> ss_quantity           int                                         
> ss_wholesale_cost     double                                      
> ss_list_price         double                                      
> ss_sales_price        double                                      
> ss_ext_discount_amt   double                                      
> ss_ext_sales_price    double                                      
> ss_ext_wholesale_cost double                                      
> ss_ext_list_price     double                                      
> ss_ext_tax            double                                      
> ss_coupon_amt         double                                      
> ss_net_paid           double                                      
> ss_net_paid_inc_tax   double                                      
> ss_net_profit         double                                      
>                
> # Partition Information                
> # col_name                    data_type               comment             
>                
> ss_sold_date_sk       bigint                                      
>                
> # Detailed Table Information           
> Database:             tpcds_bin_partitioned_parquet_200        
> Owner:                root                     
> CreateTime:           Tue Jun 06 11:51:48 CST 2017     
> LastAccessTime:       UNKNOWN                  
> Retention:            0                        
> Location:             hdfs://bdpe38:9000/user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
> Table Type:           MANAGED_TABLE            
> Table Parameters:              
>       COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
>       numFiles                2023                
>       numPartitions           1824                
>       numRows                 575995635           
>       rawDataSize             12671903970         
>       totalSize               46465926745         
>       transient_lastDdlTime   1496721108          
> {code}
> The rawDataSize is nearly 12G while the totalSize is nearly 46G.
> The original data on HDFS:
> {noformat}
> #hadoop fs -du -h /tmp/tpcds-generate/200/
> 75.8 G   /tmp/tpcds-generate/200/store_sales
> {noformat} 
> The Parquet files on HDFS:
> {noformat}
> # hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db
> 43.3 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_200.db/store_sales
> {noformat}
> It seems that the rawDataSize should be nearly 75G, but the "describe formatted
> store_sales" command shows only 12G.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
