RE: Reg:Column Statistics with Parquet

Navdeep Agrawal Fri, 25 Jul 2014 00:19:28 -0700

This is not the proper way, but you can check the statistics directly in the 
metastore tables such as part_col_stats, if you are using MySQL as the stats 
database.
The other way is to run max, min and distinct on int columns, largest length on 
string columns, and so on. If these operations run a full MapReduce job, then 
the statistics are not being created.
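
A minimal sketch of the metastore check described above, to be run in the 
MySQL client against the metastore database rather than in Hive. The table 
name user_table and the partition values come from the quoted messages below. 
The database name 'default', the partition name format, and the exact column 
names are assumptions based on the metastore schema of around Hive 0.13, and 
they may differ in other versions.

```sql
-- Run against the MySQL metastore database, not in the Hive CLI.
-- Assumed: Hive database 'default', partition name 'dt=2014-06-01/hour=00'.
SELECT COLUMN_NAME, COLUMN_TYPE, NUM_NULLS, NUM_DISTINCTS,
       LONG_LOW_VALUE, LONG_HIGH_VALUE,
       DOUBLE_LOW_VALUE, DOUBLE_HIGH_VALUE, LAST_ANALYZED
FROM PART_COL_STATS
WHERE DB_NAME = 'default'
  AND TABLE_NAME = 'user_table'
  AND PARTITION_NAME = 'dt=2014-06-01/hour=00';
```

If this returns no rows for columns a, b and c, the column statistics were 
most likely never written for that partition.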

From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
Sent: Friday, July 25, 2014 12:43 PM
To: user@hive.apache.org
Subject: Re: Reg:Column Statistics with Parquet

Hi,

I tried the same with compute statistics for columns a, b, c as above and am 
still seeing the same results in the explain plan.

How do I confirm that it is generating the column stats for a given column? If 
this is confirmed, we can debug why Hive is still not using them.

Thanks
Suma

On Thu, Jul 24, 2014 at 11:49 PM, Prasanth Jayachandran 
<pjayachand...@hortonworks.com> wrote:
You have to explicitly specify the column list in the analyze command to 
gather column stats.

This command will only collect basic stats like number of rows, total file 
size, raw data size, number of files.
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

To collect column statistics add the column list like below
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics for columns a, b, c;
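
Put together, the sequence for a single partition can be sketched as below, 
using the table and column names from this thread. The where clause on the 
explain is an assumption and is not part of the original query: it limits the 
scan to the one partition that was analyzed, since column stats gathered for a 
single partition may not cover a query that reads every partition.

```sql
-- Basic stats only: number of rows, number of files, total and raw data size.
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

-- Column stats for a, b and c on the same partition.
analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics for columns a, b, c;

-- Re-check the plan. The where clause is an assumption added here so that
-- only the analyzed partition is scanned.
explain select min(a), max(b), min(c) from user_table where dt='2014-06-01' and hour='00';
```

If the column stats are being picked up, the Statistics lines in the plan 
should no longer report Column stats: NONE.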

Thanks
Prasanth Jayachandran

On Jul 24, 2014, at 5:13 AM, Sandeep Samudrala 
<sandeep.samudr...@inmobi.com> wrote:

I am trying to enable column statistics usage with Parquet tables. These are 
the queries I am executing. However, in the explain output, even though Basic 
stats shows COMPLETE, Column stats shows NONE.
Can someone please explain what else I need to do to debug/fix this?

set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.cbo.enable=true;

analyze table user_table partition(dt='2014-06-01',hour='00') compute statistics;

explain select min(a), max(b), min(c) from user_table;

hive> explain select min(a), max(b), min(c) from user_table;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_table
            Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: a (type: double), b (type: double), c (type: int)
              outputColumnNames: a, b, c
              Statistics: Num rows: 55490383 Data size: 1831182639 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: min(a), max(b), min(c)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0), max(VALUE._col1), min(VALUE._col2)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: double), _col1 (type: double), _col2 (type: int)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 20 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Thanks,
-sandeep

