Hi Unit,
*For new tables*
Disable stats autogathering in Hive when creating a new table and
populating it
SET hive.stats.autogather=false;
*Already existing tables*
As a work-around you can try this on the already existing tables by manually
alter the numRows to -1
ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
('numRows'='-1');
Example
0: jdbc:hive2://rhes564:10010/default> create table testme as select * from
sales_staging limit 1000;
0: jdbc:hive2://rhes564:10010/default> desc formatted testme;
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
| col_name |
data_type | comment |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
| # col_name |
data_type |
comment |
| |
NULL |
NULL |
| prod_id |
double
| |
| cust_id |
double
| |
| time_id |
string
| |
| channel_id |
double
| |
| promo_id |
double
| |
| quantity_sold |
double
| |
| amount_sold |
double
| |
| |
NULL |
NULL |
| # Detailed Table Information |
NULL |
NULL |
| Database: |
oraclehadoop |
NULL |
| Owner: |
hduser |
NULL |
| CreateTime: | Sat Apr 30 00:31:17 BST
2016 | NULL |
| LastAccessTime: |
UNKNOWN |
NULL |
| Retention: |
0 |
NULL |
| Location: |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme |
NULL |
| Table Type: |
MANAGED_TABLE |
NULL |
| Table Parameters: |
NULL |
NULL |
| |
COLUMN_STATS_ACCURATE |
{\"BASIC_STATS\":\"true\"} |
| |
numFiles |
1 |
| |
numRows |
1000 |
| |
rawDataSize |
54853 |
| |
totalSize |
55853 |
| |
transient_lastDdlTime |
1461972677 |
| |
NULL |
NULL |
| # Storage Information |
NULL |
NULL |
| SerDe Library: |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
NULL |
| InputFormat: |
org.apache.hadoop.mapred.TextInputFormat |
NULL |
| OutputFormat: |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
NULL |
| Compressed: |
No |
NULL |
| Num Buckets: |
-1 |
NULL |
| Bucket Columns: |
[] |
NULL |
| Sort Columns: |
[] |
NULL |
| Storage Desc Params: |
NULL |
NULL |
| |
serialization.format |
1 |
+-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme SET
TBLPROPERTIES ('numRows'='-1');*
0: jdbc:hive2://rhes564:10010/default> desc formatted testme;
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
| col_name |
data_type | comment |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
| # col_name |
data_type |
comment |
| |
NULL |
NULL |
| prod_id |
double
| |
| cust_id |
double
| |
| time_id |
string
| |
| channel_id |
double
| |
| promo_id |
double
| |
| quantity_sold |
double
| |
| amount_sold |
double
| |
| |
NULL |
NULL |
| # Detailed Table Information |
NULL |
NULL |
| Database: |
oraclehadoop |
NULL |
| Owner: |
hduser |
NULL |
| CreateTime: | Sat Apr 30 00:31:17 BST
2016 | NULL |
| LastAccessTime: |
UNKNOWN |
NULL |
| Retention: |
0 |
NULL |
| Location: |
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme |
NULL |
| Table Type: |
MANAGED_TABLE |
NULL |
| Table Parameters: |
NULL |
NULL |
| |
last_modified_by |
hduser |
| |
last_modified_time |
1461973002 |
| |
numFiles |
1 |
| |
numRows |
-1 |
| |
rawDataSize |
54853 |
| |
totalSize |
55853 |
| |
transient_lastDdlTime |
1461973002 |
| |
NULL |
NULL |
| # Storage Information |
NULL |
NULL |
| SerDe Library: |
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
NULL |
| InputFormat: |
org.apache.hadoop.mapred.TextInputFormat |
NULL |
| OutputFormat: |
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
NULL |
| Compressed: |
No |
NULL |
| Num Buckets: |
-1 |
NULL |
| Bucket Columns: |
[] |
NULL |
| Sort Columns: |
[] |
NULL |
| Storage Desc Params: |
NULL |
NULL |
| |
serialization.format |
1 |
+-------------------------------+-----------------------------------------------------------------+-----------------------+--+
Hopefully that will turn off the autogather feature for existing tables.
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 29 April 2016 at 23:32, Udit Mehta <[email protected]> wrote:
> Hi,
>
> Thanks for the replies.
> We have a scenario where we have an ETL job inserting into a table with
> thousands of partitions using dynamic partitioning. We have certain SLA's
> within which we would like the job to finish and sometimes there are
> scenarios where they are missed (extra data or a busy cluster). I
> understand that stats are essential for Hive CBO but we are trying to
> explore how much overhead do these stats collection add to the job runtime.
> A lot of these tables are intermediary tables so having stats for them
> might not be entirely necessary.
>
> I just wanted to figure if there was a easy way to disable the stats and
> then compare the performance.
>
> Mich, can you give more information on how to disable it in the table
> struct as I cant find any documentation on it.
>
> Thanks again.
> Udit
>
> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <[email protected]>
> wrote:
>
>> Hi Udit,
>>
>> Could u be more specific about your problem? Like, what settings you
>> have, what query you run and what is the result and what result do you
>> expect?
>>
>> From what you said, my understanding is that, you want to wipe out
>> the basic stats for existing tables? And, could u also let us know why you
>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>> and we are moving towards the direction to make table/column stats
>> collection automatically. It seems that you prefer an opposite direction.
>> There is nothing wrong here and we would like to listen to your idea and
>> motivation so that we can better design Hive stats collection. Thanks!
>>
>> Best
>> Pengcheng
>>
>>
>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <[email protected]> wrote:
>>
>>> Any insights on this?
>>>
>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <[email protected]> wrote:
>>>
>>>> Update: Realized this works if we create a fresh table with this config
>>>> already disabled but does not work if there is already a table created when
>>>> this config was enabled. We now need to figure out how to disable this
>>>> config for a table created when this config was true.
>>>>
>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <[email protected]> wrote:
>>>>
>>>>> Hive version we are using is 1.2.1.
>>>>>
>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config
>>>>>> change doesnt seem to go through. We modified this config in the
>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change
>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *set hive.stats.autogather=false;*
>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>> want to compute stats in out jobs.
>>>>>>
>>>>>> Thanks,
>>>>>> Udit
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>