Re: Disable Hive autogather optimization

Mich Talebzadeh Fri, 29 Apr 2016 16:43:37 -0700

apologies should read "Udit"

Dr Mich Talebzadeh




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 30 April 2016 at 00:35, Mich Talebzadeh <[email protected]>
wrote:

> Hi Unit,
>
> *For new tables*
>
> Disable stats autogathering in Hive when creating a new table  and
> populating it
>
> SET hive.stats.autogather=false;
>
> *Already existing tables*
>
> As a work-around you can try this on the already existing tables  by manually
> alter the numRows to -1
>
> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
> ('numRows'='-1');
>
> Example
>
> 0: jdbc:hive2://rhes564:10010/default> create table testme as select *
> from sales_staging limit 1000;
>
> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
> |           col_name            |
> data_type                            |           comment           |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
> | # col_name                    |
> data_type                                                       |
> comment                     |
> |                               |
> NULL                                                            |
> NULL                        |
> | prod_id                       |
> double
> |                             |
> | cust_id                       |
> double
> |                             |
> | time_id                       |
> string
> |                             |
> | channel_id                    |
> double
> |                             |
> | promo_id                      |
> double
> |                             |
> | quantity_sold                 |
> double
> |                             |
> | amount_sold                   |
> double
> |                             |
> |                               |
> NULL                                                            |
> NULL                        |
> | # Detailed Table Information  |
> NULL                                                            |
> NULL                        |
> | Database:                     |
> oraclehadoop                                                    |
> NULL                        |
> | Owner:                        |
> hduser                                                          |
> NULL                        |
> | CreateTime:                   | Sat Apr 30 00:31:17 BST
> 2016                                    | NULL                        |
> | LastAccessTime:               |
> UNKNOWN                                                         |
> NULL                        |
> | Retention:                    |
> 0                                                               |
> NULL                        |
> | Location:                     |
> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
> NULL                        |
> | Table Type:                   |
> MANAGED_TABLE                                                   |
> NULL                        |
> | Table Parameters:             |
> NULL                                                            |
> NULL                        |
> |                               |
> COLUMN_STATS_ACCURATE                                           |
> {\"BASIC_STATS\":\"true\"}  |
> |                               |
> numFiles                                                        |
> 1                           |
> |                               |
> numRows                                                         |
> 1000                        |
> |                               |
> rawDataSize                                                     |
> 54853                       |
> |                               |
> totalSize                                                       |
> 55853                       |
> |                               |
> transient_lastDdlTime                                           |
> 1461972677                  |
> |                               |
> NULL                                                            |
> NULL                        |
> | # Storage Information         |
> NULL                                                            |
> NULL                        |
> | SerDe Library:                |
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
> NULL                        |
> | InputFormat:                  |
> org.apache.hadoop.mapred.TextInputFormat                        |
> NULL                        |
> | OutputFormat:                 |
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
> NULL                        |
> | Compressed:                   |
> No                                                              |
> NULL                        |
> | Num Buckets:                  |
> -1                                                              |
> NULL                        |
> | Bucket Columns:               |
> []                                                              |
> NULL                        |
> | Sort Columns:                 |
> []                                                              |
> NULL                        |
> | Storage Desc Params:          |
> NULL                                                            |
> NULL                        |
> |                               |
> serialization.format                                            |
> 1                           |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+
>
> 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme  SET
> TBLPROPERTIES ('numRows'='-1');*
>
> 0: jdbc:hive2://rhes564:10010/default> desc formatted  testme;
>
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
> |           col_name            |
> data_type                            |        comment        |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
> | # col_name                    |
> data_type                                                       |
> comment               |
> |                               |
> NULL                                                            |
> NULL                  |
> | prod_id                       |
> double
> |                       |
> | cust_id                       |
> double
> |                       |
> | time_id                       |
> string
> |                       |
> | channel_id                    |
> double
> |                       |
> | promo_id                      |
> double
> |                       |
> | quantity_sold                 |
> double
> |                       |
> | amount_sold                   |
> double
> |                       |
> |                               |
> NULL                                                            |
> NULL                  |
> | # Detailed Table Information  |
> NULL                                                            |
> NULL                  |
> | Database:                     |
> oraclehadoop                                                    |
> NULL                  |
> | Owner:                        |
> hduser                                                          |
> NULL                  |
> | CreateTime:                   | Sat Apr 30 00:31:17 BST
> 2016                                    | NULL                  |
> | LastAccessTime:               |
> UNKNOWN                                                         |
> NULL                  |
> | Retention:                    |
> 0                                                               |
> NULL                  |
> | Location:                     |
> hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme  |
> NULL                  |
> | Table Type:                   |
> MANAGED_TABLE                                                   |
> NULL                  |
> | Table Parameters:             |
> NULL                                                            |
> NULL                  |
> |                               |
> last_modified_by                                                |
> hduser                |
> |                               |
> last_modified_time                                              |
> 1461973002            |
> |                               |
> numFiles                                                        |
> 1                     |
> |                               |
> numRows                                                         |
> -1                    |
> |                               |
> rawDataSize                                                     |
> 54853                 |
> |                               |
> totalSize                                                       |
> 55853                 |
> |                               |
> transient_lastDdlTime                                           |
> 1461973002            |
> |                               |
> NULL                                                            |
> NULL                  |
> | # Storage Information         |
> NULL                                                            |
> NULL                  |
> | SerDe Library:                |
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              |
> NULL                  |
> | InputFormat:                  |
> org.apache.hadoop.mapred.TextInputFormat                        |
> NULL                  |
> | OutputFormat:                 |
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat      |
> NULL                  |
> | Compressed:                   |
> No                                                              |
> NULL                  |
> | Num Buckets:                  |
> -1                                                              |
> NULL                  |
> | Bucket Columns:               |
> []                                                              |
> NULL                  |
> | Sort Columns:                 |
> []                                                              |
> NULL                  |
> | Storage Desc Params:          |
> NULL                                                            |
> NULL                  |
> |                               |
> serialization.format                                            |
> 1                     |
>
> +-------------------------------+-----------------------------------------------------------------+-----------------------+--+
>
> Hopefully that will turn off the autogather feature for existing tables.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 April 2016 at 23:32, Udit Mehta <[email protected]> wrote:
>
>> Hi,
>>
>> Thanks for the replies.
>> We have a scenario where we have an ETL job inserting into a table with
>> thousands of partitions using dynamic partitioning. We have certain SLA's
>> within which we would like the job to finish and sometimes there are
>> scenarios where they are missed (extra data or a busy cluster). I
>> understand that stats are essential for Hive CBO but we are trying to
>> explore how much overhead do these stats collection add to the job runtime.
>> A lot of these tables are intermediary tables so having stats for them
>> might not be entirely necessary.
>>
>> I just wanted to figure if there was a easy way to disable the stats and
>> then compare the performance.
>>
>> Mich, can you give more information on how to disable it in the table
>> struct as I cant find any documentation on it.
>>
>> Thanks again.
>> Udit
>>
>> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <[email protected]>
>> wrote:
>>
>>> Hi Udit,
>>>
>>>     Could u be more specific about your problem? Like, what settings you
>>> have, what query you run and what is the result and what result do you
>>> expect?
>>>
>>>     From what you said, my understanding is that, you want to wipe out
>>> the basic stats for existing tables? And, could u also let us know why you
>>> would like to get rid of the stats? Stats is crucial for Hive CBO to work
>>> and we are moving towards the direction to make table/column stats
>>> collection automatically. It seems that you prefer an opposite direction.
>>> There is nothing wrong here and we would like to listen to your idea and
>>> motivation so that we can better design Hive stats collection. Thanks!
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <[email protected]> wrote:
>>>
>>>> Any insights on this?
>>>>
>>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <[email protected]> wrote:
>>>>
>>>>> Update: Realized this works if we create a fresh table with this
>>>>> config already disabled but does not work if there is already a table
>>>>> created when this config was enabled. We now need to figure out how to
>>>>> disable this config for a table created when this config was true.
>>>>>
>>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hive version we are using is 1.2.1.
>>>>>>
>>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We need to disable the Hive autogather stats optimization by
>>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config
>>>>>>> change doesnt seem to go through. We modified this config in the
>>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change
>>>>>>> explicitly in the job but it doesnt seem to help.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *set hive.stats.autogather=false;*
>>>>>>> Does anyone know the right way to disable this config since we dont
>>>>>>> want to compute stats in out jobs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Udit
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Disable Hive autogather optimization

Reply via email to