apologies should read "Udit" Dr Mich Talebzadeh
LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 30 April 2016 at 00:35, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi Unit, > > *For new tables* > > Disable stats autogathering in Hive when creating a new table and > populating it > > SET hive.stats.autogather=false; > > *Already existing tables* > > As a work-around you can try this on the already existing tables by manually > alter the numRows to -1 > > ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES > ('numRows'='-1'); > > Example > > 0: jdbc:hive2://rhes564:10010/default> create table testme as select * > from sales_staging limit 1000; > > 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; > > > +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ > | col_name | > data_type | comment | > > +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ > | # col_name | > data_type | > comment | > | | > NULL | > NULL | > | prod_id | > double > | | > | cust_id | > double > | | > | time_id | > string > | | > | channel_id | > double > | | > | promo_id | > double > | | > | quantity_sold | > double > | | > | amount_sold | > double > | | > | | > NULL | > NULL | > | # Detailed Table Information | > NULL | > NULL | > | Database: | > oraclehadoop | > NULL | > | Owner: | > hduser | > NULL | > | CreateTime: | Sat Apr 30 00:31:17 BST > 2016 | NULL | > | LastAccessTime: | > UNKNOWN | > NULL | > | Retention: | > 0 | > NULL | > | Location: | > hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | > NULL | > | Table Type: | > MANAGED_TABLE | > NULL | > | Table Parameters: | > NULL | > NULL | > | | > COLUMN_STATS_ACCURATE | > {\"BASIC_STATS\":\"true\"} | > | | > numFiles | > 1 | > | | > numRows | > 1000 | > | | > rawDataSize | > 54853 | > | | > totalSize | > 55853 | > | | > transient_lastDdlTime | > 1461972677 | > | | > NULL | > NULL | > | # Storage Information | > NULL | > NULL | > | SerDe Library: | > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | > NULL | > | InputFormat: | > org.apache.hadoop.mapred.TextInputFormat | > NULL | > | OutputFormat: | > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | > NULL | > | Compressed: | > No | > NULL | > | Num Buckets: | > -1 | > NULL | > | Bucket Columns: | > [] | > NULL | > | Sort Columns: | > [] | > NULL | > | Storage Desc Params: | > NULL | > NULL | > | | > serialization.format | > 1 | > > +-------------------------------+-----------------------------------------------------------------+-----------------------------+--+ > > 0: jdbc:hive2://rhes564:10010/default>* ALTER TABLE testme SET > TBLPROPERTIES ('numRows'='-1');* > > 0: jdbc:hive2://rhes564:10010/default> desc formatted testme; > > > +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ > | col_name | > data_type | comment | > > +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ > | # col_name | > data_type | > comment | > | | > NULL | > NULL | > | prod_id | > double > | | > | cust_id | > double > | | > | time_id | > string > | | > | channel_id | > double > | | > | promo_id | > double > | | > | quantity_sold | > double > | | > | amount_sold | > double > | | > | | > NULL | > NULL | > | # Detailed Table Information | > NULL | > NULL | > | Database: | > oraclehadoop | > NULL | > | Owner: | > hduser | > NULL | > | CreateTime: | Sat Apr 30 00:31:17 BST > 2016 | NULL | > | LastAccessTime: | > UNKNOWN | > NULL | > | Retention: | > 0 | > NULL | > | Location: | > hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/testme | > NULL | > | Table Type: | > MANAGED_TABLE | > NULL | > | Table Parameters: | > NULL | > NULL | > | | > last_modified_by | > hduser | > | | > last_modified_time | > 1461973002 | > | | > numFiles | > 1 | > | | > numRows | > -1 | > | | > rawDataSize | > 54853 | > | | > totalSize | > 55853 | > | | > transient_lastDdlTime | > 1461973002 | > | | > NULL | > NULL | > | # Storage Information | > NULL | > NULL | > | SerDe Library: | > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | > NULL | > | InputFormat: | > org.apache.hadoop.mapred.TextInputFormat | > NULL | > | OutputFormat: | > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | > NULL | > | Compressed: | > No | > NULL | > | Num Buckets: | > -1 | > NULL | > | Bucket Columns: | > [] | > NULL | > | Sort Columns: | > [] | > NULL | > | Storage Desc Params: | > NULL | > NULL | > | | > serialization.format | > 1 | > > +-------------------------------+-----------------------------------------------------------------+-----------------------+--+ > > Hopefully that will turn off the autogather feature for existing tables. > > HTH > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 29 April 2016 at 23:32, Udit Mehta <ume...@groupon.com> wrote: > >> Hi, >> >> Thanks for the replies. >> We have a scenario where we have an ETL job inserting into a table with >> thousands of partitions using dynamic partitioning. We have certain SLA's >> within which we would like the job to finish and sometimes there are >> scenarios where they are missed (extra data or a busy cluster). I >> understand that stats are essential for Hive CBO but we are trying to >> explore how much overhead do these stats collection add to the job runtime. >> A lot of these tables are intermediary tables so having stats for them >> might not be entirely necessary. >> >> I just wanted to figure if there was a easy way to disable the stats and >> then compare the performance. >> >> Mich, can you give more information on how to disable it in the table >> struct as I cant find any documentation on it. >> >> Thanks again. >> Udit >> >> On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxi...@apache.org> >> wrote: >> >>> Hi Udit, >>> >>> Could u be more specific about your problem? Like, what settings you >>> have, what query you run and what is the result and what result do you >>> expect? >>> >>> From what you said, my understanding is that, you want to wipe out >>> the basic stats for existing tables? And, could u also let us know why you >>> would like to get rid of the stats? Stats is crucial for Hive CBO to work >>> and we are moving towards the direction to make table/column stats >>> collection automatically. It seems that you prefer an opposite direction. >>> There is nothing wrong here and we would like to listen to your idea and >>> motivation so that we can better design Hive stats collection. Thanks! >>> >>> Best >>> Pengcheng >>> >>> >>> On Thu, Apr 28, 2016 at 4:12 PM, Udit Mehta <ume...@groupon.com> wrote: >>> >>>> Any insights on this? >>>> >>>> On Tue, Apr 26, 2016 at 7:32 PM, Udit Mehta <ume...@groupon.com> wrote: >>>> >>>>> Update: Realized this works if we create a fresh table with this >>>>> config already disabled but does not work if there is already a table >>>>> created when this config was enabled. We now need to figure out how to >>>>> disable this config for a table created when this config was true. >>>>> >>>>> On Tue, Apr 26, 2016 at 6:16 PM, Udit Mehta <ume...@groupon.com> >>>>> wrote: >>>>> >>>>>> Hive version we are using is 1.2.1. >>>>>> >>>>>> On Tue, Apr 26, 2016 at 6:01 PM, Udit Mehta <ume...@groupon.com> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> We need to disable the Hive autogather stats optimization by >>>>>>> disabling "*hive.stats.autogather*" but for some reason, the config >>>>>>> change doesnt seem to go through. We modified this config in the >>>>>>> hive-site.xml and restarted the Hive metastore. We also made this change >>>>>>> explicitly in the job but it doesnt seem to help. >>>>>>> >>>>>>> >>>>>>> >>>>>>> *set hive.stats.autogather=false;* >>>>>>> Does anyone know the right way to disable this config since we dont >>>>>>> want to compute stats in out jobs. >>>>>>> >>>>>>> Thanks, >>>>>>> Udit >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >