Hi Michael Case ‘b’ (“answer query directly”) seems to be risky in an open system. Files/directories can be deleted directly in the filesystem without Hive having any knowledge about it which will lead to wrong queries results.
Dudu From: Michael Häusler [mailto:mich...@akatose.de] Sent: Tuesday, June 14, 2016 11:43 PM To: user@hive.apache.org Subject: Re: column statistics for non-primitive types Hi Pengcheng, (1) statistics on non-primitive columns can be just as useful as on primitive columns, e.g., DROP TABLE IF EXISTS foo; CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct STRUCT<key:STRING,value:STRING>); a) query optimization Let foo be a huge table that needs to be joined with another huge table bar like this SELECT f.id FROM foo f JOIN bar b ON f.id = b.id WHERE f.someArray IS NOT NULL If statistics tell us that #nulls in someArray is small, we could apply a different join strategy (e.g., map-side join, bar main table, filtered foo as hash table) b) answer query directly SELECT COUNT(DISTINCT someStruct) FROM foo; Such a query can easily be answered directly from stats. (2) Do you happen to know, whether HIVE-11160 also works for CTAS? Because a quick test of the configuration property did not work for me: hive> SET hive.stats.fetch.column.stats=true; hive> DROP TABLE IF EXISTS foo; OK Time taken: 6.585 seconds hive> CREATE TABLE foo AS > SELECT > 1 AS foo; Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1465334589772_15920) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 1 1 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 2.89 s -------------------------------------------------------------------------------- Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4] OK Time taken: 8.088 seconds hive> DESCRIBE FORMATTED foo.foo; OK # col_name data_type min max num_nulls distinct_count avg_col_len max_col_len num_trues num_falses comment foo int from deserializer Time taken: 0.197 seconds, Fetched: 3 row(s) ^^^ the table creation works, but I don't get any column stats. Best regards Michael On 2016-06-14, at 22:23, Pengcheng Xiong <pxi...@apache.org<mailto:pxi...@apache.org>> wrote: Hi Michael, (1) We collect columns stats for the following purpose (a) Query optimization, esp. join reordering and big/small table size estimation. More recently, we also use it to remove filters. You can refer to Calcite rules. (b) Answer query directly through metaStore. You can refer to the configuration of HIVEOPTIMIZEMETADATAQUERIES("hive.compute.query.using.stats"). We can do stats for non-primitive columns, but we need to know the motivation to do so before we do it. If you can, could you please list some? (2) There is a configuration "hive.stats.fetch.column.stats". If you set it to true, it will automatically collect column stats for you when you insert into/overwrite a new table. You can refer to HIVE-11160 for more details. Hope my answers help. Thanks Best. Pengcheng On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler <mich...@akatose.de<mailto:mich...@akatose.de>> wrote: Hi there, there might be two topics here: 1) feasibility of stats for non-primitive columns 2) ease of use 1) feasibility of stats for non-primitive columns: Hive currently collects different kind of statistics for different kind of types: numeric values: min, max, #nulls, #distincts boolean values: #nulls, #trues, #falses string values: #nulls, #distincts, avgLength, maxLength So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.: array values: #nulls, #distincts, avgLength, maxLength map values: #nulls, #distincts, avgLength, maxLength struct values: #nulls, #distincts union values: #nulls, #distincts 2) ease of use The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users. Best regards Michael On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote: Hi Michael, Statistics for columns in Hive are kept in Hive metadata table tab_col_stats. When I am looking at this table in Oracle, I only see statistics for primitives columns here. STRUCT columns do not have it as a STRUCT column will have to be broken into its primitive columns. I don't think Hive has the means to do that. desc tab_col_stats; Name Null? Type ------------------------------------------------------------------------ -------- ------------------------------------------------- CS_ID NOT NULL NUMBER DB_NAME NOT NULL VARCHAR2(128) TABLE_NAME NOT NULL VARCHAR2(128) COLUMN_NAME NOT NULL VARCHAR2(1000) COLUMN_TYPE NOT NULL VARCHAR2(128) TBL_ID NOT NULL NUMBER LONG_LOW_VALUE NUMBER LONG_HIGH_VALUE NUMBER DOUBLE_LOW_VALUE NUMBER DOUBLE_HIGH_VALUE NUMBER BIG_DECIMAL_LOW_VALUE VARCHAR2(4000) BIG_DECIMAL_HIGH_VALUE VARCHAR2(4000) NUM_NULLS NOT NULL NUMBER NUM_DISTINCTS NUMBER AVG_COL_LEN NUMBER MAX_COL_LEN NUMBER NUM_TRUES NUMBER NUM_FALSES NUMBER LAST_ANALYZED NOT NULL NUMBER So in summary although column type STRUCT do exit, I don't think Hive can cater for their statistics. Actually I don't think Oracle itself does it. HTH P.S. I am on Hive 2 and it does not. hive> analyze table foo compute statistics for columns; FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de<mailto:mich...@akatose.de>> wrote: Hi there, you can reproduce the messages below with Hive 1.2.1. Best regards Michael On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote: which version of Hive are you using? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de<mailto:mich...@akatose.de>> wrote: Hi there, when testing column statistics I stumbled upon the following error message: DROP TABLE IF EXISTS foo; CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar STRUCT<key:STRING,value:STRING>); ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS; FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but array<bigint> is passed. ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar; FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but struct<key:string,value:string> is passed. 1) Basically, it seems that column statistics don't work for non-primitive types. Are there any workarounds or any plans to change this? 2) Furthermore, the convenience syntax to compute statistics for all columns does not work as soon as there is a non-supported column. Are there any plans to change this, so it is easier to compute statistics for all supported columns? 3) ANALYZE TABLE will only provide the first failing *type* in the error message. Especially for wide tables it would be much easier if all non-supported column *names* would be printed. Best regards Michael