Hi there, there might be two topics here:
1) feasibility of stats for non-primitive columns 2) ease of use 1) feasibility of stats for non-primitive columns: Hive currently collects different kind of statistics for different kind of types: numeric values: min, max, #nulls, #distincts boolean values: #nulls, #trues, #falses string values: #nulls, #distincts, avgLength, maxLength So, it seems quite possible to also collect at least partial stats for top-level non-primitive columns, e.g.: array values: #nulls, #distincts, avgLength, maxLength map values: #nulls, #distincts, avgLength, maxLength struct values: #nulls, #distincts union values: #nulls, #distincts 2) ease of use The presence of a single non-primitive column currently breaks the use of the convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column statistics for hive users. Best regards Michael > On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi Michael, > > Statistics for columns in Hive are kept in Hive metadata table tab_col_stats. > > When I am looking at this table in Oracle, I only see statistics for > primitives columns here. STRUCT columns do not have it as a STRUCT column > will have to be broken into its primitive columns. I don't think Hive has > the means to do that. > > desc tab_col_stats; > Name > Null? Type > ------------------------------------------------------------------------ > -------- ------------------------------------------------- > CS_ID NOT > NULL NUMBER > DB_NAME NOT > NULL VARCHAR2(128) > TABLE_NAME NOT > NULL VARCHAR2(128) > COLUMN_NAME NOT > NULL VARCHAR2(1000) > COLUMN_TYPE NOT > NULL VARCHAR2(128) > TBL_ID NOT > NULL NUMBER > LONG_LOW_VALUE > NUMBER > LONG_HIGH_VALUE > NUMBER > DOUBLE_LOW_VALUE > NUMBER > DOUBLE_HIGH_VALUE > NUMBER > BIG_DECIMAL_LOW_VALUE > VARCHAR2(4000) > BIG_DECIMAL_HIGH_VALUE > VARCHAR2(4000) > NUM_NULLS NOT > NULL NUMBER > NUM_DISTINCTS > NUMBER > AVG_COL_LEN > NUMBER > MAX_COL_LEN > NUMBER > NUM_TRUES > NUMBER > NUM_FALSES > NUMBER > LAST_ANALYZED NOT > NULL NUMBER > > > > So in summary although column type STRUCT do exit, I don't think Hive can > cater for their statistics. Actually I don't think Oracle itself does it. > > HTH > > P.S. I am on Hive 2 and it does not. > > hive> analyze table foo compute statistics for columns; > FAILED: UDFArgumentTypeException Only primitive type arguments are accepted > but array<bigint> is passed. > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de > <mailto:mich...@akatose.de>> wrote: > Hi there, > > you can reproduce the messages below with Hive 1.2.1. > > Best regards > Michael > > >> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com >> <mailto:mich.talebza...@gmail.com>> wrote: >> >> which version of Hive are you using? >> >> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >> >> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> >> >> >> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de >> <mailto:mich...@akatose.de>> wrote: >> Hi there, >> >> >> when testing column statistics I stumbled upon the following error message: >> >> DROP TABLE IF EXISTS foo; >> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar >> STRUCT<key:STRING,value:STRING>); >> >> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS; >> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted >> but array<bigint> is passed. >> >> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar; >> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted >> but struct<key:string,value:string> is passed. >> >> >> 1) Basically, it seems that column statistics don't work for non-primitive >> types. Are there any workarounds or any plans to change this? >> >> 2) Furthermore, the convenience syntax to compute statistics for all columns >> does not work as soon as there is a non-supported column. Are there any >> plans to change this, so it is easier to compute statistics for all >> supported columns? >> >> 3) ANALYZE TABLE will only provide the first failing *type* in the error >> message. Especially for wide tables it would be much easier if all >> non-supported column *names* would be printed. >> >> >> Best regards >> Michael >> >> > >