Re: column statistics for non-primitive types

Michael Häusler Tue, 14 Jun 2016 13:03:55 -0700

Hi there,

there might be two topics here:


1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of 
types:
numeric values: min, max, #nulls, #distincts
boolean values: #nulls, #trues, #falses
string values:          #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for 
top-level non-primitive columns, e.g.:
array values:           #nulls, #distincts, avgLength, maxLength 
map values:             #nulls, #distincts, avgLength, maxLength
struct values:          #nulls, #distincts
union values:           #nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the 
convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo 
COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column 
statistics for hive users.

Best regards
Michael



> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi Michael,
> 
> Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.
> 
> When I am looking at this table in Oracle, I only see statistics for 
> primitives columns here. STRUCT columns do not have it as a STRUCT column 
> will have to be broken into its primitive columns.  I don't think Hive has 
> the means to do that.
> 
> desc tab_col_stats;
>  Name                                                                     
> Null?    Type
>  ------------------------------------------------------------------------ 
> -------- -------------------------------------------------
>  CS_ID                                                                    NOT 
> NULL NUMBER
>  DB_NAME                                                                  NOT 
> NULL VARCHAR2(128)
>  TABLE_NAME                                                               NOT 
> NULL VARCHAR2(128)
>  COLUMN_NAME                                                              NOT 
> NULL VARCHAR2(1000)
>  COLUMN_TYPE                                                              NOT 
> NULL VARCHAR2(128)
>  TBL_ID                                                                   NOT 
> NULL NUMBER
>  LONG_LOW_VALUE                                                               
>      NUMBER
>  LONG_HIGH_VALUE                                                              
>      NUMBER
>  DOUBLE_LOW_VALUE                                                             
>      NUMBER
>  DOUBLE_HIGH_VALUE                                                            
>      NUMBER
>  BIG_DECIMAL_LOW_VALUE                                                        
>      VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE                                                       
>      VARCHAR2(4000)
>  NUM_NULLS                                                                NOT 
> NULL NUMBER
>  NUM_DISTINCTS                                                                
>      NUMBER
>  AVG_COL_LEN                                                                  
>      NUMBER
>  MAX_COL_LEN                                                                  
>      NUMBER
>  NUM_TRUES                                                                    
>      NUMBER
>  NUM_FALSES                                                                   
>      NUMBER
>  LAST_ANALYZED                                                            NOT 
> NULL NUMBER
> 
> 
> 
>  So in summary although column type STRUCT do exit, I don't think Hive can 
> cater for their statistics. Actually I don't think Oracle itself does it.
> 
> HTH
> 
> P.S. I am on Hive 2 and it does not.
> 
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
> but array<bigint> is passed.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de 
> <mailto:mich...@akatose.de>> wrote:
> Hi there,
> 
> you can reproduce the messages below with Hive 1.2.1.
> 
> Best regards
> Michael
> 
> 
>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> which version of Hive are you using?
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de 
>> <mailto:mich...@akatose.de>> wrote:
>> Hi there,
>> 
>> 
>> when testing column statistics I stumbled upon the following error message:
>> 
>> DROP TABLE IF EXISTS foo;
>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar 
>> STRUCT<key:STRING,value:STRING>);
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
>> but array<bigint> is passed.
>> 
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are accepted 
>> but struct<key:string,value:string> is passed.
>> 
>> 
>> 1) Basically, it seems that column statistics don't work for non-primitive 
>> types. Are there any workarounds or any plans to change this?
>> 
>> 2) Furthermore, the convenience syntax to compute statistics for all columns 
>> does not work as soon as there is a non-supported column. Are there any 
>> plans to change this, so it is easier to compute statistics for all 
>> supported columns?
>> 
>> 3) ANALYZE TABLE will only provide the first failing *type* in the error 
>> message. Especially for wide tables it would be much easier if all 
>> non-supported column *names* would be printed.
>> 
>> 
>> Best regards
>> Michael
>> 
>> 
> 
>

Re: column statistics for non-primitive types

Reply via email to