RE: column statistics for non-primitive types

Markovitz, Dudu Wed, 15 Jun 2016 05:30:33 -0700

Hi Michael

Case ‘b’ (“answer query directly”) seems to be risky in an open system.
Files/directories can be deleted directly in the filesystem without Hive having 
any knowledge about it which will lead to wrong queries results.


Dudu

From: Michael Häusler [mailto:mich...@akatose.de]
Sent: Tuesday, June 14, 2016 11:43 PM
To: user@hive.apache.org
Subject: Re: column statistics for non-primitive types

Hi Pengcheng,

(1)
statistics on non-primitive columns can be just as useful as on primitive 
columns, e.g.,
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id BIGINT, someArray ARRAY<BIGINT>, someStruct 
STRUCT<key:STRING,value:STRING>);

a) query optimization
Let foo be a huge table that needs to be joined with another huge table bar 
like this

SELECT
            f.id
FROM
            foo f
JOIN
            bar b
ON
            f.id = b.id
WHERE
            f.someArray IS NOT NULL

If statistics tell us that #nulls in someArray is small, we could apply a 
different join strategy (e.g., map-side join, bar main table, filtered foo as 
hash table)

b) answer query directly

SELECT
            COUNT(DISTINCT someStruct)
FROM
            foo;

Such a query can easily be answered directly from stats.



(2)

Do you happen to know, whether HIVE-11160 also works for CTAS?
Because a quick test of the configuration property did not work for me:

hive> SET hive.stats.fetch.column.stats=true;
hive> DROP TABLE IF EXISTS foo;
OK
Time taken: 6.585 seconds
hive> CREATE TABLE foo AS
    > SELECT
    >     1 AS foo;
Query ID = haeusler_20160614203002_7a47459d-349b-4012-ac7f-b2cc867b87ef
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id 
application_1465334589772_15920)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 2.89 s
--------------------------------------------------------------------------------
Moving data to: hdfs://invcluster/user/hive/warehouse/haeusler.db/foo
Table haeusler.foo stats: [numFiles=1, numRows=1, totalSize=194, rawDataSize=4]
OK
Time taken: 8.088 seconds
hive> DESCRIBE FORMATTED foo.foo;
OK
# col_name              data_type               min                     max     
                num_nulls               distinct_count          avg_col_len     
        max_col_len             num_trues               num_falses              
comment

foo                     int                                                     
                                                                                
                                                                                
from deserializer
Time taken: 0.197 seconds, Fetched: 3 row(s)

^^^ the table creation works, but I don't get any column stats.


Best regards
Michael



On 2016-06-14, at 22:23, Pengcheng Xiong 
<pxi...@apache.org<mailto:pxi...@apache.org>> wrote:

Hi Michael,

    (1) We collect columns stats for the following purpose (a) Query 
optimization, esp. join reordering and big/small table size estimation. More 
recently, we also use it to remove filters. You can refer to Calcite rules. (b) 
Answer query directly through metaStore. You can refer to the configuration of 
HIVEOPTIMIZEMETADATAQUERIES("hive.compute.query.using.stats").

    We can do stats for non-primitive columns, but we need to know the 
motivation to do so before we do it. If you can, could you please list some?

   (2) There is a configuration "hive.stats.fetch.column.stats". If you set it 
to true, it will automatically collect column stats for you when you insert 
into/overwrite a new table. You can refer to HIVE-11160 for more details.

   Hope my answers help.

Thanks

Best.
Pengcheng


On Tue, Jun 14, 2016 at 1:03 PM, Michael Häusler 
<mich...@akatose.de<mailto:mich...@akatose.de>> wrote:
Hi there,

there might be two topics here:

1) feasibility of stats for non-primitive columns
2) ease of use


1) feasibility of stats for non-primitive columns:

Hive currently collects different kind of statistics for different kind of 
types:
numeric values: min, max, #nulls, #distincts
boolean values: #nulls, #trues, #falses
string values: #nulls, #distincts, avgLength, maxLength

So, it seems quite possible to also collect at least partial stats for 
top-level non-primitive columns, e.g.:
array values: #nulls, #distincts, avgLength, maxLength
map values: #nulls, #distincts, avgLength, maxLength
struct values: #nulls, #distincts
union values: #nulls, #distincts


2) ease of use

The presence of a single non-primitive column currently breaks the use of the 
convenience shorthand to gather statistics for all columns (ANALYZE TABLE foo 
COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption of column 
statistics for hive users.

Best regards
Michael



On 2016-06-14, at 12:04, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

Hi Michael,

Statistics for columns in Hive are kept in Hive metadata table tab_col_stats.

When I am looking at this table in Oracle, I only see statistics for primitives 
columns here. STRUCT columns do not have it as a STRUCT column will have to be 
broken into its primitive columns.  I don't think Hive has the means to do that.

desc tab_col_stats;
 Name                                                                     Null? 
   Type
 ------------------------------------------------------------------------ 
-------- -------------------------------------------------
 CS_ID                                                                    NOT 
NULL NUMBER
 DB_NAME                                                                  NOT 
NULL VARCHAR2(128)
 TABLE_NAME                                                               NOT 
NULL VARCHAR2(128)
 COLUMN_NAME                                                              NOT 
NULL VARCHAR2(1000)
 COLUMN_TYPE                                                              NOT 
NULL VARCHAR2(128)
 TBL_ID                                                                   NOT 
NULL NUMBER
 LONG_LOW_VALUE                                                                 
   NUMBER
 LONG_HIGH_VALUE                                                                
   NUMBER
 DOUBLE_LOW_VALUE                                                               
   NUMBER
 DOUBLE_HIGH_VALUE                                                              
   NUMBER
 BIG_DECIMAL_LOW_VALUE                                                          
   VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE                                                         
   VARCHAR2(4000)
 NUM_NULLS                                                                NOT 
NULL NUMBER
 NUM_DISTINCTS                                                                  
   NUMBER
 AVG_COL_LEN                                                                    
   NUMBER
 MAX_COL_LEN                                                                    
   NUMBER
 NUM_TRUES                                                                      
   NUMBER
 NUM_FALSES                                                                     
   NUMBER
 LAST_ANALYZED                                                            NOT 
NULL NUMBER



 So in summary although column type STRUCT do exit, I don't think Hive can 
cater for their statistics. Actually I don't think Oracle itself does it.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo compute statistics for columns;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but 
array<bigint> is passed.


Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 14 June 2016 at 09:57, Michael Häusler 
<mich...@akatose.de<mailto:mich...@akatose.de>> wrote:
Hi there,

you can reproduce the messages below with Hive 1.2.1.

Best regards
Michael


On 2016-06-13, at 22:21, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

which version of Hive are you using?

Dr Mich Talebzadeh

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>


On 13 June 2016 at 16:00, Michael Häusler 
<mich...@akatose.de<mailto:mich...@akatose.de>> wrote:
Hi there,


when testing column statistics I stumbled upon the following error message:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar 
STRUCT<key:STRING,value:STRING>);

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but 
array<bigint> is passed.

ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but 
struct<key:string,value:string> is passed.


1) Basically, it seems that column statistics don't work for non-primitive 
types. Are there any workarounds or any plans to change this?

2) Furthermore, the convenience syntax to compute statistics for all columns 
does not work as soon as there is a non-supported column. Are there any plans 
to change this, so it is easier to compute statistics for all supported columns?

3) ANALYZE TABLE will only provide the first failing *type* in the error 
message. Especially for wide tables it would be much easier if all 
non-supported column *names* would be printed.


Best regards
Michael

RE: column statistics for non-primitive types

Reply via email to