Re: column statistics for non-primitive types

Mich Talebzadeh Tue, 14 Jun 2016 14:32:28 -0700

Hi,

there is another approach to reduce time for analyzing stats and that is
sampling, i.e. looking at the fraction of data. For example in Oracle one
can do that


EXEC 
DBMS_STATS.GATHER_TABLE_STATS(ownname=>user,tabname=>'T1',*estimate_percent=>
50%*)

In general Hive statistics is pretty straight forward. This is because
unless a table is ORC and transactional then update/deletes won't happen.
Only new inserts for most of data (immutable).

My opinion on this is also mixed. Does not apply to Hive. Most RDNMS
provide something like datachange() function that recommends analysing
table if the underlying table size is changed.

I gather with Hive any new insert will trigger an automatic analyze stats.
new data coming in to an existing table with data does not imply lack of
quality statistics. If the distribution remain more and less the same for a
given column, updating statistics is not going to make that much
difference. I would rather spend more time on making external indexes
useful for the optimizer.


 HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:05, Pengcheng Xiong <pxi...@apache.org> wrote:

> Hi Mich,
>
>     I agree with you that column stats gathering in Hive is not cheap and
> comes with overheads. This is due to the large volume of data that Hive has
> to process. However, this is the price you have to pay anyway even with
> current "analyze table" solution.
>
>    The new feature not only provides a way to make users have column stats
> automatically, but also saves overhead for the "insert into" case. In this
> case, the new column stats are generated incrementally, i.e., by merging
> with the existing stats. Without this feature, you have to scan the whole
> table and compute stats.
>
>    In conclusion, this new feature should not have any more overhead than
> the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hi,
>>
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>> it to true, it will automatically collect column stats for you when you
>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>> details.
>>
>> Not without its overheads.
>>
>> Automatic gather stats is not new. Has been around for a good time in
>> RDBMS and can impact the performance of other queries running. So I am not
>> sure it can be considered as blessing.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:25, Pengcheng Xiong <pxi...@apache.org> wrote:
>>
>>> Exactly, "the useful or meaningful these statistics is going to be" (The
>>> motivation behind).
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> My point was we are where we are and in this juncture there is no
>>>> collection of statistics for complex columns. That may be a future
>>>> enhancement.
>>>>
>>>> But then the obvious question is how useful or meaningful these
>>>> statistics is going to be?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 21:03, Michael Häusler <mich...@akatose.de> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there might be two topics here:
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns
>>>>> 2) ease of use
>>>>>
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>
>>>>> Hive currently collects different kind of statistics for different
>>>>> kind of types:
>>>>> numeric values: min, max, #nulls, #distincts
>>>>> boolean values: #nulls, #trues, #falses
>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>
>>>>> So, it seems quite possible to also collect at least partial stats for
>>>>> top-level non-primitive columns, e.g.:
>>>>> array values: #nulls, #distincts, avgLength, maxLength
>>>>> map values: #nulls, #distincts, avgLength, maxLength
>>>>> struct values: #nulls, #distincts
>>>>> union values: #nulls, #distincts
>>>>>
>>>>>
>>>>> 2) ease of use
>>>>>
>>>>> The presence of a single non-primitive column currently breaks the use
>>>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>>>> of column statistics for hive users.
>>>>>
>>>>> Best regards
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 2016-06-14, at 12:04, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Statistics for columns in Hive are kept in Hive metadata table
>>>>> tab_col_stats.
>>>>>
>>>>> When I am looking at this table in Oracle, I only see statistics for
>>>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>>>> will have to be broken into its primitive columns.  I don't think Hive has
>>>>> the means to do that.
>>>>>
>>>>> desc tab_col_stats;
>>>>>  Name
>>>>> Null?    Type
>>>>>  ------------------------------------------------------------------------
>>>>> -------- -------------------------------------------------
>>>>>  CS_ID
>>>>> NOT NULL NUMBER
>>>>>  DB_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TABLE_NAME
>>>>> NOT NULL VARCHAR2(128)
>>>>>  COLUMN_NAME
>>>>> NOT NULL VARCHAR2(1000)
>>>>>  COLUMN_TYPE
>>>>> NOT NULL VARCHAR2(128)
>>>>>  TBL_ID
>>>>> NOT NULL NUMBER
>>>>>  LONG_LOW_VALUE
>>>>> NUMBER
>>>>>  LONG_HIGH_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_LOW_VALUE
>>>>> NUMBER
>>>>>  DOUBLE_HIGH_VALUE
>>>>> NUMBER
>>>>>  BIG_DECIMAL_LOW_VALUE
>>>>> VARCHAR2(4000)
>>>>>  BIG_DECIMAL_HIGH_VALUE
>>>>> VARCHAR2(4000)
>>>>>  NUM_NULLS
>>>>> NOT NULL NUMBER
>>>>>  NUM_DISTINCTS
>>>>> NUMBER
>>>>>  AVG_COL_LEN
>>>>> NUMBER
>>>>>  MAX_COL_LEN
>>>>> NUMBER
>>>>>  NUM_TRUES
>>>>> NUMBER
>>>>>  NUM_FALSES
>>>>> NUMBER
>>>>>  LAST_ANALYZED
>>>>> NOT NULL NUMBER
>>>>>
>>>>>
>>>>>
>>>>>  So in summary although column type STRUCT do exit, I don't think Hive
>>>>> can cater for their statistics. Actually I don't think Oracle itself does
>>>>> it.
>>>>>
>>>>> HTH
>>>>>
>>>>> P.S. I am on Hive 2 and it does not.
>>>>>
>>>>> hive> analyze table foo compute statistics for columns;
>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>> accepted but array<bigint> is passed.
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 14 June 2016 at 09:57, Michael Häusler <mich...@akatose.de> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> you can reproduce the messages below with Hive 1.2.1.
>>>>>>
>>>>>> Best regards
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>> On 2016-06-13, at 22:21, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> which version of Hive are you using?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 13 June 2016 at 16:00, Michael Häusler <mich...@akatose.de> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>>
>>>>>>> when testing column statistics I stumbled upon the following error
>>>>>>> message:
>>>>>>>
>>>>>>> DROP TABLE IF EXISTS foo;
>>>>>>> CREATE TABLE foo (foo BIGINT, bar ARRAY<BIGINT>, foobar
>>>>>>> STRUCT<key:STRING,value:STRING>);
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but array<bigint> is passed.
>>>>>>>
>>>>>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>>>>>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>>>>>> accepted but struct<key:string,value:string> is passed.
>>>>>>>
>>>>>>>
>>>>>>> 1) Basically, it seems that column statistics don't work for
>>>>>>> non-primitive types. Are there any workarounds or any plans to change 
>>>>>>> this?
>>>>>>>
>>>>>>> 2) Furthermore, the convenience syntax to compute statistics for all
>>>>>>> columns does not work as soon as there is a non-supported column. Are 
>>>>>>> there
>>>>>>> any plans to change this, so it is easier to compute statistics for all
>>>>>>> supported columns?
>>>>>>>
>>>>>>> 3) ANALYZE TABLE will only provide the first failing *type* in the
>>>>>>> error message. Especially for wide tables it would be much easier if all
>>>>>>> non-supported column *names* would be printed.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: column statistics for non-primitive types

Reply via email to