OK, let us take a deeper look here.

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

In the above, we are explicitly grouping columns c1 and c2 together for which
we want to compute statistics. Additionally, we are also computing
statistics for column c3 independently. This approach allows the CBO to
treat columns c1 and c2 as a group and compute joint statistics for them,
while computing separate statistics for column c3.
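
To illustrate (a hypothetical sketch reusing the mytable/c1/c2 names from the
example above; note that the grouped (c1, c2) form is the proposed extension
being discussed in this thread, not something current Spark accepts), joint
statistics would mainly help with correlated predicates such as:

-- With only independent per-column NDVs, the optimizer multiplies the two
-- selectivities and can badly misestimate the row count when c1 and c2 are
-- correlated (e.g. city and postcode)
SELECT *
FROM mytable
WHERE c1 = 'London'
  AND c2 = 'SW1A 1AA';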

If columns c1 and c2 are frequently used together in conditions, I
concur it makes sense to compute joint statistics for them by using the
above syntax. On the other hand, if each column has its own significance
and the relationship between them is not crucial, we can use

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3

This syntax can be used to compute separate statistics for each column.
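
For reference, a minimal sketch with the same placeholder table, using only
forms that current Spark SQL already accepts (to the best of my knowledge):

-- The CBO must be enabled for the optimizer to make use of column stats
SET spark.sql.cbo.enabled=true;
-- Optionally collect equi-height histograms as well
SET spark.sql.statistics.histogram.enabled=true;

-- Independent statistics for each column
ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3;

-- Inspect the collected column statistics (min, max, distinct count, nulls, ...)
DESCRIBE EXTENDED mytable c1;

-- Show the optimized plan with the row-count and size estimates derived
-- from those statistics
EXPLAIN COST SELECT * FROM mytable WHERE c1 = 'London' AND c2 = 'SW1A 1AA';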

So your mileage may vary.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Aug 2023 at 12:14, Chetan <chetansuttra...@gmail.com> wrote:

> Hi,
>
> If we are taking this up, then I would ask: can we support multi-column stats
> such as:
> ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3
> This should help in producing better estimates for conditions involving c1 and c2.
>
> Thanks.
>
> On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Short answer off the top of my head:
>>
>> My point was with regard to the Cost Based Optimizer (CBO) in traditional
>> databases. The concept of a rowkey in HBase is somewhat similar to that of
>> a primary key in an RDBMS.
>> Now, in databases with automatic deduplication features (i.e. those that
>> ignore duplicate rowkeys), inserting 100 rows with the same rowkey actually
>> results in only one physical entry in the database due to deduplication.
>> Therefore, the new statistical value added should be 1, reflecting the
>> distinct physical entry. If the rowkey is already present in HBase, the
>> value would indeed be 0, indicating that no new physical entry was created.
>> We need to take into account the underlying deduplication mechanism of the
>> database in use to ensure that statistical values accurately represent the
>> unique physical data entries.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Distinguished Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>> On Tue, 29 Aug 2023 at 02:07, Jia Fan <fanjia1...@gmail.com> wrote:
>>
>>> For databases with automatic deduplication capabilities, such as HBase:
>>> suppose we have inserted 100 rows with the same rowkey, but in fact only
>>> one row ends up in HBase. Is the new statistical value we add 100 or 1? Or,
>>> if HBase already contains this rowkey, would the value be 0? How should we
>>> handle this situation?
>>>
>>> On Tue, 29 Aug 2023 at 07:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> I have never been fond of the notion that measuring inserts, updates,
>>>> and deletes (referred to as DML) is the sole criterion for signaling a
>>>> necessity to update statistics for Spark's CBO. Nevertheless, in the
>>>> absence of an alternative mechanism, it seems this is the only approach at
>>>> our disposal (can we use AI for it 😁). Personally, I would prefer some
>>>> form of indication regarding shifts in the distribution of values in the
>>>> histogram, overall density, and similar indicators. The decision to execute
>>>> "ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS" revolves around
>>>> column-level statistics, which is why I would tend to focus on monitoring
>>>> individual column-level statistics to detect any signals warranting a
>>>> statistics update.
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>> On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Impressive, yet in the realm of classic DBMSs, it could be seen as a
>>>>> case of old wine in a new bottle. The objective, I assume, is to employ
>>>>> dynamic sampling to enhance the optimizer's capacity to create effective
>>>>> execution plans without the burden of complete I/O and in less time.
>>>>>
>>>>> For instance:
>>>>> ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent
>>>>>
>>>>> This approach could potentially aid in estimating deltas by utilizing
>>>>> sampling.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to propose the incremental collection of statistics in
>>>>>> spark. SPARK-44817
>>>>>> <https://issues.apache.org/jira/browse/SPARK-44817> has been raised
>>>>>> for the same.
>>>>>>
>>>>>> Currently, Spark invalidates the stats after data-changing commands,
>>>>>> which makes the CBO non-functional. To update these stats, the user either
>>>>>> needs to run the `ANALYZE TABLE` command or turn on
>>>>>> `spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
>>>>>> their own drawbacks: executing the `ANALYZE TABLE` command triggers a full
>>>>>> table scan, while the other one only updates table and partition stats and
>>>>>> can be costly in certain cases.
>>>>>>
>>>>>> The goal of this proposal is to collect stats incrementally while
>>>>>> executing data changing commands by utilizing the framework introduced in
>>>>>> SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>>>>>>
>>>>>> SPIP Document has been attached along with JIRA:
>>>>>>
>>>>>> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>>>>>>
>>>>>> Hive also supports automatic collection of statistics to keep the
>>>>>> stats consistent.
>>>>>> I can find multiple spark JIRAs asking for the same:
>>>>>> https://issues.apache.org/jira/browse/SPARK-28872
>>>>>> https://issues.apache.org/jira/browse/SPARK-33825
>>>>>>
>>>>>> Regards,
>>>>>> Rakesh
>>>>>>
>>>>>
>
> --
> --
> Regards,
> Chetan
>
> +353899475147
> +919665562626
>
>
