Thanks for the detailed explanation.
Regards,
Chetan

On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh <[email protected]> wrote:

OK, let us take a deeper look here.

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

In the above, we are explicitly grouping columns c1 and c2 together for
which we want to compute statistics. Additionally, we are also computing
statistics for column c3 independently. This approach allows the CBO to
treat columns c1 and c2 as a group and compute joint statistics for them,
while computing separate statistics for column c3.

If columns c1 and c2 are frequently used together in conditions, I concur
it makes sense to compute joint statistics for them by using the above
syntax. On the other hand, if each column has its own significance and the
relationship between them is not crucial, we can use

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3

This syntax computes separate statistics for each column.

So your mileage may vary.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Tue, 29 Aug 2023 at 12:14, Chetan <[email protected]> wrote:

Hi,

If we are taking this up, then I would ask: can we support multi-column
stats such as

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

This should help in estimating better for conditions involving c1 and c2.

Thanks.
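The benefit of grouped statistics for correlated columns can be sketched in plain Python (this is an illustration of the estimation arithmetic, not Spark code; the data and predicate are made up). When c1 and c2 are correlated, multiplying per-column selectivities badly underestimates the true selectivity, while a joint distinct-value table gets it right:

```python
import random
from collections import Counter

random.seed(0)
# c2 is perfectly correlated with c1, so the true selectivity of
# (c1 = 3 AND c2 = 3) equals the selectivity of c1 = 3 alone.
rows = [(v, v) for v in (random.randint(0, 9) for _ in range(10_000))]

true_sel = sum(1 for c1, c2 in rows if c1 == 3 and c2 == 3) / len(rows)

# Independent-column model: multiply per-column selectivities.
sel_c1 = sum(1 for c1, _ in rows if c1 == 3) / len(rows)
sel_c2 = sum(1 for _, c2 in rows if c2 == 3) / len(rows)
independent_estimate = sel_c1 * sel_c2  # roughly an order of magnitude too low

# Joint model: one frequency table over the (c1, c2) pair.
joint = Counter(rows)
joint_estimate = joint[(3, 3)] / len(rows)  # matches the true selectivity

print(true_sel, independent_estimate, joint_estimate)
```

This is exactly the case where a grouped `(c1, c2)` statistic would let the CBO avoid the independence assumption.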
On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <[email protected]> wrote:

Short answer off the top of my head:

My point was with regard to the Cost Based Optimizer (CBO) in traditional
databases. The concept of a rowkey in HBase is somewhat similar to that of
a primary key in an RDBMS. Now, in databases with automatic deduplication
features (i.e. which ignore duplication of the rowkey), inserting 100 rows
with the same rowkey actually results in only one physical entry in the
database due to deduplication. Therefore, the new statistical value added
should be 1, reflecting the distinct physical entry. If the rowkey is
already present in HBase, the value would indeed be 0, indicating that no
new physical entry was created. We need to take into account the underlying
deduplication mechanism of the database in use to ensure that statistical
values accurately represent the unique physical data entries.

HTH

Mich

On Tue, 29 Aug 2023 at 02:07, Jia Fan <[email protected]> wrote:

For those databases with automatic deduplication capabilities, such as
HBase, we may have inserted 100 rows with the same rowkey, but in fact
there is only one row in HBase. Is the new statistical value we add 100 or
1? Or, if HBase already contains this rowkey, would the value be 0?
How should we handle this situation?

On Tue, 29 Aug 2023 at 07:22, Mich Talebzadeh <[email protected]> wrote:

I have never been fond of the notion that measuring inserts, updates, and
deletes (referred to as DML) is the sole criterion for signaling a
necessity to update statistics for Spark's CBO. Nevertheless, in the
absence of an alternative mechanism, it seems this is the only approach at
our disposal (can we use AI for it 😁). Personally, I would prefer some
form of indication regarding shifts in the distribution of values in the
histogram, overall density, and similar indicators. The decision to
execute ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS revolves around
column-level statistics, which is why I would tend to focus on monitoring
individual column-level statistics to detect any signals warranting a
statistics update.

HTH

Mich

On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <[email protected]> wrote:

Hi,

Impressive, yet in the realm of classic DBMSs it could be seen as a case
of old wine in a new bottle.
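The column-level drift signal suggested above could be sketched roughly as follows: compare the stored value histogram for a column against a fresh one and flag a statistics refresh when the distributions diverge. The function name, the total-variation metric, and the threshold are all hypothetical illustrations, not an existing Spark API:

```python
from collections import Counter

def needs_reanalysis(stored_hist, current_hist, threshold=0.2):
    """Flag a statistics refresh when the column's value distribution has
    drifted: total variation distance between the stored and current
    histograms exceeds `threshold` (hypothetical heuristic)."""
    stored_total = sum(stored_hist.values()) or 1
    current_total = sum(current_hist.values()) or 1
    keys = set(stored_hist) | set(current_hist)
    tv_distance = 0.5 * sum(
        abs(stored_hist.get(k, 0) / stored_total
            - current_hist.get(k, 0) / current_total)
        for k in keys
    )
    return tv_distance > threshold

# Histogram captured at the last ANALYZE: values roughly uniform.
stored = Counter({"a": 100, "b": 100, "c": 100})
# After a batch of inserts, the distribution is heavily skewed by "d".
current = Counter({"a": 100, "b": 100, "c": 100, "d": 500})

print(needs_reanalysis(stored, current))  # distribution shifted -> True
print(needs_reanalysis(stored, stored))   # unchanged -> False
```

A signal like this would trigger ANALYZE only when the column statistics have actually gone stale, rather than after every DML command.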
The objective, I assume, is to employ dynamic sampling to enhance the
optimizer's capacity to create effective execution plans without the
burden of complete I/O, and in less time.

For instance:

ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent

This approach could potentially aid in estimating deltas by utilizing
sampling.

HTH

Mich

On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <[email protected]> wrote:

Hi all,

I would like to propose the incremental collection of statistics in Spark.
SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been
raised for the same.

Currently, Spark invalidates the stats after data-changing commands, which
makes the CBO non-functional. To update these stats, the user either needs
to run the `ANALYZE TABLE` command or turn on
`spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
their own drawbacks: executing the `ANALYZE TABLE` command triggers a full
table scan, while the other only updates table and partition stats and can
be costly in certain cases.
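The sampling idea suggested above can be sketched in plain Python: scan only a fraction of the rows and scale the counts up. This is illustrative only; `WITH SAMPLING` is a syntax proposed in this thread, not something this sketch claims Spark implements.

```python
import random

random.seed(42)
table = [random.randint(0, 999) for _ in range(100_000)]  # one numeric column

SAMPLE_FRACTION = 0.05  # the "5 percent" from the suggested syntax
sample = [v for v in table if random.random() < SAMPLE_FRACTION]

# Scale sampled counts back up to estimate table-level statistics.
estimated_rows = int(len(sample) / SAMPLE_FRACTION)
estimated_min, estimated_max = min(sample), max(sample)

true_rows = len(table)
print(estimated_rows, true_rows)  # close, at a fraction of the scan cost
```

The trade-off is the classic one: the estimate carries sampling error, but it touches 5% of the data instead of 100%.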
The goal of this proposal is to collect stats incrementally while
executing data-changing commands, by utilizing the framework introduced in
SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.

The SPIP document has been attached along with the JIRA:
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

Hive also supports automatic collection of statistics to keep the stats
consistent. I can find multiple Spark JIRAs asking for the same:
https://issues.apache.org/jira/browse/SPARK-28872
https://issues.apache.org/jira/browse/SPARK-33825

Regards,
Rakesh

--
Regards,
Chetan

+353899475147
+919665562626
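To make the incremental-collection idea from the proposal concrete, here is a minimal sketch: fold per-batch stat deltas into the stored column stats instead of rescanning the table. The `ColumnStats` shape and merge rules are assumptions for illustration, not the SPIP's actual design.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    row_count: int
    null_count: int
    min_value: float
    max_value: float

def stats_of(batch):
    """Compute stats for one batch of newly written values (None = NULL)."""
    non_null = [v for v in batch if v is not None]
    return ColumnStats(
        row_count=len(batch),
        null_count=len(batch) - len(non_null),
        min_value=min(non_null),
        max_value=max(non_null),
    )

def merge(stored, delta):
    """Fold a batch delta into the stored stats; no table rescan needed.
    Counts, min and max merge exactly; NDV and histograms would need a
    mergeable sketch (e.g. HyperLogLog), which is omitted here."""
    return ColumnStats(
        row_count=stored.row_count + delta.row_count,
        null_count=stored.null_count + delta.null_count,
        min_value=min(stored.min_value, delta.min_value),
        max_value=max(stored.max_value, delta.max_value),
    )

stored = stats_of([1.0, 5.0, None, 3.0])  # stats from the initial load
delta = stats_of([7.0, None, -2.0])       # stats of an INSERT batch
merged = merge(stored, delta)
print(merged)  # row_count=7, null_count=2, min=-2.0, max=7.0
```

Note that this merge only works for metrics that compose (counts, min, max); it does not by itself answer the deduplication question raised earlier in the thread, where the delta's true row count depends on what the storage layer actually kept.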
