Hi,

Impressive, though in the world of classic DBMSs this could be seen as a
case of old wine in a new bottle. The objective, I assume, is to use dynamic
sampling so that the optimizer can build effective execution plans without
the I/O cost and elapsed time of a full table scan.

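For instance (illustrative syntax in the style of classic DBMSs; as far as
I know, Spark's ANALYZE TABLE has no sampling clause today):
ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent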

Sampling of this kind could potentially also help in estimating the deltas
introduced by data-changing commands.
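
To make the idea concrete, here is a rough sketch in plain Python (not Spark
code, and not what the SPIP proposes). It assumes a Bernoulli sample over the
newly written rows and folds the estimate into existing table stats. The
names sample_stats and merge_stats, the stats layout and the numbers are all
made up for illustration.

```python
import random

def sample_stats(rows, fraction, seed=42):
    """Estimate row count, min and max of `rows` from a Bernoulli sample."""
    rng = random.Random(seed)
    # Keep each row independently with probability `fraction`.
    sample = [r for r in rows if rng.random() < fraction]
    return {
        # Scale the sampled count back up to estimate the full row count.
        "row_count": round(len(sample) / fraction),
        # Sample min/max only bound the true range from the inside.
        "min": min(sample) if sample else None,
        "max": max(sample) if sample else None,
    }

def merge_stats(base, delta):
    """Fold estimated stats for newly written rows into existing table stats."""
    lows = [s["min"] for s in (base, delta) if s["min"] is not None]
    highs = [s["max"] for s in (base, delta) if s["max"] is not None]
    return {
        "row_count": base["row_count"] + delta["row_count"],
        "min": min(lows) if lows else None,
        "max": max(highs) if highs else None,
    }

# Hypothetical existing stats and a hypothetical batch of inserted values.
existing = {"row_count": 1_000_000, "min": 0, "max": 500}
new_rows = range(400, 100_400)
delta = sample_stats(new_rows, fraction=0.05)
print(merge_stats(existing, delta))
```

The row count is an estimate with sampling error, and the sampled min/max can
miss the true extremes, so stats built this way are approximate by design.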

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <raksonrak...@gmail.com> wrote:

> Hi all,
>
> I would like to propose the incremental collection of statistics in Spark.
> SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been
> raised for this.
>
> Currently, Spark invalidates the stats after data-changing commands, which
> makes the CBO non-functional. To update these stats, the user needs to
> either run the `ANALYZE TABLE` command or turn on
> `spark.sql.statistics.size.autoUpdate.enabled`. Both approaches have
> drawbacks: executing `ANALYZE TABLE` triggers a full table scan, while
> the auto-update setting only updates table and partition stats and can be
> costly in certain cases.
>
> The goal of this proposal is to collect stats incrementally while
> executing data-changing commands by utilizing the framework introduced in
> SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.
>
> The SPIP document has been attached to the JIRA:
>
> https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
>
> Hive also supports automatic collection of statistics to keep the stats
> consistent.
> I can find multiple Spark JIRAs asking for the same:
> https://issues.apache.org/jira/browse/SPARK-28872
> https://issues.apache.org/jira/browse/SPARK-33825
>
> Regards,
> Rakesh
>
