Thanks for the detailed explanation.
Regards,
Chetan

On Tue, Aug 29, 2023, 4:50 PM Mich Talebzadeh <[email protected]> wrote:

OK, let us take a deeper look here.

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

In the above, we are explicitly grouping columns c1 and c2 together for
which we want to compute statistics. Additionally, we are also computing
statistics for column c3 independently. This approach allows the CBO to
treat columns c1 and c2 as a group and compute joint statistics for them,
while computing separate statistics for column c3.

If columns c1 and c2 are frequently used together in conditions, I concur
it makes sense to compute joint statistics for them by using the above
syntax. On the other hand, if each column has its own significance and the
relationship between them is not crucial, we can use

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS c1, c2, c3

This syntax computes separate statistics for each column.

So your mileage may vary.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Tue, 29 Aug 2023 at 12:14, Chetan <[email protected]> wrote:

Hi,

If we are taking this up, then I would ask: can we support multi-column
stats such as

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS (c1, c2), c3

This should help in estimating better for conditions involving c1 and c2.

Thanks.
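The benefit of grouped statistics for correlated columns can be sketched in plain Python (this is an illustration of the estimation arithmetic, not Spark code; the data and predicate are made up). When c1 and c2 are correlated, multiplying per-column selectivities badly underestimates the true selectivity, while a joint distinct-value table gets it right:

```python
import random
from collections import Counter

random.seed(0)
# c2 is perfectly correlated with c1, so the true selectivity of
# (c1 = 3 AND c2 = 3) equals the selectivity of c1 = 3 alone.
rows = [(v, v) for v in (random.randint(0, 9) for _ in range(10_000))]

true_sel = sum(1 for c1, c2 in rows if c1 == 3 and c2 == 3) / len(rows)

# Independent-column model: multiply per-column selectivities.
sel_c1 = sum(1 for c1, _ in rows if c1 == 3) / len(rows)
sel_c2 = sum(1 for _, c2 in rows if c2 == 3) / len(rows)
independent_estimate = sel_c1 * sel_c2  # roughly an order of magnitude too low

# Joint model: one frequency table over the (c1, c2) pair.
joint = Counter(rows)
joint_estimate = joint[(3, 3)] / len(rows)  # matches the true selectivity

print(true_sel, independent_estimate, joint_estimate)
```

This is exactly the case where a grouped `(c1, c2)` statistic would let the CBO avoid the independence assumption.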
On Tue, 29 Aug 2023 at 09:05, Mich Talebzadeh <[email protected]> wrote:

Short answer off the top of my head:

My point was with regard to the Cost Based Optimizer (CBO) in traditional
databases. The concept of a rowkey in HBase is somewhat similar to that of
a primary key in an RDBMS. Now, in databases with automatic deduplication
features (i.e. which ignore duplication of the rowkey), inserting 100 rows
with the same rowkey actually results in only one physical entry in the
database due to deduplication. Therefore, the new statistical value added
should be 1, reflecting the distinct physical entry. If the rowkey is
already present in HBase, the value would indeed be 0, indicating that no
new physical entry was created. We need to take into account the underlying
deduplication mechanism of the database in use to ensure that statistical
values accurately represent the unique physical data entries.

HTH

Mich

On Tue, 29 Aug 2023 at 02:07, Jia Fan <[email protected]> wrote:

For those databases with automatic deduplication capabilities, such as
HBase, we may have inserted 100 rows with the same rowkey, but in fact
there is only one row in HBase. Is the new statistical value we add 100 or
1? Or, if HBase already contains this rowkey, would the value be 0?
How should we handle this situation?

On Tue, 29 Aug 2023 at 07:22, Mich Talebzadeh <[email protected]> wrote:

I have never been fond of the notion that measuring inserts, updates, and
deletes (referred to as DML) is the sole criterion for signaling a
necessity to update statistics for Spark's CBO. Nevertheless, in the
absence of an alternative mechanism, it seems this is the only approach at
our disposal (can we use AI for it 😁). Personally, I would prefer some
form of indication regarding shifts in the distribution of values in the
histogram, overall density, and similar indicators. The decision to
execute ANALYZE TABLE xyz COMPUTE STATISTICS FOR COLUMNS revolves around
column-level statistics, which is why I would tend to focus on monitoring
individual column-level statistics to detect any signals warranting a
statistics update.

HTH

Mich

On Sat, 26 Aug 2023 at 21:30, Mich Talebzadeh <[email protected]> wrote:

Hi,

Impressive, yet in the realm of classic DBMSs it could be seen as a case
of old wine in a new bottle.
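The column-level drift signal suggested above could be sketched roughly as follows: compare the stored value histogram for a column against a fresh one and flag a statistics refresh when the distributions diverge. The function name, the total-variation metric, and the threshold are all hypothetical illustrations, not an existing Spark API:

```python
from collections import Counter

def needs_reanalysis(stored_hist, current_hist, threshold=0.2):
    """Flag a statistics refresh when the column's value distribution has
    drifted: total variation distance between the stored and current
    histograms exceeds `threshold` (hypothetical heuristic)."""
    stored_total = sum(stored_hist.values()) or 1
    current_total = sum(current_hist.values()) or 1
    keys = set(stored_hist) | set(current_hist)
    tv_distance = 0.5 * sum(
        abs(stored_hist.get(k, 0) / stored_total
            - current_hist.get(k, 0) / current_total)
        for k in keys
    )
    return tv_distance > threshold

# Histogram captured at the last ANALYZE: values roughly uniform.
stored = Counter({"a": 100, "b": 100, "c": 100})
# After a batch of inserts, the distribution is heavily skewed by "d".
current = Counter({"a": 100, "b": 100, "c": 100, "d": 500})

print(needs_reanalysis(stored, current))  # distribution shifted -> True
print(needs_reanalysis(stored, stored))   # unchanged -> False
```

A signal like this would trigger ANALYZE only when the column statistics have actually gone stale, rather than after every DML command.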
The objective, I assume, is to employ dynamic sampling to enhance the
optimizer's capacity to create effective execution plans without the
burden of complete I/O, and in less time.

For instance:

ANALYZE TABLE xyz COMPUTE STATISTICS WITH SAMPLING = 5 percent

This approach could potentially aid in estimating deltas by utilizing
sampling.

HTH

Mich

On Sat, 26 Aug 2023 at 20:58, RAKSON RAKESH <[email protected]> wrote:

Hi all,

I would like to propose the incremental collection of statistics in Spark.
SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been
raised for the same.

Currently, Spark invalidates the stats after data-changing commands, which
makes the CBO non-functional. To update these stats, the user either needs
to run the `ANALYZE TABLE` command or turn on
`spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
their own drawbacks: executing the `ANALYZE TABLE` command triggers a full
table scan, while the other only updates table and partition stats and can
be costly in certain cases.
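The sampling idea suggested above can be sketched in plain Python: scan only a fraction of the rows and scale the counts up. This is illustrative only; `WITH SAMPLING` is a syntax proposed in this thread, not something this sketch claims Spark implements.

```python
import random

random.seed(42)
table = [random.randint(0, 999) for _ in range(100_000)]  # one numeric column

SAMPLE_FRACTION = 0.05  # the "5 percent" from the suggested syntax
sample = [v for v in table if random.random() < SAMPLE_FRACTION]

# Scale sampled counts back up to estimate table-level statistics.
estimated_rows = int(len(sample) / SAMPLE_FRACTION)
estimated_min, estimated_max = min(sample), max(sample)

true_rows = len(table)
print(estimated_rows, true_rows)  # close, at a fraction of the scan cost
```

The trade-off is the classic one: the estimate carries sampling error, but it touches 5% of the data instead of 100%.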
The goal of this proposal is to collect stats incrementally while
executing data-changing commands, by utilizing the framework introduced in
SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.

The SPIP document has been attached along with the JIRA:
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

Hive also supports automatic collection of statistics to keep the stats
consistent. I can find multiple Spark JIRAs asking for the same:
https://issues.apache.org/jira/browse/SPARK-28872
https://issues.apache.org/jira/browse/SPARK-33825

Regards,
Rakesh

--
Regards,
Chetan

+353899475147
+919665562626
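To make the incremental-collection idea from the proposal concrete, here is a minimal sketch: fold per-batch stat deltas into the stored column stats instead of rescanning the table. The `ColumnStats` shape and merge rules are assumptions for illustration, not the SPIP's actual design.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    row_count: int
    null_count: int
    min_value: float
    max_value: float

def stats_of(batch):
    """Compute stats for one batch of newly written values (None = NULL)."""
    non_null = [v for v in batch if v is not None]
    return ColumnStats(
        row_count=len(batch),
        null_count=len(batch) - len(non_null),
        min_value=min(non_null),
        max_value=max(non_null),
    )

def merge(stored, delta):
    """Fold a batch delta into the stored stats; no table rescan needed.
    Counts, min and max merge exactly; NDV and histograms would need a
    mergeable sketch (e.g. HyperLogLog), which is omitted here."""
    return ColumnStats(
        row_count=stored.row_count + delta.row_count,
        null_count=stored.null_count + delta.null_count,
        min_value=min(stored.min_value, delta.min_value),
        max_value=max(stored.max_value, delta.max_value),
    )

stored = stats_of([1.0, 5.0, None, 3.0])  # stats from the initial load
delta = stats_of([7.0, None, -2.0])       # stats of an INSERT batch
merged = merge(stored, delta)
print(merged)  # row_count=7, null_count=2, min=-2.0, max=7.0
```

Note that this merge only works for metrics that compose (counts, min, max); it does not by itself answer the deduplication question raised earlier in the thread, where the delta's true row count depends on what the storage layer actually kept.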
