Hi PMC members, Bumping this idea for one last time to see if there are any approvals to take it forward.
Here is an initial Implementation draft PR https://github.com/apache/spark/pull/42352 and design doc: https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing Thank you, Pavan On Mon, Nov 13, 2023 at 6:57 AM Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote: > > > Here is an initial Implementation draft PR > https://github.com/apache/spark/pull/42352 and design doc: > https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing > > > On Sun, Nov 12, 2023 at 5:24 PM Pavan Kotikalapudi < > pkotikalap...@twilio.com> wrote: > >> Hi Dev community, >> >> Just bumping to see if there are more reviews to evaluate this idea of >> adding auto-scaling to structured streaming. >> >> Thanks again, >> >> Pavan >> >> On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi < >> pkotikalap...@twilio.com> wrote: >> >>> Thanks for the review Mich. >>> >>> I have updated the Q4 with as concise information as possible and left >>> the detailed explanation to Appendix. >>> >>> here is the updated answer to the Q4 >>> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg> >>> >>> Thank you, >>> >>> Pavan >>> >>> On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >>> >>>> Hi Pavan, >>>> >>>> I started reading your SPIP but have difficulty understanding it in >>>> detail. >>>> >>>> Specifically under Q4, " What is new in your approach and why do you >>>> think it will be successful?", I believe it would be better to remove the >>>> plots and focus on "what this proposed solution is going to add to the >>>> current play". At this stage a concise briefing would be appreciated and >>>> the specific plots should be left to the Appendix. >>>> >>>> HTH >>>> >>>> >>>> Mich Talebzadeh, >>>> Distinguished Technologist, Solutions Architect & Engineer >>>> London >>>> United Kingdom >>>> >>>> >>>> view my Linkedin profile >>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wSpQtgviw$> >>>> >>>> >>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wR3SukiIw$> >>>> >>>> >>>> >>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>> any loss, damage or destruction of data or any other property which may >>>> arise from relying on this email's technical content is explicitly >>>> disclaimed. The author will in no case be liable for any monetary damages >>>> arising from such loss, damage or destruction. >>>> >>>> >>>> >>>> >>>> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi < >>>> pkotikalap...@twilio.com> wrote: >>>> >>>>> IMO ML might be good for cluster scheduler but for the core DRA >>>>> algorithm of SSS I believe we should start with some primitives of >>>>> Structured streaming. I would love to get some reviews on the doc and >>>>> opinions on the feasibility of the solution. >>>>> >>>>> We have seen quite some savings using this solution in our team, Would >>>>> like to listen to the dev community to see if they are looking >>>>> for/interested in DRA for structured streaming. >>>>> >>>>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>>> Thank you for your comments. >>>>>> >>>>>> My vision of integrating machine learning (ML) into Spark Structured >>>>>> Streaming (SSS) for capacity planning and performance optimization seems >>>>>> to >>>>>> be promising. By leveraging ML techniques, I believe that we can >>>>>> potentially create predictive models that enhance the efficiency and >>>>>> resource allocation of the data processing pipelines. Here are some >>>>>> potential benefits and considerations for adding ML to SSS for capacity >>>>>> planning. However, I stand corrected >>>>>> >>>>>> 1. >>>>>> >>>>>> *Predictive Capacity Planning:* ML models can analyze historical >>>>>> data (that we discussed already), workloads, and trends to predict >>>>>> future >>>>>> resource needs accurately. This enables proactive scaling and >>>>>> allocation of >>>>>> resources, ensuring optimal performance during high-demand periods, >>>>>> such as >>>>>> times of high trades. >>>>>> 2. >>>>>> >>>>>> *Real-time Decision Making: *ML can be used to make real-time >>>>>> decisions on resource allocation (software and cluster) based on >>>>>> current >>>>>> data and conditions, allowing for dynamic adjustments to meet the >>>>>> processing demands. >>>>>> 3. >>>>>> >>>>>> *Complex Data Analysis: *In a heterogeneous setup involving >>>>>> multiple databases, ML can analyze various factors like data read and >>>>>> write >>>>>> times from different databases, data volumes, and data distribution >>>>>> patterns to optimize the overall data processing flow. >>>>>> 4. >>>>>> >>>>>> *Anomaly Detection: *ML models can identify unusual patterns or >>>>>> performance deviations, alerting us to potential issues before they >>>>>> impact >>>>>> the system. >>>>>> 5. >>>>>> >>>>>> Integration with Monitoring: ML models can work alongside >>>>>> monitoring tools, gathering real-time data on various performance >>>>>> metrics, >>>>>> and using this data for making intelligent decisions on capacity and >>>>>> resource allocation. >>>>>> >>>>>> However, there are some important considerations to keep in mind: >>>>>> >>>>>> 1. >>>>>> >>>>>> *Model Training: *ML models require training and validation using >>>>>> relevant data. Our DS colleagues need to define appropriate features, >>>>>> select the right ML algorithms, and fine-tune the model parameters to >>>>>> achieve optimal performance. >>>>>> 2. >>>>>> >>>>>> *Complexity:* Integrating ML adds complexity to our architecture. >>>>>> Moreover, we need to have the necessary expertise in both Spark >>>>>> Structured >>>>>> Streaming and machine learning to design, implement, and maintain the >>>>>> system effectively. >>>>>> 3. >>>>>> >>>>>> *Resource Overhead: *ML algorithms can be resource-intensive. We >>>>>> ought to consider the additional computational requirements, >>>>>> especially >>>>>> during the model training and inference phases. >>>>>> 4. >>>>>> >>>>>> In summary, this idea of utilizing ML for capacity planning in >>>>>> Spark Structured Streaming can possibly hold significant potential for >>>>>> improving system performance and resource utilization. Having said >>>>>> that, I >>>>>> totally agree that we need to evaluate the feasibility, potential >>>>>> benefits, >>>>>> and challenges and we will need involving experts in both Spark and >>>>>> machine >>>>>> learning to ensure a successful outcome. >>>>>> >>>>>> HTH >>>>>> >>>>>> Mich Talebzadeh, >>>>>> Solutions Architect/Engineering Lead >>>>>> London >>>>>> United Kingdom >>>>>> >>>>>> >>>>>> view my Linkedin profile >>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>>>> >>>>>> >>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>>>> >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson < >>>>>> martin.anders...@kambi.com> wrote: >>>>>> >>>>>>> IMO, using any kind of machine learning or AI for DRA is overkill. >>>>>>> The effort involved would be considerable and likely counterproductive, >>>>>>> compared to a more conventional approach of comparing the rate of >>>>>>> incoming >>>>>>> stream data with the effort of handling previous data rates. >>>>>>> ------------------------------ >>>>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com> >>>>>>> *Sent:* Tuesday, August 8, 2023 19:59 >>>>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com> >>>>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org> >>>>>>> *Subject:* Re: Dynamic resource allocation for structured streaming >>>>>>> [SPARK-24815] >>>>>>> >>>>>>> >>>>>>> EXTERNAL SENDER. Do not click links or open attachments unless you >>>>>>> recognize the sender and know the content is safe. DO NOT provide your >>>>>>> username or password. >>>>>>> >>>>>>> I am currently contemplating and sharing my thoughts openly. >>>>>>> Considering our reliance on previously collected statistics (as >>>>>>> mentioned >>>>>>> earlier), it raises the question of why we couldn't integrate certain >>>>>>> machine learning elements into Spark Structured Streaming? While this >>>>>>> might >>>>>>> slightly deviate from our current topic, I am not an expert in machine >>>>>>> learning. However, there are individuals who possess the expertise to >>>>>>> assist us in exploring this avenue. >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> Mich Talebzadeh, >>>>>>> Solutions Architect/Engineering Lead >>>>>>> London >>>>>>> United Kingdom >>>>>>> >>>>>>> >>>>>>> view my Linkedin profile >>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>>>>> >>>>>>> >>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>>> for any loss, damage or destruction of data or any other property which >>>>>>> may >>>>>>> arise from relying on this email's technical content is explicitly >>>>>>> disclaimed. The author will in no case be liable for any monetary >>>>>>> damages >>>>>>> arising from such loss, damage or destruction. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi < >>>>>>> pkotikalap...@twilio.com> wrote: >>>>>>> >>>>>>> Listeners are the best resources to the allocation manager afaik... >>>>>>> It already has SparkListener >>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala*L640__;Iw!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-YRkCAu0w$> >>>>>>> that >>>>>>> it utilizes. We can use it to extract more information (like processing >>>>>>> times). >>>>>>> The one with more information regarding streaming query resides in sql >>>>>>> module >>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-Y_DIYqaw$> >>>>>>> though. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> Pavan >>>>>>> >>>>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh < >>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>> >>>>>>> Hi Pavan or anyone else >>>>>>> >>>>>>> Is there any way one access the matrix displayed on SparkGUI? For >>>>>>> example the readings for processing time? Can these be acessed? >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> For example, >>>>>>> Mich Talebzadeh, >>>>>>> Solutions Architect/Engineering Lead >>>>>>> London >>>>>>> United Kingdom >>>>>>> >>>>>>> >>>>>>> view my Linkedin profile >>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6VrCySTg$> >>>>>>> >>>>>>> >>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d4r4xOqSg$> >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>>> for any loss, damage or destruction of data or any other property which >>>>>>> may >>>>>>> arise from relying on this email's technical content is explicitly >>>>>>> disclaimed. The author will in no case be liable for any monetary >>>>>>> damages >>>>>>> arising from such loss, damage or destruction. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi < >>>>>>> pkotikalap...@twilio.com> wrote: >>>>>>> >>>>>>> Thanks for the review Mich, >>>>>>> >>>>>>> Yes, the configuration parameters we end up setting would be based >>>>>>> on the trigger interval. >>>>>>> >>>>>>> > If you are going to have additional indicators why not look at >>>>>>> scheduling delay as well >>>>>>> Yes. The implementation is based on scheduling delays, not for >>>>>>> pending tasks of the current stage but rather pending tasks of all >>>>>>> the stages in a micro-batch >>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352/files*diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025__;Iw!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6feoFH2Q$> >>>>>>> (hence >>>>>>> trigger interval). >>>>>>> >>>>>>> > we ought to utilise the historical statistics collected under the >>>>>>> checkpointing directory to get more accurate statistics >>>>>>> You are right! This is just a simple implementation based on one >>>>>>> factor, we should also look into other indicators as well If that would >>>>>>> help build a better scaling algorithm. >>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> Pavan >>>>>>> >>>>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh < >>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I glanced over the design doc. >>>>>>> >>>>>>> You are providing certain configuration parameters plus some >>>>>>> settings based on static values. For example: >>>>>>> >>>>>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s >>>>>>> >>>>>>> I cannot see any use of <processing time> which ought to be at least >>>>>>> half of the batch interval to have the correct margins (confidence >>>>>>> level). If >>>>>>> you are going to have additional indicators why not look at scheduling >>>>>>> delay as well. Moreover most of the needed statistics are also >>>>>>> available to >>>>>>> set accurate values. My inclination is that this is a great effort >>>>>>> but we ought to utilise the historical statistics collected under >>>>>>> checkpointing directory to get more accurate statistics. I will >>>>>>> review the design document in duew course >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> Mich Talebzadeh, >>>>>>> Solutions Architect/Engineering Lead >>>>>>> London >>>>>>> United Kingdom >>>>>>> >>>>>>> >>>>>>> view my Linkedin profile >>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFr9YSZnw$> >>>>>>> >>>>>>> >>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLEPx44C1w$> >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>>> for any loss, damage or destruction of data or any other property which >>>>>>> may >>>>>>> arise from relying on this email's technical content is explicitly >>>>>>> disclaimed. The author will in no case be liable for any monetary >>>>>>> damages >>>>>>> arising from such loss, damage or destruction. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi >>>>>>> <pkotikalap...@twilio.com.invalid> wrote: >>>>>>> >>>>>>> Hi Spark Dev, >>>>>>> >>>>>>> I have extended traditional DRA to work for structured streaming >>>>>>> use-case. >>>>>>> >>>>>>> Here is an initial Implementation draft PR >>>>>>> https://github.com/apache/spark/pull/42352 >>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLHLe7WCUw$> >>>>>>> and >>>>>>> design doc: >>>>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing >>>>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFAjJfilg$> >>>>>>> >>>>>>> Please review and let me know what you think. >>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> Pavan >>>>>>> >>>>>>>