Here is an initial Implementation draft PR https://github.com/apache/spark/pull/42352 and design doc: https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
On Sun, Nov 12, 2023 at 5:24 PM Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote: > Hi Dev community, > > Just bumping to see if there are more reviews to evaluate this idea of > adding auto-scaling to structured streaming. > > Thanks again, > > Pavan > > On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi < > pkotikalap...@twilio.com> wrote: > >> Thanks for the review Mich. >> >> I have updated the Q4 with as concise information as possible and left >> the detailed explanation to Appendix. >> >> here is the updated answer to the Q4 >> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg> >> >> Thank you, >> >> Pavan >> >> On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> Hi Pavan, >>> >>> I started reading your SPIP but have difficulty understanding it in >>> detail. >>> >>> Specifically under Q4, " What is new in your approach and why do you >>> think it will be successful?", I believe it would be better to remove the >>> plots and focus on "what this proposed solution is going to add to the >>> current play". At this stage a concise briefing would be appreciated and >>> the specific plots should be left to the Appendix. >>> >>> HTH >>> >>> >>> Mich Talebzadeh, >>> Distinguished Technologist, Solutions Architect & Engineer >>> London >>> United Kingdom >>> >>> >>> view my Linkedin profile >>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wSpQtgviw$> >>> >>> >>> https://en.everybodywiki.com/Mich_Talebzadeh >>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wR3SukiIw$> >>> >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi < >>> pkotikalap...@twilio.com> wrote: >>> >>>> IMO ML might be good for cluster scheduler but for the core DRA >>>> algorithm of SSS I believe we should start with some primitives of >>>> Structured streaming. I would love to get some reviews on the doc and >>>> opinions on the feasibility of the solution. >>>> >>>> We have seen quite some savings using this solution in our team, Would >>>> like to listen to the dev community to see if they are looking >>>> for/interested in DRA for structured streaming. >>>> >>>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh < >>>> mich.talebza...@gmail.com> wrote: >>>> >>>>> Thank you for your comments. >>>>> >>>>> My vision of integrating machine learning (ML) into Spark Structured >>>>> Streaming (SSS) for capacity planning and performance optimization seems >>>>> to >>>>> be promising. By leveraging ML techniques, I believe that we can >>>>> potentially create predictive models that enhance the efficiency and >>>>> resource allocation of the data processing pipelines. Here are some >>>>> potential benefits and considerations for adding ML to SSS for capacity >>>>> planning. However, I stand corrected >>>>> >>>>> 1. >>>>> >>>>> *Predictive Capacity Planning:* ML models can analyze historical >>>>> data (that we discussed already), workloads, and trends to predict >>>>> future >>>>> resource needs accurately. This enables proactive scaling and >>>>> allocation of >>>>> resources, ensuring optimal performance during high-demand periods, >>>>> such as >>>>> times of high trades. >>>>> 2. >>>>> >>>>> *Real-time Decision Making: *ML can be used to make real-time >>>>> decisions on resource allocation (software and cluster) based on >>>>> current >>>>> data and conditions, allowing for dynamic adjustments to meet the >>>>> processing demands. >>>>> 3. >>>>> >>>>> *Complex Data Analysis: *In a heterogeneous setup involving >>>>> multiple databases, ML can analyze various factors like data read and >>>>> write >>>>> times from different databases, data volumes, and data distribution >>>>> patterns to optimize the overall data processing flow. >>>>> 4. >>>>> >>>>> *Anomaly Detection: *ML models can identify unusual patterns or >>>>> performance deviations, alerting us to potential issues before they >>>>> impact >>>>> the system. >>>>> 5. >>>>> >>>>> Integration with Monitoring: ML models can work alongside >>>>> monitoring tools, gathering real-time data on various performance >>>>> metrics, >>>>> and using this data for making intelligent decisions on capacity and >>>>> resource allocation. >>>>> >>>>> However, there are some important considerations to keep in mind: >>>>> >>>>> 1. >>>>> >>>>> *Model Training: *ML models require training and validation using >>>>> relevant data. Our DS colleagues need to define appropriate features, >>>>> select the right ML algorithms, and fine-tune the model parameters to >>>>> achieve optimal performance. >>>>> 2. >>>>> >>>>> *Complexity:* Integrating ML adds complexity to our architecture. >>>>> Moreover, we need to have the necessary expertise in both Spark >>>>> Structured >>>>> Streaming and machine learning to design, implement, and maintain the >>>>> system effectively. >>>>> 3. >>>>> >>>>> *Resource Overhead: *ML algorithms can be resource-intensive. We >>>>> ought to consider the additional computational requirements, especially >>>>> during the model training and inference phases. >>>>> 4. >>>>> >>>>> In summary, this idea of utilizing ML for capacity planning in >>>>> Spark Structured Streaming can possibly hold significant potential for >>>>> improving system performance and resource utilization. Having said >>>>> that, I >>>>> totally agree that we need to evaluate the feasibility, potential >>>>> benefits, >>>>> and challenges and we will need involving experts in both Spark and >>>>> machine >>>>> learning to ensure a successful outcome. >>>>> >>>>> HTH >>>>> >>>>> Mich Talebzadeh, >>>>> Solutions Architect/Engineering Lead >>>>> London >>>>> United Kingdom >>>>> >>>>> >>>>> view my Linkedin profile >>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>>> >>>>> >>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>>> >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson < >>>>> martin.anders...@kambi.com> wrote: >>>>> >>>>>> IMO, using any kind of machine learning or AI for DRA is overkill. >>>>>> The effort involved would be considerable and likely counterproductive, >>>>>> compared to a more conventional approach of comparing the rate of >>>>>> incoming >>>>>> stream data with the effort of handling previous data rates. >>>>>> ------------------------------ >>>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com> >>>>>> *Sent:* Tuesday, August 8, 2023 19:59 >>>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com> >>>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org> >>>>>> *Subject:* Re: Dynamic resource allocation for structured streaming >>>>>> [SPARK-24815] >>>>>> >>>>>> >>>>>> EXTERNAL SENDER. Do not click links or open attachments unless you >>>>>> recognize the sender and know the content is safe. DO NOT provide your >>>>>> username or password. >>>>>> >>>>>> I am currently contemplating and sharing my thoughts openly. >>>>>> Considering our reliance on previously collected statistics (as mentioned >>>>>> earlier), it raises the question of why we couldn't integrate certain >>>>>> machine learning elements into Spark Structured Streaming? While this >>>>>> might >>>>>> slightly deviate from our current topic, I am not an expert in machine >>>>>> learning. However, there are individuals who possess the expertise to >>>>>> assist us in exploring this avenue. >>>>>> >>>>>> HTH >>>>>> >>>>>> Mich Talebzadeh, >>>>>> Solutions Architect/Engineering Lead >>>>>> London >>>>>> United Kingdom >>>>>> >>>>>> >>>>>> view my Linkedin profile >>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$> >>>>>> >>>>>> >>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$> >>>>>> >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi < >>>>>> pkotikalap...@twilio.com> wrote: >>>>>> >>>>>> Listeners are the best resources to the allocation manager afaik... >>>>>> It already has SparkListener >>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala*L640__;Iw!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-YRkCAu0w$> >>>>>> that >>>>>> it utilizes. We can use it to extract more information (like processing >>>>>> times). >>>>>> The one with more information regarding streaming query resides in sql >>>>>> module >>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-Y_DIYqaw$> >>>>>> though. >>>>>> >>>>>> Thanks >>>>>> >>>>>> Pavan >>>>>> >>>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh < >>>>>> mich.talebza...@gmail.com> wrote: >>>>>> >>>>>> Hi Pavan or anyone else >>>>>> >>>>>> Is there any way one access the matrix displayed on SparkGUI? For >>>>>> example the readings for processing time? Can these be acessed? >>>>>> >>>>>> Thanks >>>>>> >>>>>> For example, >>>>>> Mich Talebzadeh, >>>>>> Solutions Architect/Engineering Lead >>>>>> London >>>>>> United Kingdom >>>>>> >>>>>> >>>>>> view my Linkedin profile >>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6VrCySTg$> >>>>>> >>>>>> >>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d4r4xOqSg$> >>>>>> >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi < >>>>>> pkotikalap...@twilio.com> wrote: >>>>>> >>>>>> Thanks for the review Mich, >>>>>> >>>>>> Yes, the configuration parameters we end up setting would be based on >>>>>> the trigger interval. >>>>>> >>>>>> > If you are going to have additional indicators why not look at >>>>>> scheduling delay as well >>>>>> Yes. The implementation is based on scheduling delays, not for >>>>>> pending tasks of the current stage but rather pending tasks of all >>>>>> the stages in a micro-batch >>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352/files*diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025__;Iw!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6feoFH2Q$> >>>>>> (hence >>>>>> trigger interval). >>>>>> >>>>>> > we ought to utilise the historical statistics collected under the >>>>>> checkpointing directory to get more accurate statistics >>>>>> You are right! This is just a simple implementation based on one >>>>>> factor, we should also look into other indicators as well If that would >>>>>> help build a better scaling algorithm. >>>>>> >>>>>> Thank you, >>>>>> >>>>>> Pavan >>>>>> >>>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh < >>>>>> mich.talebza...@gmail.com> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> I glanced over the design doc. >>>>>> >>>>>> You are providing certain configuration parameters plus some settings >>>>>> based on static values. For example: >>>>>> >>>>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s >>>>>> >>>>>> I cannot see any use of <processing time> which ought to be at least >>>>>> half of the batch interval to have the correct margins (confidence >>>>>> level). If >>>>>> you are going to have additional indicators why not look at scheduling >>>>>> delay as well. Moreover most of the needed statistics are also available >>>>>> to >>>>>> set accurate values. My inclination is that this is a great effort >>>>>> but we ought to utilise the historical statistics collected under >>>>>> checkpointing directory to get more accurate statistics. I will >>>>>> review the design document in duew course >>>>>> >>>>>> HTH >>>>>> >>>>>> Mich Talebzadeh, >>>>>> Solutions Architect/Engineering Lead >>>>>> London >>>>>> United Kingdom >>>>>> >>>>>> >>>>>> view my Linkedin profile >>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFr9YSZnw$> >>>>>> >>>>>> >>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLEPx44C1w$> >>>>>> >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi >>>>>> <pkotikalap...@twilio.com.invalid> wrote: >>>>>> >>>>>> Hi Spark Dev, >>>>>> >>>>>> I have extended traditional DRA to work for structured streaming >>>>>> use-case. >>>>>> >>>>>> Here is an initial Implementation draft PR >>>>>> https://github.com/apache/spark/pull/42352 >>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLHLe7WCUw$> >>>>>> and >>>>>> design doc: >>>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing >>>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFAjJfilg$> >>>>>> >>>>>> Please review and let me know what you think. >>>>>> >>>>>> Thank you, >>>>>> >>>>>> Pavan >>>>>> >>>>>>