Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Pavan Kotikalapudi Mon, 01 Jan 2024 02:28:12 -0800

Hi PMC members,

Bumping this idea for one last time to see if there are any approvals to
take it forward.


Here is an initial Implementation draft PR
https://github.com/apache/spark/pull/42352 and design doc:
https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing


Thank you,

Pavan

On Mon, Nov 13, 2023 at 6:57 AM Pavan Kotikalapudi <pkotikalap...@twilio.com>
wrote:

>
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
>
> On Sun, Nov 12, 2023 at 5:24 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Hi Dev community,
>>
>> Just bumping to see if there are more reviews to evaluate this idea of
>> adding auto-scaling to structured streaming.
>>
>> Thanks again,
>>
>> Pavan
>>
>> On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi <
>> pkotikalap...@twilio.com> wrote:
>>
>>> Thanks for the review Mich.
>>>
>>> I have updated the Q4 with as concise information as possible and left
>>> the detailed explanation to Appendix.
>>>
>>> here is the updated answer to the Q4
>>> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg>
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>> On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Pavan,
>>>>
>>>> I started reading your SPIP but have difficulty understanding it in
>>>> detail.
>>>>
>>>> Specifically under Q4, " What is new in your approach and why do you
>>>> think it will be successful?", I believe it would be better to remove the
>>>> plots and focus on "what this proposed solution is going to add to the
>>>> current play". At this stage a concise briefing would be appreciated and
>>>> the specific plots should be left to the Appendix.
>>>>
>>>> HTH
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>> Distinguished Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wSpQtgviw$>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wR3SukiIw$>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi <
>>>> pkotikalap...@twilio.com> wrote:
>>>>
>>>>> IMO ML might be good for cluster scheduler but for the core DRA
>>>>> algorithm of SSS I believe we should start with some primitives of
>>>>> Structured streaming. I would love to get some reviews on the doc and
>>>>> opinions on the feasibility of the solution.
>>>>>
>>>>> We have seen quite some savings using this solution in our team, Would
>>>>> like to listen to the dev community to see if they are looking
>>>>> for/interested in DRA for structured streaming.
>>>>>
>>>>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for your comments.
>>>>>>
>>>>>> My vision of integrating machine learning (ML) into Spark Structured
>>>>>> Streaming (SSS) for capacity planning and performance optimization seems 
>>>>>> to
>>>>>> be promising. By leveraging ML techniques, I believe that we can
>>>>>> potentially create predictive models that enhance the efficiency and
>>>>>> resource allocation of the data processing pipelines. Here are some
>>>>>> potential benefits and considerations for adding ML to SSS for capacity
>>>>>> planning. However, I stand corrected
>>>>>>
>>>>>>    1.
>>>>>>
>>>>>>    *Predictive Capacity Planning:* ML models can analyze historical
>>>>>>    data (that we discussed already), workloads, and trends to predict 
>>>>>> future
>>>>>>    resource needs accurately. This enables proactive scaling and 
>>>>>> allocation of
>>>>>>    resources, ensuring optimal performance during high-demand periods, 
>>>>>> such as
>>>>>>    times of high trades.
>>>>>>    2.
>>>>>>
>>>>>>    *Real-time Decision Making: *ML can be used to make real-time
>>>>>>    decisions on resource allocation (software and cluster) based on 
>>>>>> current
>>>>>>    data and conditions, allowing for dynamic adjustments to meet the
>>>>>>    processing demands.
>>>>>>    3.
>>>>>>
>>>>>>    *Complex Data Analysis: *In a heterogeneous setup involving
>>>>>>    multiple databases, ML can analyze various factors like data read and 
>>>>>> write
>>>>>>    times from different databases, data volumes, and data distribution
>>>>>>    patterns to optimize the overall data processing flow.
>>>>>>    4.
>>>>>>
>>>>>>    *Anomaly Detection: *ML models can identify unusual patterns or
>>>>>>    performance deviations, alerting us to potential issues before they 
>>>>>> impact
>>>>>>    the system.
>>>>>>    5.
>>>>>>
>>>>>>    Integration with Monitoring: ML models can work alongside
>>>>>>    monitoring tools, gathering real-time data on various performance 
>>>>>> metrics,
>>>>>>    and using this data for making intelligent decisions on capacity and
>>>>>>    resource allocation.
>>>>>>
>>>>>> However, there are some important considerations to keep in mind:
>>>>>>
>>>>>>    1.
>>>>>>
>>>>>>    *Model Training: *ML models require training and validation using
>>>>>>    relevant data. Our DS colleagues need to define appropriate features,
>>>>>>    select the right ML algorithms, and fine-tune the model parameters to
>>>>>>    achieve optimal performance.
>>>>>>    2.
>>>>>>
>>>>>>    *Complexity:* Integrating ML adds complexity to our architecture.
>>>>>>    Moreover, we need to have the necessary expertise in both Spark 
>>>>>> Structured
>>>>>>    Streaming and machine learning to design, implement, and maintain the
>>>>>>    system effectively.
>>>>>>    3.
>>>>>>
>>>>>>    *Resource Overhead: *ML algorithms can be resource-intensive. We
>>>>>>    ought to consider the additional computational requirements, 
>>>>>> especially
>>>>>>    during the model training and inference phases.
>>>>>>    4.
>>>>>>
>>>>>>    In summary, this idea of utilizing ML for capacity planning in
>>>>>>    Spark Structured Streaming can possibly hold significant potential for
>>>>>>    improving system performance and resource utilization. Having said 
>>>>>> that, I
>>>>>>    totally agree that we need to evaluate the feasibility, potential 
>>>>>> benefits,
>>>>>>    and challenges and we will need involving experts in both Spark and 
>>>>>> machine
>>>>>>    learning to ensure a successful outcome.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson <
>>>>>> martin.anders...@kambi.com> wrote:
>>>>>>
>>>>>>> IMO, using any kind of machine learning or AI for DRA is overkill.
>>>>>>> The effort involved would be considerable and likely counterproductive,
>>>>>>> compared to a more conventional approach of comparing the rate of 
>>>>>>> incoming
>>>>>>> stream data with the effort of handling previous data rates.
>>>>>>> ------------------------------
>>>>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> *Sent:* Tuesday, August 8, 2023 19:59
>>>>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com>
>>>>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>>>>>>> *Subject:* Re: Dynamic resource allocation for structured streaming
>>>>>>> [SPARK-24815]
>>>>>>>
>>>>>>>
>>>>>>> EXTERNAL SENDER. Do not click links or open attachments unless you
>>>>>>> recognize the sender and know the content is safe. DO NOT provide your
>>>>>>> username or password.
>>>>>>>
>>>>>>> I am currently contemplating and sharing my thoughts openly.
>>>>>>> Considering our reliance on previously collected statistics (as 
>>>>>>> mentioned
>>>>>>> earlier), it raises the question of why we couldn't integrate certain
>>>>>>> machine learning elements into Spark Structured Streaming? While this 
>>>>>>> might
>>>>>>> slightly deviate from our current topic, I am not an expert in machine
>>>>>>> learning. However, there are individuals who possess the expertise to
>>>>>>> assist us in exploring this avenue.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <
>>>>>>> pkotikalap...@twilio.com> wrote:
>>>>>>>
>>>>>>> Listeners are the best resources to the allocation manager  afaik...
>>>>>>> It already has SparkListener
>>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala*L640__;Iw!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-YRkCAu0w$>
>>>>>>>  that
>>>>>>> it utilizes. We can use it to extract more information (like processing
>>>>>>> times).
>>>>>>> The one with more information regarding streaming query resides in sql
>>>>>>> module
>>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-Y_DIYqaw$>
>>>>>>> though.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Pavan
>>>>>>>
>>>>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Pavan or anyone else
>>>>>>>
>>>>>>> Is there any way one access the matrix displayed on SparkGUI? For
>>>>>>> example the readings for processing time? Can these be acessed?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> For example,
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6VrCySTg$>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d4r4xOqSg$>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <
>>>>>>> pkotikalap...@twilio.com> wrote:
>>>>>>>
>>>>>>> Thanks for the review Mich,
>>>>>>>
>>>>>>> Yes, the configuration parameters we end up setting would be based
>>>>>>> on the trigger interval.
>>>>>>>
>>>>>>> > If you are going to have additional indicators why not look at
>>>>>>> scheduling delay as well
>>>>>>> Yes. The implementation is based on scheduling delays, not for
>>>>>>> pending tasks of the current stage but rather pending tasks of all
>>>>>>> the stages in a micro-batch
>>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352/files*diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025__;Iw!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6feoFH2Q$>
>>>>>>>  (hence
>>>>>>> trigger interval).
>>>>>>>
>>>>>>> > we ought to utilise the historical statistics collected under the
>>>>>>> checkpointing directory to get more accurate statistics
>>>>>>> You are right! This is just a simple implementation based on one
>>>>>>> factor, we should also look into other indicators as well If that would
>>>>>>> help build a better scaling algorithm.
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Pavan
>>>>>>>
>>>>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I glanced over the design doc.
>>>>>>>
>>>>>>> You are providing certain configuration parameters plus some
>>>>>>> settings based on static values. For example:
>>>>>>>
>>>>>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s
>>>>>>>
>>>>>>> I cannot see any use of <processing time> which ought to be at least
>>>>>>> half of the batch interval to have the correct margins (confidence 
>>>>>>> level). If
>>>>>>> you are going to have additional indicators why not look at scheduling
>>>>>>> delay as well. Moreover most of the needed statistics are also 
>>>>>>> available to
>>>>>>> set accurate values. My inclination is that this is a great effort
>>>>>>> but we ought to utilise the historical statistics collected under
>>>>>>> checkpointing directory to get more accurate statistics. I will
>>>>>>> review the design document in duew course
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFr9YSZnw$>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLEPx44C1w$>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>>>>> <pkotikalap...@twilio.com.invalid> wrote:
>>>>>>>
>>>>>>> Hi Spark Dev,
>>>>>>>
>>>>>>> I have extended traditional DRA to work for structured streaming
>>>>>>> use-case.
>>>>>>>
>>>>>>> Here is an initial Implementation draft PR
>>>>>>> https://github.com/apache/spark/pull/42352
>>>>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLHLe7WCUw$>
>>>>>>>  and
>>>>>>> design doc:
>>>>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>>>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFAjJfilg$>
>>>>>>>
>>>>>>> Please review and let me know what you think.
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Pavan
>>>>>>>
>>>>>>>

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Reply via email to