Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Pavan Kotikalapudi Sun, 12 Nov 2023 17:28:28 -0800

Here is an initial implementation draft PR:
https://github.com/apache/spark/pull/42352 and the design doc:
https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing



On Sun, Nov 12, 2023 at 5:24 PM Pavan Kotikalapudi <pkotikalap...@twilio.com>
wrote:

> Hi Dev community,
>
> Just bumping to see if there are more reviews to evaluate this idea of
> adding auto-scaling to structured streaming.
>
> Thanks again,
>
> Pavan
>
> On Wed, Aug 23, 2023 at 2:49 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Thanks for the review Mich.
>>
>> I have updated Q4 to be as concise as possible and moved
>> the detailed explanation to the Appendix.
>>
>> Here is the updated answer to Q4:
>> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg>
>>
>> Thank you,
>>
>> Pavan
>>
>> On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Pavan,
>>>
>>> I started reading your SPIP but have difficulty understanding it in
>>> detail.
>>>
>>> Specifically under Q4, "What is new in your approach and why do you
>>> think it will be successful?", I believe it would be better to remove the
>>> plots and focus on "what this proposed solution is going to add to the
>>> current play". At this stage a concise briefing would be appreciated and
>>> the specific plots should be left to the Appendix.
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi <
>>> pkotikalap...@twilio.com> wrote:
>>>
>>>> IMO ML might be good for the cluster scheduler, but for the core DRA
>>>> algorithm of SSS I believe we should start with some primitives of
>>>> Structured Streaming. I would love to get some reviews on the doc and
>>>> opinions on the feasibility of the solution.
>>>>
>>>> We have seen considerable savings using this solution in our team. We
>>>> would like to hear from the dev community whether they are interested
>>>> in DRA for Structured Streaming.
>>>>
>>>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thank you for your comments.
>>>>>
>>>>> My vision of integrating machine learning (ML) into Spark Structured
>>>>> Streaming (SSS) for capacity planning and performance optimization seems
>>>>> promising. By leveraging ML techniques, I believe that we can
>>>>> potentially create predictive models that enhance the efficiency and
>>>>> resource allocation of the data processing pipelines. Here are some
>>>>> potential benefits and considerations for adding ML to SSS for capacity
>>>>> planning. However, I stand to be corrected.
>>>>>
>>>>>    1. *Predictive Capacity Planning:* ML models can analyze historical
>>>>>    data (that we discussed already), workloads, and trends to predict
>>>>>    future resource needs accurately. This enables proactive scaling and
>>>>>    allocation of resources, ensuring optimal performance during
>>>>>    high-demand periods, such as times of heavy trading.
>>>>>    2. *Real-time Decision Making:* ML can be used to make real-time
>>>>>    decisions on resource allocation (software and cluster) based on
>>>>>    current data and conditions, allowing for dynamic adjustments to meet
>>>>>    the processing demands.
>>>>>    3. *Complex Data Analysis:* In a heterogeneous setup involving
>>>>>    multiple databases, ML can analyze various factors like data read and
>>>>>    write times from different databases, data volumes, and data
>>>>>    distribution patterns to optimize the overall data processing flow.
>>>>>    4. *Anomaly Detection:* ML models can identify unusual patterns or
>>>>>    performance deviations, alerting us to potential issues before they
>>>>>    impact the system.
>>>>>    5. *Integration with Monitoring:* ML models can work alongside
>>>>>    monitoring tools, gathering real-time data on various performance
>>>>>    metrics, and using this data for making intelligent decisions on
>>>>>    capacity and resource allocation.
>>>>>
>>>>> However, there are some important considerations to keep in mind:
>>>>>
>>>>>    1. *Model Training:* ML models require training and validation using
>>>>>    relevant data. Our DS colleagues need to define appropriate features,
>>>>>    select the right ML algorithms, and fine-tune the model parameters to
>>>>>    achieve optimal performance.
>>>>>    2. *Complexity:* Integrating ML adds complexity to our architecture.
>>>>>    Moreover, we need to have the necessary expertise in both Spark
>>>>>    Structured Streaming and machine learning to design, implement, and
>>>>>    maintain the system effectively.
>>>>>    3. *Resource Overhead:* ML algorithms can be resource-intensive. We
>>>>>    ought to consider the additional computational requirements, especially
>>>>>    during the model training and inference phases.
>>>>>
>>>>> In summary, this idea of utilizing ML for capacity planning in
>>>>> Spark Structured Streaming could hold significant potential for
>>>>> improving system performance and resource utilization. Having said
>>>>> that, I totally agree that we need to evaluate the feasibility, potential
>>>>> benefits, and challenges, and we will need to involve experts in both
>>>>> Spark and machine learning to ensure a successful outcome.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Solutions Architect/Engineering Lead
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson <
>>>>> martin.anders...@kambi.com> wrote:
>>>>>
>>>>>> IMO, using any kind of machine learning or AI for DRA is overkill.
>>>>>> The effort involved would be considerable and likely counterproductive,
>>>>>> compared to a more conventional approach of comparing the rate of
>>>>>> incoming stream data with the effort of handling previous data rates.
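
The conventional approach described above can be sketched roughly as follows. This is only an illustration, not code from the PR: the `BatchStats` type, the `scaling_decision` helper, and the two ratio thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BatchStats:
    """Per-micro-batch statistics (hypothetical field names)."""
    input_rows_per_sec: float      # rate at which data arrived
    processed_rows_per_sec: float  # rate at which the batch was handled


def scaling_decision(history, scale_up_ratio=1.0, scale_down_ratio=0.5):
    """Return 'up', 'down' or 'hold' by comparing recent input and processing rates.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if not history:
        return "hold"
    avg_in = sum(b.input_rows_per_sec for b in history) / len(history)
    avg_out = sum(b.processed_rows_per_sec for b in history) / len(history)
    if avg_out <= 0:
        # Nothing is being processed: scale up only if data is arriving.
        return "up" if avg_in > 0 else "hold"
    load = avg_in / avg_out
    if load > scale_up_ratio:
        return "up"      # data arrives faster than it is handled
    if load < scale_down_ratio:
        return "down"    # plenty of headroom, executors can be released
    return "hold"
```

Averaging over a short window of recent batches, rather than reacting to a single batch, is one simple way to avoid oscillating between scale-up and scale-down.
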
>>>>>> ------------------------------
>>>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> *Sent:* Tuesday, August 8, 2023 19:59
>>>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com>
>>>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>>>>>> *Subject:* Re: Dynamic resource allocation for structured streaming
>>>>>> [SPARK-24815]
>>>>>>
>>>>>> I am currently contemplating and sharing my thoughts openly.
>>>>>> Considering our reliance on previously collected statistics (as mentioned
>>>>>> earlier), it raises the question of why we couldn't integrate certain
>>>>>> machine learning elements into Spark Structured Streaming. While this
>>>>>> might slightly deviate from our current topic, I am not an expert in
>>>>>> machine learning. However, there are individuals who possess the
>>>>>> expertise to assist us in exploring this avenue.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <
>>>>>> pkotikalap...@twilio.com> wrote:
>>>>>>
>>>>>> Listeners are the best source of information for the allocation
>>>>>> manager, AFAIK. It already uses a SparkListener
>>>>>> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640>,
>>>>>> which we can use to extract more information (such as processing
>>>>>> times).
>>>>>> The listener with more information about the streaming query resides
>>>>>> in the sql module
>>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala>
>>>>>> though.
>>>>>>
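
As a rough illustration of the kind of information a streaming query listener exposes, the sketch below pulls the processing-time readings out of the JSON form of a query progress event. The field names (`batchId`, `durationMs.triggerExecution`, `inputRowsPerSecond`, `processedRowsPerSecond`) follow the progress JSON that Structured Streaming reports; the `summarize_progress` helper and the sample values are hypothetical and exist only for this example.

```python
import json


def summarize_progress(progress_json: str) -> dict:
    """Extract timing and rate readings from a query-progress JSON string."""
    progress = json.loads(progress_json)
    durations = progress.get("durationMs", {})
    return {
        "batch_id": progress.get("batchId"),
        # Total wall-clock time spent on the micro-batch, in milliseconds.
        "trigger_execution_ms": durations.get("triggerExecution"),
        "input_rows_per_sec": progress.get("inputRowsPerSecond"),
        "processed_rows_per_sec": progress.get("processedRowsPerSecond"),
    }


# Made-up sample event, shaped like a progress report.
sample_event = json.dumps({
    "batchId": 42,
    "inputRowsPerSecond": 1200.0,
    "processedRowsPerSecond": 900.0,
    "durationMs": {"addBatch": 4100, "triggerExecution": 4500},
})

summary = summarize_progress(sample_event)
```

In a real job the JSON string would come from the progress object handed to the listener's progress callback instead of a hand-built sample.
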
>>>>>> Thanks
>>>>>>
>>>>>> Pavan
>>>>>>
>>>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Pavan or anyone else,
>>>>>>
>>>>>> Is there any way to access the metrics displayed on the Spark UI, for
>>>>>> example the readings for processing time?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <
>>>>>> pkotikalap...@twilio.com> wrote:
>>>>>>
>>>>>> Thanks for the review Mich,
>>>>>>
>>>>>> Yes, the configuration parameters we end up setting would be based on
>>>>>> the trigger interval.
>>>>>>
>>>>>> > If you are going to have additional indicators why not look at
>>>>>> > scheduling delay as well
>>>>>> Yes. The implementation is based on scheduling delays, computed not
>>>>>> from the pending tasks of the current stage but from the pending tasks
>>>>>> of all the stages in a micro-batch
>>>>>> <https://github.com/apache/spark/pull/42352/files#diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025>
>>>>>> (hence the trigger interval).
>>>>>>
>>>>>> > we ought to utilise the historical statistics collected under the
>>>>>> > checkpointing directory to get more accurate statistics
>>>>>> You are right! This is just a simple implementation based on one
>>>>>> factor. We should also look into other indicators if that would
>>>>>> help build a better scaling algorithm.
>>>>>>
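
The idea in this reply, tying the backlog timeout to the trigger interval and counting pending tasks across every stage of the micro-batch, could look roughly like the sketch below. It is not the logic from the PR: the helper names, the `fraction` parameter, and its default of 0.5 are assumptions made up for illustration.

```python
def backlog_timeout_seconds(trigger_interval_s: float, fraction: float = 0.5) -> float:
    """Derive a scheduler backlog timeout as a fraction of the trigger interval.

    `fraction` is a hypothetical tuning knob, not a Spark setting.
    """
    if trigger_interval_s <= 0:
        raise ValueError("trigger interval must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return trigger_interval_s * fraction


def micro_batch_backlog(pending_tasks_by_stage: dict) -> int:
    """Sum pending tasks over all stages of the micro-batch, not just the current one."""
    return sum(pending_tasks_by_stage.values())


def should_request_executors(pending_tasks_by_stage: dict,
                             backlog_age_s: float,
                             trigger_interval_s: float,
                             fraction: float = 0.5) -> bool:
    """Scale up only if a backlog exists and has outlived the derived timeout."""
    timeout = backlog_timeout_seconds(trigger_interval_s, fraction)
    return micro_batch_backlog(pending_tasks_by_stage) > 0 and backlog_age_s >= timeout
```

For example, with a 10-second trigger and the default fraction, a backlog that has persisted for 6 seconds would trigger a request, while one that is 2 seconds old would not.
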
>>>>>> Thank you,
>>>>>>
>>>>>> Pavan
>>>>>>
>>>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I glanced over the design doc.
>>>>>>
>>>>>> You are providing certain configuration parameters plus some settings
>>>>>> based on static values. For example:
>>>>>>
>>>>>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>>>>>
>>>>>> I cannot see any use of <processing time>, which ought to be at least
>>>>>> half of the batch interval to have the correct margins (confidence
>>>>>> level). If you are going to have additional indicators, why not look at
>>>>>> scheduling delay as well? Moreover, most of the needed statistics are
>>>>>> also available to set accurate values. My inclination is that this is a
>>>>>> great effort, but we ought to utilise the historical statistics collected
>>>>>> under the checkpointing directory to get more accurate statistics. I will
>>>>>> review the design document in due course.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>>>> <pkotikalap...@twilio.com.invalid> wrote:
>>>>>>
>>>>>> Hi Spark Dev,
>>>>>>
>>>>>> I have extended traditional DRA to work for the Structured Streaming
>>>>>> use case.
>>>>>>
>>>>>> Here is an initial implementation draft PR:
>>>>>> https://github.com/apache/spark/pull/42352
>>>>>> and the design doc:
>>>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>>>>>
>>>>>> Please review and let me know what you think.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Pavan
>>>>>>
>>>>>>
