Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Pavan Kotikalapudi Wed, 23 Aug 2023 14:50:31 -0700

Thanks for the review Mich.

I have updated the Q4 with as concise information as possible and left the
detailed explanation to Appendix.


here is the updated answer to the Q4
<https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg>

Thank you,

Pavan

On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <[email protected]>
wrote:

> Hi Pavan,
>
> I started reading your SPIP but have difficulty understanding it in detail.
>
> Specifically under Q4, " What is new in your approach and why do you
> think it will be successful?", I believe it would be better to remove the
> plots and focus on "what this proposed solution is going to add to the
> current play". At this stage a concise briefing would be appreciated and
> the specific plots should be left to the Appendix.
>
> HTH
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wSpQtgviw$>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!Z1-Qlb9LL5r97D1tGWz_pKDVDYm-S_n99e_jhraM5XA4B058OHmw47z_FmbEVHeXdgLqEkkvS4W88hGkTBlB4wR3SukiIw$>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi <[email protected]>
> wrote:
>
>> IMO ML might be good for cluster scheduler but for the core DRA algorithm
>> of SSS I believe we should start with some primitives of Structured
>> streaming. I would love to get some reviews on the doc and opinions on the
>> feasibility of the solution.
>>
>> We have seen quite some savings using this solution in our team, Would
>> like to listen to the dev community to see if they are looking
>> for/interested in DRA for structured streaming.
>>
>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh <
>> [email protected]> wrote:
>>
>>> Thank you for your comments.
>>>
>>> My vision of integrating machine learning (ML) into Spark Structured
>>> Streaming (SSS) for capacity planning and performance optimization seems to
>>> be promising. By leveraging ML techniques, I believe that we can
>>> potentially create predictive models that enhance the efficiency and
>>> resource allocation of the data processing pipelines. Here are some
>>> potential benefits and considerations for adding ML to SSS for capacity
>>> planning. However, I stand corrected
>>>
>>>    1.
>>>
>>>    *Predictive Capacity Planning:* ML models can analyze historical
>>>    data (that we discussed already), workloads, and trends to predict future
>>>    resource needs accurately. This enables proactive scaling and allocation 
>>> of
>>>    resources, ensuring optimal performance during high-demand periods, such 
>>> as
>>>    times of high trades.
>>>    2.
>>>
>>>    *Real-time Decision Making: *ML can be used to make real-time
>>>    decisions on resource allocation (software and cluster) based on current
>>>    data and conditions, allowing for dynamic adjustments to meet the
>>>    processing demands.
>>>    3.
>>>
>>>    *Complex Data Analysis: *In a heterogeneous setup involving multiple
>>>    databases, ML can analyze various factors like data read and write times
>>>    from different databases, data volumes, and data distribution patterns to
>>>    optimize the overall data processing flow.
>>>    4.
>>>
>>>    *Anomaly Detection: *ML models can identify unusual patterns or
>>>    performance deviations, alerting us to potential issues before they 
>>> impact
>>>    the system.
>>>    5.
>>>
>>>    Integration with Monitoring: ML models can work alongside monitoring
>>>    tools, gathering real-time data on various performance metrics, and using
>>>    this data for making intelligent decisions on capacity and resource
>>>    allocation.
>>>
>>> However, there are some important considerations to keep in mind:
>>>
>>>    1.
>>>
>>>    *Model Training: *ML models require training and validation using
>>>    relevant data. Our DS colleagues need to define appropriate features,
>>>    select the right ML algorithms, and fine-tune the model parameters to
>>>    achieve optimal performance.
>>>    2.
>>>
>>>    *Complexity:* Integrating ML adds complexity to our architecture.
>>>    Moreover, we need to have the necessary expertise in both Spark 
>>> Structured
>>>    Streaming and machine learning to design, implement, and maintain the
>>>    system effectively.
>>>    3.
>>>
>>>    *Resource Overhead: *ML algorithms can be resource-intensive. We
>>>    ought to consider the additional computational requirements, especially
>>>    during the model training and inference phases.
>>>    4.
>>>
>>>    In summary, this idea of utilizing ML for capacity planning in Spark
>>>    Structured Streaming can possibly hold significant potential for 
>>> improving
>>>    system performance and resource utilization. Having said that, I totally
>>>    agree that we need to evaluate the feasibility, potential benefits, and
>>>    challenges and we will need involving experts in both Spark and machine
>>>    learning to ensure a successful outcome.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson <
>>> [email protected]> wrote:
>>>
>>>> IMO, using any kind of machine learning or AI for DRA is overkill. The
>>>> effort involved would be considerable and likely counterproductive,
>>>> compared to a more conventional approach of comparing the rate of incoming
>>>> stream data with the effort of handling previous data rates.
>>>> ------------------------------
>>>> *From:* Mich Talebzadeh <[email protected]>
>>>> *Sent:* Tuesday, August 8, 2023 19:59
>>>> *To:* Pavan Kotikalapudi <[email protected]>
>>>> *Cc:* [email protected] <[email protected]>
>>>> *Subject:* Re: Dynamic resource allocation for structured streaming
>>>> [SPARK-24815]
>>>>
>>>>
>>>> EXTERNAL SENDER. Do not click links or open attachments unless you
>>>> recognize the sender and know the content is safe. DO NOT provide your
>>>> username or password.
>>>>
>>>> I am currently contemplating and sharing my thoughts openly.
>>>> Considering our reliance on previously collected statistics (as mentioned
>>>> earlier), it raises the question of why we couldn't integrate certain
>>>> machine learning elements into Spark Structured Streaming? While this might
>>>> slightly deviate from our current topic, I am not an expert in machine
>>>> learning. However, there are individuals who possess the expertise to
>>>> assist us in exploring this avenue.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-bP2VmxTg$>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-as0BFUVQ$>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <
>>>> [email protected]> wrote:
>>>>
>>>> Listeners are the best resources to the allocation manager  afaik... It
>>>> already has SparkListener
>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala*L640__;Iw!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-YRkCAu0w$>
>>>>  that
>>>> it utilizes. We can use it to extract more information (like processing
>>>> times).
>>>> The one with more information regarding streaming query resides in sql
>>>> module
>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala__;!!NCc8flgU!ag4RKtjaus5ggrkrgIaT1uG75X7gM3CjxLhkaIZMA5VGjc7h7N3BHXkBHRaR3T8ludHCpxKNgQ9ugixgI3MGy-Y_DIYqaw$>
>>>> though.
>>>>
>>>> Thanks
>>>>
>>>> Pavan
>>>>
>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>> Hi Pavan or anyone else
>>>>
>>>> Is there any way one access the matrix displayed on SparkGUI? For
>>>> example the readings for processing time? Can these be acessed?
>>>>
>>>> Thanks
>>>>
>>>> For example,
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6VrCySTg$>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d4r4xOqSg$>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <
>>>> [email protected]> wrote:
>>>>
>>>> Thanks for the review Mich,
>>>>
>>>> Yes, the configuration parameters we end up setting would be based on
>>>> the trigger interval.
>>>>
>>>> > If you are going to have additional indicators why not look at
>>>> scheduling delay as well
>>>> Yes. The implementation is based on scheduling delays, not for pending
>>>> tasks of the current stage but rather pending tasks of all the stages
>>>> in a micro-batch
>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352/files*diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025__;Iw!!NCc8flgU!d-qX4RylsnHucGkE4OdsO8agaKMFV59tVQnWZL1FbbZLVLWVUWgWmiiKC1Mvyy-796X-uP5XZfjLEbrVfe771d6feoFH2Q$>
>>>>  (hence
>>>> trigger interval).
>>>>
>>>> > we ought to utilise the historical statistics collected under the
>>>> checkpointing directory to get more accurate statistics
>>>> You are right! This is just a simple implementation based on one
>>>> factor, we should also look into other indicators as well If that would
>>>> help build a better scaling algorithm.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I glanced over the design doc.
>>>>
>>>> You are providing certain configuration parameters plus some settings
>>>> based on static values. For example:
>>>>
>>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s
>>>>
>>>> I cannot see any use of <processing time> which ought to be at least
>>>> half of the batch interval to have the correct margins (confidence level). 
>>>> If
>>>> you are going to have additional indicators why not look at scheduling
>>>> delay as well. Moreover most of the needed statistics are also available to
>>>> set accurate values. My inclination is that this is a great effort but
>>>> we ought to utilise the historical statistics collected under
>>>> checkpointing directory to get more accurate statistics. I will review
>>>> the design document in duew course
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://urldefense.com/v3/__https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFr9YSZnw$>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> <https://urldefense.com/v3/__https://en.everybodywiki.com/Mich_Talebzadeh__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLEPx44C1w$>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>> <[email protected]> wrote:
>>>>
>>>> Hi Spark Dev,
>>>>
>>>> I have extended traditional DRA to work for structured streaming
>>>> use-case.
>>>>
>>>> Here is an initial Implementation draft PR
>>>> https://github.com/apache/spark/pull/42352
>>>> <https://urldefense.com/v3/__https://github.com/apache/spark/pull/42352__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLHLe7WCUw$>
>>>>  and
>>>> design doc:
>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>>> <https://urldefense.com/v3/__https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing__;!!NCc8flgU!blQ5zGotPbReMPXKaZw50BES4V_1AKqHv6bIxHVlc0QfY9iisFjT-u0be3CR6C6-41dtKLX5Ija0-EmAYfkcxLFAjJfilg$>
>>>>
>>>> Please review and let me know what you think.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>>

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Reply via email to