Thanks for the review, Mich. I have updated Q4 to be as concise as possible and moved the detailed explanation to the Appendix.
Here is the updated answer to Q4:
<https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.xe0x4i9gc1dg>

Thank you,
Pavan

On Wed, Aug 23, 2023 at 2:46 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Pavan,
>
> I started reading your SPIP but have difficulty understanding it in detail.
>
> Specifically under Q4, "What is new in your approach and why do you
> think it will be successful?", I believe it would be better to remove the
> plots and focus on "what this proposed solution is going to add to the
> current play". At this stage a concise briefing would be appreciated and
> the specific plots should be left to the Appendix.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sun, 20 Aug 2023 at 07:40, Pavan Kotikalapudi <pkotikalap...@twilio.com>
> wrote:
>
>> IMO ML might be good for the cluster scheduler, but for the core DRA
>> algorithm of SSS I believe we should start with some primitives of
>> Structured Streaming. I would love to get some reviews on the doc and
>> opinions on the feasibility of the solution.
>>
>> We have seen quite some savings using this solution in our team. I would
>> like to hear from the dev community whether they are looking
>> for/interested in DRA for structured streaming.
>>
>> On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thank you for your comments.
>>>
>>> My vision of integrating machine learning (ML) into Spark Structured
>>> Streaming (SSS) for capacity planning and performance optimization seems to
>>> be promising. By leveraging ML techniques, I believe that we can
>>> potentially create predictive models that enhance the efficiency and
>>> resource allocation of the data processing pipelines. Here are some
>>> potential benefits and considerations for adding ML to SSS for capacity
>>> planning. However, I stand corrected.
>>>
>>> 1. *Predictive Capacity Planning:* ML models can analyze historical
>>>    data (that we discussed already), workloads, and trends to predict
>>>    future resource needs accurately. This enables proactive scaling and
>>>    allocation of resources, ensuring optimal performance during
>>>    high-demand periods, such as times of high trades.
>>> 2. *Real-time Decision Making:* ML can be used to make real-time
>>>    decisions on resource allocation (software and cluster) based on
>>>    current data and conditions, allowing for dynamic adjustments to meet
>>>    the processing demands.
>>> 3. *Complex Data Analysis:* In a heterogeneous setup involving multiple
>>>    databases, ML can analyze various factors like data read and write
>>>    times from different databases, data volumes, and data distribution
>>>    patterns to optimize the overall data processing flow.
>>> 4. *Anomaly Detection:* ML models can identify unusual patterns or
>>>    performance deviations, alerting us to potential issues before they
>>>    impact the system.
>>> 5. *Integration with Monitoring:* ML models can work alongside
>>>    monitoring tools, gathering real-time data on various performance
>>>    metrics, and using this data for making intelligent decisions on
>>>    capacity and resource allocation.
>>>
>>> However, there are some important considerations to keep in mind:
>>>
>>> 1. *Model Training:* ML models require training and validation using
>>>    relevant data. Our DS colleagues need to define appropriate features,
>>>    select the right ML algorithms, and fine-tune the model parameters to
>>>    achieve optimal performance.
>>> 2. *Complexity:* Integrating ML adds complexity to our architecture.
>>>    Moreover, we need to have the necessary expertise in both Spark
>>>    Structured Streaming and machine learning to design, implement, and
>>>    maintain the system effectively.
>>> 3. *Resource Overhead:* ML algorithms can be resource-intensive. We
>>>    ought to consider the additional computational requirements,
>>>    especially during the model training and inference phases.
>>>
>>> In summary, this idea of utilizing ML for capacity planning in Spark
>>> Structured Streaming may hold significant potential for improving
>>> system performance and resource utilization. Having said that, I totally
>>> agree that we need to evaluate the feasibility, potential benefits, and
>>> challenges, and we will need to involve experts in both Spark and machine
>>> learning to ensure a successful outcome.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>> On Mon, 14 Aug 2023 at 14:58, Martin Andersson <
>>> martin.anders...@kambi.com> wrote:
>>>
>>>> IMO, using any kind of machine learning or AI for DRA is overkill. The
>>>> effort involved would be considerable and likely counterproductive,
>>>> compared to a more conventional approach of comparing the rate of incoming
>>>> stream data with the effort of handling previous data rates.
>>>> ------------------------------
>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> *Sent:* Tuesday, August 8, 2023 19:59
>>>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com>
>>>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>>>> *Subject:* Re: Dynamic resource allocation for structured streaming
>>>> [SPARK-24815]
>>>>
>>>> I am currently contemplating and sharing my thoughts openly.
>>>> Considering our reliance on previously collected statistics (as mentioned
>>>> earlier), it raises the question of why we couldn't integrate certain
>>>> machine learning elements into Spark Structured Streaming. While this might
>>>> slightly deviate from our current topic, I am not an expert in machine
>>>> learning. However, there are individuals who possess the expertise to
>>>> assist us in exploring this avenue.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <
>>>> pkotikalap...@twilio.com> wrote:
>>>>
>>>> As far as I know, listeners are the best source of information for the
>>>> allocation manager. It already uses a SparkListener
>>>> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640>.
>>>> We can use it to extract more information (like processing times).
>>>> The one with more information regarding streaming queries resides in the
>>>> sql module
>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala>
>>>> though.
>>>>
>>>> Thanks
>>>>
>>>> Pavan
>>>>
>>>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi Pavan or anyone else
>>>>
>>>> Is there any way to access the metrics displayed on the Spark GUI, for
>>>> example the readings for processing time? Can these be accessed?
>>>>
>>>> Thanks
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <
>>>> pkotikalap...@twilio.com> wrote:
>>>>
>>>> Thanks for the review Mich,
>>>>
>>>> Yes, the configuration parameters we end up setting would be based on
>>>> the trigger interval.
>>>>
>>>> > If you are going to have additional indicators why not look at
>>>> > scheduling delay as well
>>>> Yes. The implementation is based on scheduling delays, computed not over
>>>> the pending tasks of the current stage but over the pending tasks of all
>>>> the stages in a micro-batch
>>>> <https://github.com/apache/spark/pull/42352/files#diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025>
>>>> (hence trigger interval).
>>>>
>>>> > we ought to utilise the historical statistics collected under the
>>>> > checkpointing directory to get more accurate statistics
>>>> You are right! This is just a simple implementation based on one
>>>> factor; we should also look into other indicators if that would
>>>> help build a better scaling algorithm.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I glanced over the design doc.
>>>>
>>>> You are providing certain configuration parameters plus some settings
>>>> based on static values. For example:
>>>>
>>>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>>>
>>>> I cannot see any use of <processing time> which ought to be at least
>>>> half of the batch interval to have the correct margins (confidence level).
>>>> If you are going to have additional indicators why not look at scheduling
>>>> delay as well. Moreover most of the needed statistics are also available to
>>>> set accurate values. My inclination is that this is a great effort but
>>>> we ought to utilise the historical statistics collected under the
>>>> checkpointing directory to get more accurate statistics.
>>>> I will review the design document in due course.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>> <pkotikalap...@twilio.com.invalid> wrote:
>>>>
>>>> Hi Spark Dev,
>>>>
>>>> I have extended traditional DRA to work for the structured streaming
>>>> use case.
>>>>
>>>> Here is an initial implementation draft PR:
>>>> https://github.com/apache/spark/pull/42352
>>>> and design doc:
>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>>>
>>>> Please review and let me know what you think.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
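
For readers following the thread, here is a minimal sketch of the "conventional approach" raised in the discussion: comparing the incoming rate with the rate at which earlier micro-batches were handled, plus a margin between processing time and the trigger interval. It is not the logic of the draft PR or the design doc. `BatchStats`, `scaling_decision`, and both thresholds are hypothetical names and values chosen for illustration; the three fields are assumed to be populated from whatever a streaming query listener reports for each micro-batch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BatchStats:
    """Readings for one micro-batch (hypothetical container, not a Spark class)."""
    input_rows_per_second: float
    processed_rows_per_second: float
    processing_time_ms: float


def scaling_decision(
    history: Sequence[BatchStats],
    trigger_interval_ms: float,
    scale_up_ratio: float = 0.9,    # assumed threshold, not taken from the design doc
    scale_down_ratio: float = 0.5,  # assumed threshold, not taken from the design doc
) -> str:
    """Return 'scale_up', 'scale_down' or 'hold' from recent batch history."""
    if trigger_interval_ms <= 0:
        raise ValueError("trigger_interval_ms must be positive")
    if not history:
        # No evidence yet, so leave the executor count alone.
        return "hold"

    # Fraction of the trigger interval spent processing, averaged over history.
    avg_processing_ms = sum(b.processing_time_ms for b in history) / len(history)
    utilisation = avg_processing_ms / trigger_interval_ms

    # Rate comparison: is data arriving faster than it has been processed?
    total_in = sum(b.input_rows_per_second for b in history)
    total_out = sum(b.processed_rows_per_second for b in history)
    falling_behind = total_in > total_out

    if falling_behind or utilisation >= scale_up_ratio:
        return "scale_up"
    if utilisation <= scale_down_ratio:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    recent = [
        BatchStats(1000.0, 800.0, 58_000.0),
        BatchStats(1200.0, 900.0, 61_000.0),
    ]
    print(scaling_decision(recent, trigger_interval_ms=60_000.0))
```

Averaging over a short history rather than reacting to a single batch is a design choice made here to avoid flapping on one slow batch; how long that window should be, and whether checkpointed statistics should feed it, is exactly the open question in the thread.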