IMO ML might be good for the cluster scheduler, but for the core dynamic resource allocation (DRA) algorithm of Spark Structured Streaming (SSS) I believe we should start with some primitives of Structured Streaming. I would love to get some reviews of the doc and opinions on the feasibility of the solution.
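
To make "primitives" concrete, here is a minimal sketch of a rate-based heuristic of the kind discussed in the quoted thread below (comparing the incoming data rate with the effort spent on previous batches, relative to the trigger interval). It is not the code in the PR. It assumes that cost grows linearly with row count and that work spreads evenly over executors; `BatchStats`, `target_executors`, and the `headroom` default are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchStats:
    """Summary of one completed micro-batch (hypothetical container)."""
    input_rows: int       # rows read by the batch
    processing_ms: float  # wall-clock time the batch took
    executors: int        # executors running while it was processed


def target_executors(history, incoming_rows, trigger_ms,
                     min_executors=1, max_executors=50, headroom=0.8):
    """Estimate executors needed to finish the next batch within the trigger.

    Simplifying assumptions: cost is linear in rows, and work is spread
    evenly over executors.
    """
    total_rows = sum(b.input_rows for b in history)
    if total_rows == 0:
        return min_executors  # no history to learn from yet
    # Executor-milliseconds spent per row across previous batches.
    cost_per_row = sum(b.processing_ms * b.executors for b in history) / total_rows
    # Keep a margin so the batch finishes before the next trigger fires.
    budget_ms = trigger_ms * headroom
    needed = math.ceil(incoming_rows * cost_per_row / budget_ms)
    return max(min_executors, min(max_executors, needed))


history = [BatchStats(input_rows=10_000, processing_ms=30_000, executors=4)]
print(target_executors(history, incoming_rows=18_000, trigger_ms=60_000))
```

The clamp to `min_executors` and `max_executors` mirrors the bounds that dynamic allocation already exposes; everything else here is an illustrative simplification.
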
We have seen considerable savings using this solution in our team. I would like to hear from the dev community whether there is interest in DRA for Structured Streaming.

On Mon, Aug 14, 2023 at 9:12 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thank you for your comments.
>
> My vision of integrating machine learning (ML) into Spark Structured
> Streaming (SSS) for capacity planning and performance optimization seems
> promising. By leveraging ML techniques, I believe that we can potentially
> create predictive models that enhance the efficiency and resource
> allocation of the data processing pipelines. Here are some potential
> benefits and considerations for adding ML to SSS for capacity planning.
> However, I stand to be corrected.
>
> 1. *Predictive Capacity Planning:* ML models can analyze historical data
>    (that we discussed already), workloads, and trends to predict future
>    resource needs accurately. This enables proactive scaling and
>    allocation of resources, ensuring optimal performance during
>    high-demand periods, such as times of heavy trading.
> 2. *Real-time Decision Making:* ML can be used to make real-time
>    decisions on resource allocation (software and cluster) based on
>    current data and conditions, allowing for dynamic adjustments to meet
>    the processing demands.
> 3. *Complex Data Analysis:* In a heterogeneous setup involving multiple
>    databases, ML can analyze various factors like data read and write
>    times from different databases, data volumes, and data distribution
>    patterns to optimize the overall data processing flow.
> 4. *Anomaly Detection:* ML models can identify unusual patterns or
>    performance deviations, alerting us to potential issues before they
>    impact the system.
> 5. *Integration with Monitoring:* ML models can work alongside monitoring
>    tools, gathering real-time data on various performance metrics, and
>    using this data to make intelligent decisions on capacity and resource
>    allocation.
>
> However, there are some important considerations to keep in mind:
>
> 1. *Model Training:* ML models require training and validation using
>    relevant data. Our DS colleagues need to define appropriate features,
>    select the right ML algorithms, and fine-tune the model parameters to
>    achieve optimal performance.
> 2. *Complexity:* Integrating ML adds complexity to our architecture.
>    Moreover, we need to have the necessary expertise in both Spark
>    Structured Streaming and machine learning to design, implement, and
>    maintain the system effectively.
> 3. *Resource Overhead:* ML algorithms can be resource-intensive. We ought
>    to consider the additional computational requirements, especially
>    during the model training and inference phases.
>
> In summary, this idea of utilizing ML for capacity planning in Spark
> Structured Streaming may hold significant potential for improving system
> performance and resource utilization. Having said that, I totally agree
> that we need to evaluate the feasibility, potential benefits, and
> challenges, and we will need to involve experts in both Spark and machine
> learning to ensure a successful outcome.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Mon, 14 Aug 2023 at 14:58, Martin Andersson <martin.anders...@kambi.com>
> wrote:
>
>> IMO, using any kind of machine learning or AI for DRA is overkill. The
>> effort involved would be considerable and likely counterproductive,
>> compared to a more conventional approach of comparing the rate of incoming
>> stream data with the effort of handling previous data rates.
>> ------------------------------
>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>> *Sent:* Tuesday, August 8, 2023 19:59
>> *To:* Pavan Kotikalapudi <pkotikalap...@twilio.com>
>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Re: Dynamic resource allocation for structured streaming
>> [SPARK-24815]
>>
>> I am currently contemplating and sharing my thoughts openly. Considering
>> our reliance on previously collected statistics (as mentioned earlier), it
>> raises the question of why we couldn't integrate certain machine learning
>> elements into Spark Structured Streaming. While this might slightly deviate
>> from our current topic, I am not an expert in machine learning. However,
>> there are individuals who possess the expertise to assist us in exploring
>> this avenue.
>>
>> HTH
>>
>> Mich Talebzadeh
>>
>> On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <pkotikalap...@twilio.com>
>> wrote:
>>
>> Listeners are the best source of information for the allocation manager,
>> AFAIK. It already has a SparkListener
>> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640>
>> that it utilizes. We can use it to extract more information (like
>> processing times).
>> The one with more information regarding the streaming query resides in
>> the sql module
>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala>
>> though.
>>
>> Thanks
>>
>> Pavan
>>
>> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi Pavan or anyone else
>>
>> Is there any way one can access the metrics displayed on the Spark GUI?
>> For example, the readings for processing time? Can these be accessed?
>>
>> Thanks
>>
>> Mich Talebzadeh
>>
>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <pkotikalap...@twilio.com>
>> wrote:
>>
>> Thanks for the review Mich,
>>
>> Yes, the configuration parameters we end up setting would be based on the
>> trigger interval.
>>
>> > If you are going to have additional indicators why not look at
>> > scheduling delay as well
>> Yes. The implementation is based on scheduling delays, not for pending
>> tasks of the current stage but rather pending tasks of all the stages in
>> a micro-batch
>> <https://github.com/apache/spark/pull/42352/files#diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025>
>> (hence trigger interval).
>>
>> > we ought to utilise the historical statistics collected under the
>> > checkpointing directory to get more accurate statistics
>> You are right! This is just a simple implementation based on one factor.
>> We should also look into other indicators if that would help build a
>> better scaling algorithm.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I glanced over the design doc.
>>
>> You are providing certain configuration parameters plus some settings
>> based on static values. For example:
>>
>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>
>> I cannot see any use of <processing time> which ought to be at least half
>> of the batch interval to have the correct margins (confidence level). If
>> you are going to have additional indicators why not look at scheduling
>> delay as well. Moreover, most of the needed statistics are also available
>> to set accurate values. My inclination is that this is a great effort but
>> we ought to utilise the historical statistics collected under the
>> checkpointing directory to get more accurate statistics. I will review
>> the design document in due course.
>>
>> HTH
>>
>> Mich Talebzadeh
>>
>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>> <pkotikalap...@twilio.com.invalid> wrote:
>>
>> Hi Spark Dev,
>>
>> I have extended traditional DRA to work for the Structured Streaming
>> use case.
>>
>> Here is an initial implementation draft PR:
>> https://github.com/apache/spark/pull/42352
>> and design doc:
>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>
>> Please review and let me know what you think.
>>
>> Thank you,
>>
>> Pavan
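
On the configuration point raised in the thread (static values such as a 54s backlog timeout versus values tied to the trigger interval), one possible way to derive the timeouts is sketched below. The thread does not say how 54s was chosen; the 0.9 fraction is an assumption picked so that a 60-second trigger reproduces it, and `dra_timeouts`, `backlog_fraction`, and `idle_multiple` are hypothetical. The configuration keys themselves are standard Spark dynamic allocation settings.

```python
def dra_timeouts(trigger_seconds, backlog_fraction=0.9, idle_multiple=2.0):
    """Derive dynamic-allocation timeouts from the trigger interval.

    The fractions are illustrative assumptions, not values from the
    design doc.
    """
    if trigger_seconds <= 0:
        raise ValueError("trigger_seconds must be positive")
    # Scale up only if tasks stay backlogged for most of a trigger interval.
    backlog = max(1, round(trigger_seconds * backlog_fraction))
    # Release an executor only after it has idled for several intervals.
    idle = max(1, round(trigger_seconds * idle_multiple))
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.schedulerBacklogTimeout": f"{backlog}s",
        "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": f"{backlog}s",
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle}s",
    }


# Print the settings as they might be passed with --conf to spark-submit.
for key, value in dra_timeouts(60).items():
    print(f"--conf {key}={value}")
```

Tying the timeouts to the trigger interval keeps them meaningful when the interval changes, which is the concern about static values raised above.
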