Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-20 Thread Pavan Kotikalapudi
Here is the link to the voting thread:
https://lists.apache.org/thread/rlwqrw6ddxdkbvkp78kpd0zgvglgbbp8

Thank you,

Pavan


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Pavan Kotikalapudi
Thanks for the +1; I will propose a vote in a new thread now.

- Pavan


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-17 Thread Mich Talebzadeh
I think we have discussed this enough, and I consider it a useful feature. I
propose a vote on it.

+1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> I have extended the traditional DRA to work for the structured streaming
> use case.
>
> Here is an initial implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
> Please review and let me know what you think.
>
> Thank you,
>
> Pavan
>


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-16 Thread Adam Hobbs
Hi,

This is my first time using the dev mailing list so I hope this is the correct 
way to do it.

I would like to lend my support to this proposal and offer my experience as a 
consumer of Spark, and specifically Spark Structured Streaming (SSS). I am more 
of a cloud infrastructure DevOps engineer than a Spark/Scala coder.

Over the last couple of years I have been a member of a team that has built a 
banking application on top of SSS, Kafka and microservices. We currently run 
about 40 SSS apps that run 24x7. The load on the jobs fluctuates throughout the 
day based on customer activity, and overnight there is a large amount of data 
that comes from core banking batch runs.

We have been down the path of trying to make DRA work within our Spark 
infrastructure, and it took a long time to properly understand that the 
existing DRA mechanisms in Spark are mostly useless for SSS. We chased dynamic 
allocation for some time until we finally realised it is focussed on batch jobs 
and would not work properly with our SSS jobs (documentation relating to SSS 
and DRA is sparse to non-existent, and it was not at first clear that the DRA 
behaviour that is well documented does not apply to SSS). Most of our jobs have 
enough data flow that they never hit the idle timeout that governs standard 
DRA. Those that do have low data flow tended to cause cluster flapping, as 
scaling would take longer than processing the data itself.

Eventually we landed on the best stability and performance compromise by 
completely disabling all DRA and deploying our SSS apps at a static size whose 
resourcing can cope with daily peaks and overnight batch load. Obviously this 
means that for much of the day the deployed apps are running heavily 
over-provisioned.
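
As a rough illustration of the two set-ups described above (the property names
are standard Spark configuration keys; the numbers are made up and purely
illustrative, not anyone's production settings), the contrast looks roughly
like this in Scala:

    import org.apache.spark.SparkConf

    // Batch-oriented DRA: scale-down is governed by the executor idle timeout,
    // which busy 24x7 streaming jobs rarely ever hit.
    val draConf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

    // The static fallback: DRA off, executors sized for the daily peak, and
    // therefore over-provisioned for most of the day.
    val staticConf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "false")
      .set("spark.executor.instances", "20")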

Proper DRA that is built to work with SSS would be a massive money saver for us.

To me it seems that Pavan has a very good understanding of the same sort of 
issues that we have found, and he seems to have a working solution (I'm sure I 
read that he has his code in place and working successfully for his 
organisation).

I think it would be a great thing to get some form of DRA in place for SSS, 
even if it is rudimentary in form, as it will be a definite step up from the 
essentially zero support that currently exists for 24x7-style SSS apps.

If there is more that I can do to support this initiative and get this code 
included in an official Spark release, please let me know.


Regards,

Adam Hobbs







Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2024-01-05 Thread Mich Talebzadeh
 at 07:40, Pavan Kotikalapudi <
>>>>>>> pkotikalap...@twilio.com> wrote:
>>>>>>>
>>>>>>>> IMO ML might be good for the cluster scheduler, but for the core DRA
>>>>>>>> algorithm of SSS I believe we should start with some primitives of
>>>>>>>> Structured Streaming. I would love to get some reviews on the doc and
>>>>>>>> opinions on the feasibility of the solution.
>>>>>>>>
>>>>>>>> We have seen quite some savings using this solution in our team. I
>>>>>>>> would like to hear from the dev community whether they are looking
>>>>>>>> for / are interested in DRA for structured streaming.
>>>>>>>>

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Mich Talebzadeh
Thank you for your comments.

My vision of integrating machine learning (ML) into Spark Structured
Streaming (SSS) for capacity planning and performance optimization seems to
be promising. By leveraging ML techniques, I believe that we can
potentially create predictive models that enhance the efficiency and
resource allocation of the data processing pipelines. Here are some
potential benefits and considerations for adding ML to SSS for capacity
planning. However, I stand corrected

   1. Predictive Capacity Planning: ML models can analyze historical data
      (that we discussed already), workloads, and trends to predict future
      resource needs accurately. This enables proactive scaling and allocation
      of resources, ensuring optimal performance during high-demand periods,
      such as times of high trades.
   2. Real-time Decision Making: ML can be used to make real-time decisions
      on resource allocation (software and cluster) based on current data and
      conditions, allowing for dynamic adjustments to meet the processing
      demands.
   3. Complex Data Analysis: In a heterogeneous setup involving multiple
      databases, ML can analyze various factors like data read and write times
      from different databases, data volumes, and data distribution patterns
      to optimize the overall data processing flow.
   4. Anomaly Detection: ML models can identify unusual patterns or
      performance deviations, alerting us to potential issues before they
      impact the system.
   5. Integration with Monitoring: ML models can work alongside monitoring
      tools, gathering real-time data on various performance metrics, and
      using this data for making intelligent decisions on capacity and
      resource allocation.

However, there are some important considerations to keep in mind:

   1. Model Training: ML models require training and validation using
      relevant data. Our DS colleagues need to define appropriate features,
      select the right ML algorithms, and fine-tune the model parameters to
      achieve optimal performance.
   2. Complexity: Integrating ML adds complexity to our architecture.
      Moreover, we need to have the necessary expertise in both Spark
      Structured Streaming and machine learning to design, implement, and
      maintain the system effectively.
   3. Resource Overhead: ML algorithms can be resource-intensive. We ought
      to consider the additional computational requirements, especially during
      the model training and inference phases.

In summary, this idea of utilizing ML for capacity planning in Spark
Structured Streaming can possibly hold significant potential for improving
system performance and resource utilization. Having said that, I totally
agree that we need to evaluate the feasibility, potential benefits, and
challenges, and we will need to involve experts in both Spark and machine
learning to ensure a successful outcome.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom





Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-14 Thread Martin Andersson
IMO, using any kind of machine learning or AI for DRA is overkill. The effort 
involved would be considerable and likely counterproductive, compared to a more 
conventional approach of comparing the rate of incoming stream data with the 
effort of handling previous data rates.
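
A minimal sketch of that conventional approach (nothing that exists in Spark
today; the function, bounds and scaling rule are made up for illustration)
could compare the input rate against the processed rate reported for each
micro-batch and derive a target executor count from their ratio:

    import org.apache.spark.sql.streaming.StreamingQueryProgress

    // Hypothetical heuristic: if rows arrive faster than they are processed,
    // ask for proportionally more executors; otherwise allow a scale-down,
    // always staying within the configured bounds.
    def desiredExecutors(progress: StreamingQueryProgress,
                         currentExecutors: Int,
                         minExecutors: Int,
                         maxExecutors: Int): Int = {
      val inputRate     = progress.inputRowsPerSecond      // rows/s arriving
      val processedRate = progress.processedRowsPerSecond  // rows/s handled
      if (processedRate <= 0.0) currentExecutors
      else {
        val target = math.ceil(currentExecutors * (inputRate / processedRate)).toInt
        math.min(maxExecutors, math.max(minExecutors, target))
      }
    }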


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Mich Talebzadeh
I am currently contemplating this and sharing my thoughts openly. Considering
our reliance on previously collected statistics (as mentioned earlier), the
question arises of why we couldn't integrate certain machine learning elements
into Spark Structured Streaming. While this might slightly deviate from our
current topic, I am not an expert in machine learning. However, there are
individuals who possess the expertise to assist us in exploring this avenue.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom







Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Pavan Kotikalapudi
Listeners are the best resource for the allocation manager, AFAIK... it already
has a SparkListener
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640>
that it utilizes. We can use it to extract more information (like processing
times).
The one with more information regarding the streaming query resides in the sql
module
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala>,
though.

Thanks

Pavan
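
For anyone curious what that listener route looks like in practice, here is a
minimal sketch (an illustration only, not the code in the draft PR) that
registers a StreamingQueryListener and prints the per-batch processing time
from each progress event:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder().appName("progress-listener").getOrCreate()

    // Logs how long each micro-batch spent executing its trigger, plus the
    // observed input rate, the kind of signal a streaming-aware DRA could use.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        val triggerMs = p.durationMs.get("triggerExecution")
        println(s"batch=${p.batchId} triggerExecutionMs=$triggerMs " +
                s"inputRowsPerSec=${p.inputRowsPerSecond}")
      }
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })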

On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh 
wrote:

> Hi Pavan or anyone else
>
> Is there any way one access the matrix displayed on SparkGUI? For example
> the readings for processing time? Can these be acessed?
>
> Thanks
>
> For example,
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi 
> wrote:
>
>> Thanks for the review Mich,
>>
>> Yes, the configuration parameters we end up setting would be based on the
>> trigger interval.
>>
>> > If you are going to have additional indicators why not look at
>> scheduling delay as well
>> Yes. The implementation is based on scheduling delays, not for pending
>> tasks of the current stage but rather pending tasks of all the stages in
>> a micro-batch
>> 
>>  (hence
>> trigger interval).
>>
>> > we ought to utilise the historical statistics collected under the
>> checkpointing directory to get more accurate statistics
>> You are right! This is just a simple implementation based on one factor,
>> we should also look into other indicators as well If that would help build
>> a better scaling algorithm.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> I glanced over the design doc.
>>>
>>> You are providing certain configuration parameters plus some settings
>>> based on static values. For example:
>>>
>>> spark.dynamicAllocation.schedulerBacklogTimeout": 54s
>>>
>>> I cannot see any use of  which ought to be at least
>>> half of the batch interval to have the correct margins (confidence level). 
>>> If
>>> you are going to have additional indicators why not look at scheduling
>>> delay as well. Moreover most of the needed statistics are also available to
>>> set accurate values. My inclination is that this is a great effort but
>>> we ought to utilise the historical statistics collected under
>>> checkpointing directory to get more accurate statistics. I will review
>>> the design document in due course
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>  wrote:
>>>
 Hi Spark Dev,

 I have extended traditional DRA to work for structured streaming
 use-case.

 Here is an initial Implementation draft PR
 https://github.com/apache/spark/pull/42352
 

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Mich Talebzadeh
Hi Pavan or anyone else

Is there any way one can access the metrics displayed on the Spark UI? For
example, the readings for processing time? Can these be accessed?

Thanks

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi 
wrote:

> Thanks for the review Mich,
>
> Yes, the configuration parameters we end up setting would be based on the
> trigger interval.
>
> > If you are going to have additional indicators why not look at
> > scheduling delay as well
> Yes. The implementation is based on scheduling delays, not for pending
> tasks of the current stage but rather pending tasks of all the stages in
> a micro-batch (hence trigger interval).
>
> > we ought to utilise the historical statistics collected under the
> > checkpointing directory to get more accurate statistics
> You are right! This is just a simple implementation based on one factor;
> we should also look into other indicators as well, if that would help
> build a better scaling algorithm.
>
> Thank you,
>
> Pavan
>
> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> I glanced over the design doc.
>>
>> You are providing certain configuration parameters plus some settings
>> based on static values. For example:
>>
>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>
>> I cannot see any use of  which ought to be at least half
>> of the batch interval to have the correct margins (confidence level). If
>> you are going to have additional indicators why not look at scheduling
>> delay as well. Moreover most of the needed statistics are also available to
>> set accurate values. My inclination is that this is a great effort but
>> we ought to utilise the historical statistics collected under
>> checkpointing directory to get more accurate statistics. I will review
>> the design document in due course
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>  wrote:
>>
>>> Hi Spark Dev,
>>>
>>> I have extended traditional DRA to work for structured streaming
>>> use-case.
>>>
>>> Here is an initial Implementation draft PR
>>> https://github.com/apache/spark/pull/42352
>>> 
>>>  and
>>> design doc:
>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>> 
>>>
>>> Please review and let me know what you think.
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-07 Thread Pavan Kotikalapudi
Thanks for the review Mich,

Yes, the configuration parameters we end up setting would be based on the
trigger interval.

> If you are going to have additional indicators why not look at scheduling
> delay as well
Yes. The implementation is based on scheduling delays, not for pending
tasks of the current stage but rather pending tasks of all the stages in a
micro-batch (hence trigger interval).
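
To make that concrete, a rough sketch (purely illustrative, not the PR's
actual code) of how the pending tasks of all stages in a micro-batch could be
tallied from listener events:

  import org.apache.spark.scheduler._
  import scala.collection.mutable

  // Tracks, per submitted stage, how many of its tasks have not finished yet;
  // the sum over all in-flight stages is the backlog signal a scaling policy
  // could weigh against the trigger interval.
  class MicroBatchBacklogListener extends SparkListener {
    private val pendingTasks = mutable.Map[Int, Int]()

    override def onStageSubmitted(e: SparkListenerStageSubmitted): Unit = synchronized {
      pendingTasks(e.stageInfo.stageId) = e.stageInfo.numTasks
    }
    override def onTaskEnd(e: SparkListenerTaskEnd): Unit = synchronized {
      pendingTasks.get(e.stageId).foreach { remaining =>
        pendingTasks(e.stageId) = math.max(0, remaining - 1)
      }
    }
    override def onStageCompleted(e: SparkListenerStageCompleted): Unit = synchronized {
      pendingTasks.remove(e.stageInfo.stageId)
    }

    def totalPendingTasks: Int = synchronized { pendingTasks.values.sum }
  }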

> we ought to utilise the historical statistics collected under the
> checkpointing directory to get more accurate statistics
You are right! This is just a simple implementation based on one factor; we
should also look into other indicators as well, if that would help build a
better scaling algorithm.

Thank you,

Pavan

On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh 
wrote:

> Hi,
>
> I glanced over the design doc.
>
> You are providing certain configuration parameters plus some settings
> based on static values. For example:
>
> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>
> I cannot see any use of  which ought to be at least half
> of the batch interval to have the correct margins (confidence level). If
> you are going to have additional indicators why not look at scheduling
> delay as well. Moreover most of the needed statistics are also available to
> set accurate values. My inclination is that this is a great effort but we
> ought to utilise the historical statistics collected under checkpointing
> directory to get more accurate statistics. I will review the design
> document in due course
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>  wrote:
>
>> Hi Spark Dev,
>>
>> I have extended traditional DRA to work for structured streaming
>> use-case.
>>
>> Here is an initial Implementation draft PR
>> https://github.com/apache/spark/pull/42352
>> 
>>  and
>> design doc:
>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>> 
>>
>> Please review and let me know what you think.
>>
>> Thank you,
>>
>> Pavan
>>
>


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-07 Thread Mich Talebzadeh
Hi,

I glanced over the design doc.

You are providing certain configuration parameters plus some settings based
on static values. For example:

spark.dynamicAllocation.schedulerBacklogTimeout: 54s

I cannot see any use of  which ought to be at least half
of the batch interval to have the correct margins (confidence level). If
you are going to have additional indicators why not look at scheduling
delay as well. Moreover most of the needed statistics are also available to
set accurate values. My inclination is that this is a great effort but we
ought to utilise the historical statistics collected under checkpointing
directory to get more accurate statistics. I will review the design
document in due course
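
As a purely illustrative example of that sizing rule (sample values of my own,
not figures from the design doc), with a 60s trigger interval one might start
from something like:

  import org.apache.spark.sql.SparkSession

  // Backlog timeouts set to roughly half of the 60s trigger interval. In
  // practice these would normally be passed via spark-submit --conf rather
  // than hard-coded.
  val spark = SparkSession.builder()
    .appName("streaming-dra-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "30s")
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "30s")
    .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
    .getOrCreate()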

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> I have extended traditional DRA to work for structured streaming
> use-case.
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
> Please review and let me know what you think.
>
> Thank you,
>
> Pavan
>


Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-07 Thread Holden Karau
Oooh fascinating. I’m going on call this week so it will take me a while but
I do want to review this :)

On Mon, Aug 7, 2023 at 5:30 PM Pavan Kotikalapudi
 wrote:

> Hi Spark Dev,
>
> I have extended traditional DRA to work for structured streaming
> use-case.
>
> Here is an initial Implementation draft PR
> https://github.com/apache/spark/pull/42352 and design doc:
> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>
> Please review and let me know what you think.
>
> Thank you,
>
> Pavan
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau