Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Mich Talebzadeh
I am thinking out loud here. Given our reliance on previously collected
statistics (as mentioned earlier), it raises the question of why we could
not integrate certain machine learning elements into Spark Structured
Streaming. This might deviate slightly from the current topic, and I am not
an expert in machine learning myself, but there are individuals who have
the expertise to help us explore this avenue.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom






On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi 
wrote:

> Listeners are the best resources for the allocation manager afaik... It
> already has a SparkListener that it utilizes. We can use it to extract
> more information (like processing times).
> The one with more information regarding the streaming query resides in
> the sql module, though.
>
> Thanks
>
> Pavan
>
> On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh 
> wrote:
>
>> Hi Pavan or anyone else
>>
>> Is there any way one can access the metrics displayed on the Spark GUI?
>> For example the readings for processing time? Can these be accessed?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi 
>> wrote:
>>
>>> Thanks for the review Mich,
>>>
>>> Yes, the configuration parameters we end up setting would be based on
>>> the trigger interval.
>>>
>>> > If you are going to have additional indicators why not look at
>>> scheduling delay as well
>>> Yes. The implementation is based on scheduling delays: not the pending
>>> tasks of the current stage, but rather the pending tasks of all the
>>> stages in a micro-batch (hence the trigger interval).
>>>
>>> > we ought to utilise the historical statistics collected under the
>>> checkpointing directory to get more accurate statistics
>>> You are right! This is just a simple implementation based on one factor;
>>> we should also look into other indicators as well, if that would help
>>> build a better scaling algorithm.
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 I glanced over the design doc.

 You are providing certain configuration parameters plus some settings
 based on static values. For example:

 spark.dynamicAllocation.schedulerBacklogTimeout: 54s

 I cannot see any use of [...] which ought to be at least half of the batch
 interval to have the correct margins (confidence level). If you are going to
 have additional indicators, why not look at scheduling delay as well?
 Moreover, most of the needed statistics are also available to set accurate
 values. My inclination is that this is a great effort, but we ought to
 utilise the historical statistics collected under the checkpointing
 directory to get more accurate statistics. I will review the design document
 in due course.

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom



Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Mich Talebzadeh
Splendid idea. 

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom






On Tue, 8 Aug 2023 at 18:10, Holden Karau  wrote:

> The driver it’s self is probably another topic, perhaps I’ll make a
> “faster spark star time” JIRA and a DA JIRA and we can explore both.
>
> On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh 
> wrote:
>
>> From my own perspective, faster execution time, especially with Spark on
>> tin boxes (Dataproc & EC2) and Spark on k8s, is something that customers
>> often bring up.
>>
>> Poor time to onboard with autoscaling seems to be particularly singled
>> out for heavy ETL jobs that use Spark. I am disappointed to see the poor
>> performance of Spark on k8s autopilot, with the time for the driver itself
>> to move from the Pending to the Running phase (Spark 3.4.1 with Java 11)
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 15:49, kalyan  wrote:
>>
>>> +1 to enhancements in DEA. Long overdue!
>>>
>>> There were a few things that I was thinking along the same lines for
>>> some time now (a few overlap with @holden's points)
>>> 1. How to reduce wastage on the RM side? Sometimes the driver asks for
>>> some units of resources. But when RM provisions them, the driver cancels
>>> it.
>>> 2. How to make the resource available when it is needed.
>>> 3. Cost Vs AppRunTime: A good DEA algo should allow the developer to
>>> choose between cost and runtime. Sometimes developers might be ok to pay
>>> higher costs for faster execution.
>>> 4. Stitch resource profile choices into query execution.
>>> 5. Allow different DEA algo to be chosen for different queries within
>>> the same spark application.
>>> 6. Fall back to default algo, when things go haywire!
>>>
>>> Model-based learning would be awesome.
>>> These can be fine-tuned with some tools like sparklens.
>>>
>>> I am aware of a few experiments carried out in this area by my friends
>>> in this domain. One lesson we learned was that it is hard to have a
>>> generic algorithm that works for all cases.
>>>
>>> Regards
>>> kalyan.
>>>
>>>
>>> On Tue, Aug 8, 2023 at 6:12 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks for pointing out this feature to me. I will have a look when I
 get there.

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom






 On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫)  wrote:

> Spark 3.5 added a method `supportsReliableStorage` to
> `ShuffleDriverComponents` which indicates whether shuffle data is written
> to a distributed filesystem or persisted in a remote shuffle service.
>
> Uniffle is a general-purpose remote shuffle service (
> https://github.com/apache/incubator-uniffle). It can enhance the
> experience of Spark on K8S. After Spark 3.5 is released, Uniffle will
> support the `ShuffleDriverComponents`; see [1].
>
> If you are interested in more details about Uniffle, see [2]
>
>
> [1] https://github.com/apache/incubator-uniffle/issues/802.
>
> [2]
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
>
>
> *From:* Mich Talebzadeh 
> *Date:* Tuesday, 8 August 2023 06:53
> *Cc:* dev 
> *Subject:* [Internet]Re: Improving Dynamic Allocation 

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Holden Karau
The driver it’s self is probably another topic, perhaps I’ll make a “faster
spark star time” JIRA and a DA JIRA and we can explore both.

On Tue, Aug 8, 2023 at 10:07 AM Mich Talebzadeh 
wrote:

> From my own perspective, faster execution time, especially with Spark on
> tin boxes (Dataproc & EC2) and Spark on k8s, is something that customers
> often bring up.
>
> Poor time to onboard with autoscaling seems to be particularly singled out
> for heavy ETL jobs that use Spark. I am disappointed to see the poor
> performance of Spark on k8s autopilot, with the time for the driver itself
> to move from the Pending to the Running phase (Spark 3.4.1 with Java 11)
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>
>
>
>
> On Tue, 8 Aug 2023 at 15:49, kalyan  wrote:
>
>> +1 to enhancements in DEA. Long overdue!
>>
>> There were a few things that I was thinking along the same lines for some
>> time now (a few overlap with @holden's points)
>> 1. How to reduce wastage on the RM side? Sometimes the driver asks for
>> some units of resources. But when RM provisions them, the driver cancels
>> it.
>> 2. How to make the resource available when it is needed.
>> 3. Cost Vs AppRunTime: A good DEA algo should allow the developer to
>> choose between cost and runtime. Sometimes developers might be ok to pay
>> higher costs for faster execution.
>> 4. Stitch resource profile choices into query execution.
>> 5. Allow different DEA algo to be chosen for different queries within the
>> same spark application.
>> 6. Fall back to default algo, when things go haywire!
>>
>> Model-based learning would be awesome.
>> These can be fine-tuned with some tools like sparklens.
>>
>> I am aware of a few experiments carried out in this area by my friends in
>> this domain. One lesson we learned was that it is hard to have a generic
>> algorithm that works for all cases.
>>
>> Regards
>> kalyan.
>>
>>
>> On Tue, Aug 8, 2023 at 6:12 PM Mich Talebzadeh 
>> wrote:
>>
>>> Thanks for pointing out this feature to me. I will have a look when I
>>> get there.
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫)  wrote:
>>>
 Spark 3.5 added a method `supportsReliableStorage` to
 `ShuffleDriverComponents` which indicates whether shuffle data is written
 to a distributed filesystem or persisted in a remote shuffle service.

 Uniffle is a general-purpose remote shuffle service (
 https://github.com/apache/incubator-uniffle). It can enhance the
 experience of Spark on K8S. After Spark 3.5 is released, Uniffle will
 support the `ShuffleDriverComponents`; see [1].

 If you are interested in more details about Uniffle, see [2]


 [1] https://github.com/apache/incubator-uniffle/issues/802.

 [2]
 https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era



 *From:* Mich Talebzadeh 
 *Date:* Tuesday, 8 August 2023 06:53
 *Cc:* dev 
 *Subject:* [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+



 On the subject of dynamic allocation, is the following message a cause
 for concern when running Spark on k8s?



 INFO ExecutorAllocationManager: Dynamic allocation is enabled without a
 shuffle service.


 Mich Talebzadeh,

 Solutions Architect/Engineering Lead

 London

 United Kingdom




Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Mich Talebzadeh
From my own perspective, faster execution time, especially with Spark on tin
boxes (Dataproc & EC2) and Spark on k8s, is something that customers often
bring up.

Poor time to onboard with autoscaling seems to be particularly singled out
for heavy ETL jobs that use Spark. I am disappointed to see the poor
performance of Spark on k8s autopilot, with the time for the driver itself
to move from the Pending to the Running phase (Spark 3.4.1 with Java 11)

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom






On Tue, 8 Aug 2023 at 15:49, kalyan  wrote:

> +1 to enhancements in DEA. Long overdue!
>
> There were a few things that I was thinking along the same lines for some
> time now (a few overlap with @holden's points)
> 1. How to reduce wastage on the RM side? Sometimes the driver asks for
> some units of resources. But when RM provisions them, the driver cancels
> it.
> 2. How to make the resource available when it is needed.
> 3. Cost Vs AppRunTime: A good DEA algo should allow the developer to
> choose between cost and runtime. Sometimes developers might be ok to pay
> higher costs for faster execution.
> 4. Stitch resource profile choices into query execution.
> 5. Allow different DEA algo to be chosen for different queries within the
> same spark application.
> 6. Fall back to default algo, when things go haywire!
>
> Model-based learning would be awesome.
> These can be fine-tuned with some tools like sparklens.
>
> I am aware of a few experiments carried out in this area by my friends in
> this domain. One lesson we learned was that it is hard to have a generic
> algorithm that works for all cases.
>
> Regards
> kalyan.
>
>
> On Tue, Aug 8, 2023 at 6:12 PM Mich Talebzadeh 
> wrote:
>
>> Thanks for pointing out this feature to me. I will have a look when I get
>> there.
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫)  wrote:
>>
>>> Spark 3.5 added a method `supportsReliableStorage` to
>>> `ShuffleDriverComponents` which indicates whether shuffle data is
>>> written to a distributed filesystem or persisted in a remote shuffle
>>> service.
>>>
>>> Uniffle is a general-purpose remote shuffle service (
>>> https://github.com/apache/incubator-uniffle). It can enhance the
>>> experience of Spark on K8S. After Spark 3.5 is released, Uniffle will
>>> support the `ShuffleDriverComponents`; see [1].
>>>
>>> If you are interested in more details about Uniffle, see [2]
>>>
>>>
>>> [1] https://github.com/apache/incubator-uniffle/issues/802.
>>>
>>> [2]
>>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>>
>>>
>>>
>>> *From:* Mich Talebzadeh 
>>> *Date:* Tuesday, 8 August 2023 06:53
>>> *Cc:* dev 
>>> *Subject:* [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+
>>>
>>>
>>>
>>> On the subject of dynamic allocation, is the following message a cause
>>> for concern when running Spark on k8s?
>>>
>>>
>>>
>>> INFO ExecutorAllocationManager: Dynamic allocation is enabled without a
>>> shuffle service.
>>>
>>>
>>> Mich Talebzadeh,
>>>
>>> Solutions Architect/Engineering Lead
>>>
>>> London
>>>
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh 
>>> wrote:
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> From what I have seen, Spark on a serverless cluster has a hard time
>>> getting the 

Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Pavan Kotikalapudi
Listeners are the best resources for the allocation manager afaik... It
already has a SparkListener that it utilizes. We can use it to extract more
information (like processing times).
The one with more information regarding the streaming query resides in the
sql module, though.
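
For illustration, a minimal sketch of that idea (not the actual PR code; the
class name is made up) that accumulates successful-task processing time per
stage from SparkListener events:

    import scala.collection.concurrent.TrieMap
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Illustrative only: sum up processing time of successful tasks per stage
    // so a scaling policy can combine it with backlog information.
    class ProcessingTimeListener extends SparkListener {
      private val stageTimeMs = TrieMap.empty[Int, Long]

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        if (info != null && info.successful) {
          val elapsedMs = info.finishTime - info.launchTime
          // best-effort accumulation; not atomic, but fine for a sketch
          stageTimeMs.put(taskEnd.stageId,
            stageTimeMs.getOrElse(taskEnd.stageId, 0L) + elapsedMs)
        }
      }

      def totalTimeMs(stageId: Int): Long = stageTimeMs.getOrElse(stageId, 0L)
    }

    // Registered like any other listener:
    // spark.sparkContext.addSparkListener(new ProcessingTimeListener)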

Thanks

Pavan

On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh 
wrote:

> Hi Pavan or anyone else
>
> Is there any way one can access the metrics displayed on the Spark GUI?
> For example the readings for processing time? Can these be accessed?
>
> Thanks
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>
>
>
>
> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi 
> wrote:
>
>> Thanks for the review Mich,
>>
>> Yes, the configuration parameters we end up setting would be based on the
>> trigger interval.
>>
>> > If you are going to have additional indicators why not look at
>> scheduling delay as well
> Yes. The implementation is based on scheduling delays: not the pending
> tasks of the current stage, but rather the pending tasks of all the stages
> in a micro-batch (hence the trigger interval).
>>
>> > we ought to utilise the historical statistics collected under the
>> checkpointing directory to get more accurate statistics
> You are right! This is just a simple implementation based on one factor;
> we should also look into other indicators as well, if that would help
> build a better scaling algorithm.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> I glanced over the design doc.
>>>
>>> You are providing certain configuration parameters plus some settings
>>> based on static values. For example:
>>>
>>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>>
>>> I cannot see any use of [...] which ought to be at least half of the batch
>>> interval to have the correct margins (confidence level). If you are going
>>> to have additional indicators, why not look at scheduling delay as well?
>>> Moreover, most of the needed statistics are also available to set accurate
>>> values. My inclination is that this is a great effort, but we ought to
>>> utilise the historical statistics collected under the checkpointing
>>> directory to get more accurate statistics. I will review the design
>>> document in due course.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>>  wrote:
>>>
 Hi Spark Dev,

 I have extended traditional DRA to work for the structured streaming
 use case.

 Here is an initial implementation draft PR:
 https://github.com/apache/spark/pull/42352

Re: ASF board report draft for August 2023

2023-08-08 Thread Holden Karau
Maybe add a link to the 4.0 JIRA where we are tracking the current plans
for 4.0?

On Tue, Aug 8, 2023 at 9:33 AM Dongjoon Hyun 
wrote:

> Thank you, Matei.
>
> It looks good to me.
>
> Dongjoon
>
> On Mon, Aug 7, 2023 at 22:54 Matei Zaharia 
> wrote:
>
>> It’s time to send our quarterly report to the ASF board on August 9th.
>> Here’s what I wrote as a draft — feel free to suggest changes.
>>
>> =
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
>> - We cut the branch Spark 3.5.0 on July 17th 2023. The community is
>> working on bug fixes, tests, stability and documentation.
>> - We made a patch release, Spark 3.4.1, on June 23, 2023.
>> - We are preparing a Spark 3.3.3 release for later this month (
>> https://lists.apache.org/thread/0kgnw8njjnfgc5nghx60mn7oojvrqwj7).
>> - Votes on three Spark Project Improvement Proposals (SPIP) passed: "XML
>> data source support", "Python Data Source API", and "PySpark Test
>> Framework".
>> - A vote for "Apache Spark PMC asks Databricks to differentiate its Spark
>> version string" did not pass. This was asking a company to change the
>> string returned by Spark APIs in a product that packages a modified version
>> of Apache Spark.
>> - The community decided to release Apache Spark 4.0.0 after the 3.5.0
>> version.
>> - An official Apache Spark Docker image is now available at
>> https://hub.docker.com/_/spark
>> - A new repository, https://github.com/apache/spark-connect-go, was
>> created for the Go client of Spark Connect.
>> - The PMC voted to add two new committers to the project, XiDuo You and
>> Peter Toth
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>>
>> - We released Apache Spark 3.4.1 on June 23, 2023
>> - We released Apache Spark 3.2.4 on April 13, 2023
>> - We released Spark 3.3.2 on February 17, 2023
>>
>> Committers and PMC:
>>
>> - The latest committers were added on July 11th, 2023 (XiDuo You and
>> Peter Toth).
>> - The latest PMC members were added on May 10th, 2023 (Chao Sun, Xinrong
>> Meng and Ruifeng Zheng).
>>
>> =
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: ASF board report draft for August 2023

2023-08-08 Thread Dongjoon Hyun
Thank you, Matei.

It looks good to me.

Dongjoon

On Mon, Aug 7, 2023 at 22:54 Matei Zaharia  wrote:

> It’s time to send our quarterly report to the ASF board on August 9th.
> Here’s what I wrote as a draft — feel free to suggest changes.
>
> =
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We cut the branch Spark 3.5.0 on July 17th 2023. The community is
> working on bug fixes, tests, stability and documentation.
> - We made a patch release, Spark 3.4.1, on June 23, 2023.
> - We are preparing a Spark 3.3.3 release for later this month (
> https://lists.apache.org/thread/0kgnw8njjnfgc5nghx60mn7oojvrqwj7).
> - Votes on three Spark Project Improvement Proposals (SPIP) passed: "XML
> data source support", "Python Data Source API", and "PySpark Test
> Framework".
> - A vote for "Apache Spark PMC asks Databricks to differentiate its Spark
> version string" did not pass. This was asking a company to change the
> string returned by Spark APIs in a product that packages a modified version
> of Apache Spark.
> - The community decided to release Apache Spark 4.0.0 after the 3.5.0
> version.
> - An official Apache Spark Docker image is now available at
> https://hub.docker.com/_/spark
> - A new repository, https://github.com/apache/spark-connect-go, was
> created for the Go client of Spark Connect.
> - The PMC voted to add two new committers to the project, XiDuo You and
> Peter Toth
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
>
> - We released Apache Spark 3.4.1 on June 23, 2023
> - We released Apache Spark 3.2.4 on April 13, 2023
> - We released Spark 3.3.2 on February 17, 2023
>
> Committers and PMC:
>
> - The latest committers were added on July 11th, 2023 (XiDuo You and Peter
> Toth).
> - The latest PMC members were added on May 10th, 2023 (Chao Sun, Xinrong
> Meng and Ruifeng Zheng).
>
> =


Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread kalyan
+1 to enhancements in DEA. Long overdue!

There were a few things that I was thinking along the same lines for some
time now (a few overlap with @holden's points)
1. How to reduce wastage on the RM side? Sometimes the driver asks for some
units of resources. But when RM provisions them, the driver cancels it.
2. How to make the resource available when it is needed.
3. Cost Vs AppRunTime: A good DEA algo should allow the developer to choose
between cost and runtime. Sometimes developers might be ok to pay higher
costs for faster execution.
4. Stitch resource profile choices into query execution.
5. Allow different DEA algo to be chosen for different queries within the
same spark application.
6. Fall back to default algo, when things go haywire!

Model-based learning would be awesome.
These can be fine-tuned with some tools like sparklens.

I am aware of a few experiments carried out in this area by my friends in
this domain. One lesson we learned was that it is hard to have a generic
algorithm that works for all cases.

Regards
kalyan.


On Tue, Aug 8, 2023 at 6:12 PM Mich Talebzadeh 
wrote:

> Thanks for pointing out this feature to me. I will have a look when I get
> there.
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>
>
>
>
> On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫)  wrote:
>
>> Spark 3.5 added a method `supportsReliableStorage` to
>> `ShuffleDriverComponents` which indicates whether shuffle data is written
>> to a distributed filesystem or persisted in a remote shuffle service.
>>
>> Uniffle is a general-purpose remote shuffle service (
>> https://github.com/apache/incubator-uniffle). It can enhance the
>> experience of Spark on K8S. After Spark 3.5 is released, Uniffle will
>> support the `ShuffleDriverComponents`; see [1].
>>
>> If you are interested in more details about Uniffle, see [2]
>>
>>
>> [1] https://github.com/apache/incubator-uniffle/issues/802.
>>
>> [2]
>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>
>>
>>
>> *From:* Mich Talebzadeh 
>> *Date:* Tuesday, 8 August 2023 06:53
>> *Cc:* dev 
>> *Subject:* [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+
>>
>>
>>
>> On the subject of dynamic allocation, is the following message a cause
>> for concern when running Spark on k8s?
>>
>>
>>
>> INFO ExecutorAllocationManager: Dynamic allocation is enabled without a
>> shuffle service.
>>
>>
>> Mich Talebzadeh,
>>
>> Solutions Architect/Engineering Lead
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh 
>> wrote:
>>
>>
>>
>> Hi,
>>
>>
>>
>> From what I have seen, Spark on a serverless cluster has a hard time
>> getting the driver going in a timely manner.
>>
>>
>>
>> Annotations:  autopilot.gke.io/resource-adjustment:
>>
>>
>> {"input":{"containers":[{"limits":{"memory":"1433Mi"},"requests":{"cpu":"1","memory":"1433Mi"},"name":"spark-kubernetes-driver"}]},"output...
>>
>>   autopilot.gke.io/warden-version: 2.7.41
>>
>>
>>
>> This is on Spark 3.4.1 with Java 11, on both the host running
>> spark-submit and the Docker image itself.
>>
>>
>>
>> I am not sure how relevant this is to this discussion but it looks like a
>> kind of blocker for now. What config params can help here and what can be
>> done?
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Solutions Architect/Engineering Lead
>>
>> London
>>
>> United Kingdom
>>
>>
>>

Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Thomas Graves
> > - Advisory user input (e.g. a way to say after X is done I know I need Y 
> > where Y might be a bunch of GPU machines)

Are you thinking of something more advanced than Stage Level Scheduling?
Or perhaps configuring it differently, or prestarting things you know you
will need?
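
For reference, the existing stage-level scheduling API already expresses the
"after X is done I know I need Y" case at the RDD level (assuming Spark 3.1+);
`features` and `trainOnPartition` below are placeholders:

    import org.apache.spark.resource.{ExecutorResourceRequests,
      ResourceProfileBuilder, TaskResourceRequests}

    // Ask for GPU executors only for the stage that needs them.
    val execReqs = new ExecutorResourceRequests()
      .cores(4)
      .memory("16g")
      .resource("gpu", 1, "/opt/spark/bin/getGpus.sh") // discovery script path is an assumption
    val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)

    val gpuProfile = new ResourceProfileBuilder()
      .require(execReqs)
      .require(taskReqs)
      .build()

    // Only this stage is scheduled with the GPU profile.
    val trained = features.withResources(gpuProfile).mapPartitions(trainOnPartition)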

Tom

On Mon, Aug 7, 2023 at 3:27 PM Holden Karau  wrote:
>
> So I'm wondering if there is interest in revisiting some of how Spark is
> doing its dynamic allocation for Spark 4+?
>
> Some things that I've been thinking about:
>
> - Advisory user input (e.g. a way to say after X is done I know I need Y 
> where Y might be a bunch of GPU machines)
> - Configurable tolerance (e.g. if we have at most Z% over target no-op)
> - Past runs of same job (e.g. stage X of job Y had a peak of K)
> - Faster executor launches (I'm a little fuzzy on what we can do here but
> one area, for example, is that we set up and tear down an RPC connection to
> the driver with a blocking call, which does seem to have some locking
> inside of the driver at first glance)
>
> Is this an area other folks are thinking about? Should I make an epic we can 
> track ideas in? Or are folks generally happy with today's dynamic allocation 
> (or just busy with other things)?
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Dynamic resource allocation for structured streaming [SPARK-24815]

2023-08-08 Thread Mich Talebzadeh
Hi Pavan or anyone else

Is there any way one can access the metrics displayed on the Spark GUI? For
example the readings for processing time? Can these be accessed?
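
For what it is worth, the numbers on the streaming UI are also exposed
programmatically (and over the REST API under /api/v1); a minimal sketch,
assuming a SparkSession `spark` and a running StreamingQuery `query`:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Per-trigger durations (e.g. "triggerExecution", "addBatch"), the same
    // values shown on the Structured Streaming UI page.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        println(s"batch=${p.batchId} durations(ms)=${p.durationMs}")
      }
    })

    // Or poll the latest snapshot directly:
    // query.lastProgress.durationMs.get("triggerExecution")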

Thanks

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom






On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi 
wrote:

> Thanks for the review Mich,
>
> Yes, the configuration parameters we end up setting would be based on the
> trigger interval.
>
> > If you are going to have additional indicators why not look at
> scheduling delay as well
> Yes. The implementation is based on scheduling delays: not the pending
> tasks of the current stage, but rather the pending tasks of all the stages
> in a micro-batch (hence the trigger interval).
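
A minimal sketch of that idea (illustrative, not the PR's actual code): treat
the backlog as tasks submitted minus tasks finished across all stages since
the trigger began:

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener,
      SparkListenerStageSubmitted, SparkListenerTaskEnd}

    // Pending tasks across all stages of the current micro-batch.
    class BatchBacklogListener extends SparkListener {
      private val submitted = new AtomicLong(0)
      private val completed = new AtomicLong(0)

      override def onStageSubmitted(s: SparkListenerStageSubmitted): Unit = {
        submitted.addAndGet(s.stageInfo.numTasks) // Int widened to Long
        ()
      }

      override def onTaskEnd(t: SparkListenerTaskEnd): Unit = {
        completed.incrementAndGet()
        ()
      }

      def pendingTasks: Long = submitted.get() - completed.get()
      def resetForNextTrigger(): Unit = { submitted.set(0); completed.set(0) }
    }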
>
> > we ought to utilise the historical statistics collected under the
> checkpointing directory to get more accurate statistics
> You are right! This is just a simple implementation based on one factor;
> we should also look into other indicators as well, if that would help
> build a better scaling algorithm.
>
> Thank you,
>
> Pavan
>
> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> I glanced over the design doc.
>>
>> You are providing certain configuration parameters plus some settings
>> based on static values. For example:
>>
>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>
>> I cannot see any use of [...] which ought to be at least half of the batch
>> interval to have the correct margins (confidence level). If you are going
>> to have additional indicators, why not look at scheduling delay as well?
>> Moreover, most of the needed statistics are also available to set accurate
>> values. My inclination is that this is a great effort, but we ought to
>> utilise the historical statistics collected under the checkpointing
>> directory to get more accurate statistics. I will review the design
>> document in due course.
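
For concreteness, these are the knobs under discussion, with illustrative
values tied to a hypothetical 60s trigger interval (not recommendations from
the design doc):

    spark.dynamicAllocation.enabled=true
    # On k8s without an external shuffle service, pair with shuffle tracking:
    spark.dynamicAllocation.shuffleTracking.enabled=true
    # Request executors once tasks have been backlogged this long;
    # here roughly half of the 60s trigger interval:
    spark.dynamicAllocation.schedulerBacklogTimeout=30s
    spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30s
    # Release executors after they have been idle for a while:
    spark.dynamicAllocation.executorIdleTimeout=60s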
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>
>>
>>
>>
>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>  wrote:
>>
>>> Hi Spark Dev,
>>>
>>> I have extended traditional DRA to work for the structured streaming
>>> use case.
>>>
>>> Here is an initial implementation draft PR:
>>> https://github.com/apache/spark/pull/42352
>>> 
>>>  and
>>> design doc:
>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>> 
>>>
>>> Please review and let me know what you think.
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-08 Thread Yuming Wang
-1. I found a NoClassDefFoundError bug:
https://issues.apache.org/jira/browse/SPARK-44719.

On Mon, Aug 7, 2023 at 11:24 AM yangjie01 
wrote:

>
>
> I submitted a PR last week to try and solve this issue:
> https://github.com/apache/spark/pull/42236.
>
>
>
> *From:* Sean Owen 
> *Date:* Monday, 7 August 2023 11:05
> *To:* Yuanjian Li 
> *Cc:* Spark dev list 
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC1)
>
>
>
>
>
> Let's keep testing 3.5.0 of course while that change is going in. (See
> https://github.com/apache/spark/pull/42364#issuecomment-1666878287)
>
>
>
> Otherwise testing is pretty much as usual, except I get this test failure
> in Connect, which is new. Anyone else? This is Java 8, Scala 2.13, Debian
> 12.
>
>
>
> - from_protobuf_messageClassName_options *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
> Could not load Protobuf class with name
> org.apache.spark.connect.proto.StorageLevel.
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
> with Protobuf classes needs to be shaded (com.google.protobuf.* -->
> org.sparkproject.spark_protobuf.protobuf.*).
>   at
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
>   at
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
>   at
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:156)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:73)
>   at scala.collection.immutable.List.map(List.scala:246)
>
>
>
> On Sat, Aug 5, 2023 at 5:42 PM Sean Owen  wrote:
>
> I'm still testing other combinations, but it looks like tests fail on Java
> 17 after building with Java 8, which should be a normal supported
> configuration.
>
> This is described at https://github.com/apache/spark/pull/41943 and looks
> like it is resolved by moving back to Scala 2.13.8 for now.
>
> Unless I'm missing something we need to fix this for 3.5 or it's not clear
> the build will run on Java 17.
>
>
>
> On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:
>
> Please vote on releasing the following candidate(RC1) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Aug 9th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
>
>
>
> The tag to be voted on is v3.5.0-rc1 (commit
> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>
> https://github.com/apache/spark/tree/v3.5.0-rc1
> 
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
> 
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1444
> 
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
> 

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread Mich Talebzadeh
Thanks for pointing out this feature to me. I will have a look when I get
there.

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom






On Tue, 8 Aug 2023 at 11:44, roryqi(齐赫)  wrote:

> Spark 3.5 added a method `supportsReliableStorage` to
> `ShuffleDriverComponents` which indicates whether shuffle data is written
> to a distributed filesystem or persisted in a remote shuffle service.
>
> Uniffle is a general-purpose remote shuffle service (
> https://github.com/apache/incubator-uniffle). It can enhance the
> experience of Spark on K8S. After Spark 3.5 is released, Uniffle will
> support the `ShuffleDriverComponents`; see [1].
>
> If you are interested in more details about Uniffle, see [2]
>
>
> [1] https://github.com/apache/incubator-uniffle/issues/802.
>
> [2]
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
>
>
> *From:* Mich Talebzadeh 
> *Date:* Tuesday, 8 August 2023 06:53
> *Cc:* dev 
> *Subject:* [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+
>
>
>
> On the subject of dynamic allocation, is the following message a cause for
> concern when running Spark on k8s?
>
>
>
> INFO ExecutorAllocationManager: Dynamic allocation is enabled without a
> shuffle service.
>
>
> Mich Talebzadeh,
>
> Solutions Architect/Engineering Lead
>
> London
>
> United Kingdom
>
>
>
>
>
>
>
>
>
>
> On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh 
> wrote:
>
>
>
> Hi,
>
>
>
> From what I have seen, Spark on a serverless cluster has a hard time
> getting the driver going in a timely manner.
>
>
>
> Annotations:  autopilot.gke.io/resource-adjustment:
>
>
> {"input":{"containers":[{"limits":{"memory":"1433Mi"},"requests":{"cpu":"1","memory":"1433Mi"},"name":"spark-kubernetes-driver"}]},"output...
>
>   autopilot.gke.io/warden-version: 2.7.41
>
>
>
> This is on Spark 3.4.1 with Java 11, on both the host running spark-submit
> and the Docker image itself.
>
>
>
> I am not sure how relevant this is to this discussion but it looks like a
> kind of blocker for now. What config params can help here and what can be
> done?
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh,
>
> Solutions Architect/Engineering Lead
>
> London
>
> United Kingdom
>
>
>
>
>
>
>
>
>
>
> On Mon, 7 Aug 2023 at 22:39, Holden Karau  wrote:
>
> Oh great point
>
>
>
> On Mon, Aug 7, 2023 at 2:23 PM bo yang  wrote:
>
> Thanks Holden for bringing this up!
>
>
>
> Maybe another thing to think about is how to make dynamic allocation more
> friendly with Kubernetes and disaggregated shuffle storage?
>
>
>
>
>
>
>
> On Mon, Aug 7, 2023 at 1:27 PM Holden Karau  wrote:
>
> So I'm wondering if there is interest in revisiting some of how Spark is
> doing its dynamic allocation for Spark 4+?
>
>
>
> Some things that I've been thinking about:
>
>
>
> - Advisory user input (e.g. a way to say after X is done I know I need Y
> where Y might be a bunch of GPU machines)
>
> - Configurable tolerance (e.g. if we have at most Z% over target no-op)
>
> - Past runs of same job (e.g. stage X of job Y had a peak of K)
>
> - Faster executor launches (I'm a little fuzzy on what we can do here but
> one area, for example, is that we set up and tear down an RPC connection to
> the driver with a blocking call, which does seem to have some locking
> inside of the driver at first glance)
>
>
>
> Is this an area other folks are thinking about? Should I make an epic we
> can track ideas in? Or are folks generally happy with today's dynamic
> allocation 

Re: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

2023-08-08 Thread 齐赫
Spark 3.5 added a method `supportsReliableStorage` to `ShuffleDriverComponents`
which indicates whether shuffle data is written to a distributed filesystem or
persisted in a remote shuffle service.
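
A minimal sketch of how a plugin might use it (the class and its behaviour
are hypothetical; only `supportsReliableStorage` is the actual new API):

    import java.util.{Collections, Map => JMap}
    import org.apache.spark.shuffle.api.ShuffleDriverComponents

    // Hypothetical driver components for a remote shuffle service backend.
    class RemoteShuffleDriverComponents extends ShuffleDriverComponents {
      override def initializeApplication(): JMap[String, String] =
        Collections.emptyMap()

      override def cleanupApplication(): Unit = ()

      // New in Spark 3.5: indicates shuffle data outlives executors, e.g.
      // because it is stored in a remote shuffle service.
      override def supportsReliableStorage(): Boolean = true
    }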
Uniffle is a general-purpose remote shuffle service
(https://github.com/apache/incubator-uniffle). It can enhance the experience
of Spark on K8S. After Spark 3.5 is released, Uniffle will support the
`ShuffleDriverComponents`; see [1].
If you are interested in more details about Uniffle, see [2]

[1] https://github.com/apache/incubator-uniffle/issues/802.
[2] 
https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era

From: Mich Talebzadeh 
Date: Tuesday, 8 August 2023 06:53
Cc: dev 
Subject: [Internet]Re: Improving Dynamic Allocation Logic for Spark 4+

On the subject of dynamic allocation, is the following message a cause for 
concern when running Spark on k8s?

INFO ExecutorAllocationManager: Dynamic allocation is enabled without a shuffle 
service.
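
For context (an observation, not a definitive answer): on k8s there is no
external shuffle service, so dynamic allocation is typically paired with
shuffle tracking instead, and that INFO line is expected in such a setup:

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true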

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


 




On Mon, 7 Aug 2023 at 23:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

From what I have seen, Spark on a serverless cluster has a hard time getting
the driver going in a timely manner.

Annotations:
  autopilot.gke.io/resource-adjustment: {"input":{"containers":[{"limits":{"memory":"1433Mi"},"requests":{"cpu":"1","memory":"1433Mi"},"name":"spark-kubernetes-driver"}]},"output...
  autopilot.gke.io/warden-version: 2.7.41

This is on Spark 3.4.1 with Java 11, on both the host running spark-submit
and the Docker image itself.

I am not sure how relevant this is to this discussion but it looks like a kind 
of blocker for now. What config params can help here and what can be done?
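
Possibly relevant knobs for the driver side (illustrative values; whether
they help on GKE Autopilot specifically is an open question):

    --conf spark.kubernetes.driver.request.cores=1 \
    --conf spark.kubernetes.driver.limit.cores=1 \
    --conf spark.driver.memory=2g \
    --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml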

Thanks

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


 
[https://ci3.googleusercontent.com/mail-sig/AIorK4zholKucR2Q9yMrKbHNn-o1TuS4mYXyi2KO6Xmx6ikHPySa9MLaLZ8t2hrA6AUcxSxDgHIwmKE]
   view my Linkedin 
profile

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Mon, 7 Aug 2023 at 22:39, Holden Karau <hol...@pigscanfly.ca> wrote:
Oh great point

On Mon, Aug 7, 2023 at 2:23 PM bo yang <bobyan...@gmail.com> wrote:
Thanks Holden for bringing this up!

Maybe another thing to think about is how to make dynamic allocation more 
friendly with Kubernetes and disaggregated shuffle storage?



On Mon, Aug 7, 2023 at 1:27 PM Holden Karau <hol...@pigscanfly.ca> wrote:
So I'm wondering if there is interest in revisiting some of how Spark is
doing its dynamic allocation for Spark 4+?

Some things that I've been thinking about:

- Advisory user input (e.g. a way to say after X is done I know I need Y where 
Y might be a bunch of GPU machines)
- Configurable tolerance (e.g. if we have at most Z% over target no-op)
- Past runs of same job (e.g. stage X of job Y had a peak of K)
- Faster executor launches (I'm a little fuzzy on what we can do here but one
area, for example, is that we set up and tear down an RPC connection to the
driver with a blocking call, which does seem to have some locking inside of
the driver at first glance)

Is this an area other folks are thinking about? Should I make an epic we can 
track ideas in? Or are folks generally happy with today's dynamic allocation 
(or just busy with other things)?

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
What do you think about removing HiveContext and even SQLContext?

And as an extension of this question, should we re-implement the Hive support
using the DSv2 API in Spark 4?
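
For context, `SparkSession` has been the intended replacement for both since
Spark 2.0; a minimal sketch:

    import org.apache.spark.sql.SparkSession

    // SparkSession subsumes SQLContext and HiveContext.
    val spark = SparkSession.builder()
      .appName("hive-example")
      .enableHiveSupport() // what HiveContext used to provide
      .getOrCreate()

    spark.sql("SHOW TABLES").show()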

Developers who want to implement a custom DataSource plugin may want to learn
from the Spark built-in ones[1], and Hive is a good candidate; a legacy
implementation may confuse those developers.

It was discussed/requested in [2][3][4][5]

There have been requests for multiple Hive metastore support[6], and I have
seen users choose Presto/Trino over Spark because the former supports
multiple HMS instances.

BTW, there are known third-party Hive DSv2 implementations[7][8].

[1] https://www.mail-archive.com/dev@spark.apache.org/msg30353.html
[2] https://www.mail-archive.com/dev@spark.apache.org/msg25715.html
[3] https://issues.apache.org/jira/browse/SPARK-31241
[4] https://issues.apache.org/jira/browse/SPARK-39797
[5] https://issues.apache.org/jira/browse/SPARK-44518
[6] https://www.mail-archive.com/dev@spark.apache.org/msg30228.html
[7] https://github.com/permanentstar/spark-sql-dsv2-extension
[8] 
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive

Thanks,
Cheng Pan


> On Aug 8, 2023, at 10:09, Wenchen Fan  wrote:
> 
> I think the principle is we should remove things that block us from 
> supporting new things like Java 21, or come with a significant maintenance 
> cost. If there is no benefit to removing deprecated APIs (just to keep the 
> codebase clean?), I'd prefer to leave them there and not bother.
> 
> On Tue, Aug 8, 2023 at 9:00 AM Jia Fan  wrote:
> Thanks Sean for opening this discussion.
> 
> 1. I think dropping Scala 2.12 is a good option.
> 
> 2. Personally, I think we should remove most methods that have been
> deprecated since 2.x/1.x unless no good replacement can be found. There is
> already a 3.x version as a buffer, and I don't think it is good practice to
> use deprecated 2.x methods on 4.x.
> 
> 3. For Mesos, I think we should remove it from doc first.
> 
> 
> Jia Fan
> 
> 
> 
>> 2023年8月8日 05:47,Sean Owen  写道:
>> 
>> While we're noodling on the topic, what else might be worth removing in 
>> Spark 4?
>> 
>> For example, looks like we're finally hitting problems supporting Java 8 
>> through 21 all at once, related to Scala 2.13.x updates. It would be 
>> reasonable to require Java 11, or even 17, as a baseline for the multi-year 
>> lifecycle of Spark 4.
>> 
>> Dare I ask: drop Scala 2.12? Supporting 2.12 / 2.13 / 3.0 might get hard
>> otherwise.
>> 
>> There was a good discussion about whether old deprecated methods should be 
>> removed. They can't be removed at other times, but that doesn't mean they all
>> should be. createExternalTable was brought up as a first example. What 
>> deprecated methods are worth removing?
>> 
>> There's Mesos support, long since deprecated, which seems like something to 
>> prune.
>> 
>> Are there old Hive/Hadoop version combos we should just stop supporting?
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What else could be removed in Spark 4?

2023-08-08 Thread Cheng Pan
> Are there old Hive/Hadoop version combos we should just stop supporting?

Dropping support for Java 8 means dropping support for Hive versions lower
than 2.0 (exclusive)[1].

IsolatedClientLoader is aimed at allowing different Hive jars to be used to
communicate with different versions of the HMS. AFAIK, the current built-in
Hive 2.3.9 client works well for communicating with Hive Metastore servers
from 2.1 through 3.1 (maybe 2.0 too, not sure). This brings a new question:
is IsolatedClientLoader required then?

I think we should drop IsolatedClientLoader because:

1. As explained above, we can use the built-in Hive 2.3.9 client to
communicate with HMS 2.1+.
2. Since SPARK-42539[2], the default Hive 2.3.9 client does not use
IsolatedClientLoader, and as explained in SPARK-42539, IsolatedClientLoader
causes some inconsistent behaviors.
3. It blocks upgrading Guava. HIVE-27560[3] aims to make Hive 2.3.10
(unreleased) compatible with all Guava 14+ versions, but unfortunately Guava
is marked as `isSharedClass`[4] in IsolatedClientLoader, so technically, if
we want to upgrade Guava we would need to make all supported Hive versions
(2.1.x through 3.1.x) support a high version of Guava, which I think is
impossible.
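
For reference, the isolated client loader is what sits behind these
user-facing settings today (illustrative values):

    # Which HMS version to talk to, and where the client jars come from:
    spark.sql.hive.metastore.version=3.1.3
    spark.sql.hive.metastore.jars=maven          # or: builtin, path
    # spark.sql.hive.metastore.jars.path=/opt/hive-jars/*   (when jars=path)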

[1] 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
[2] https://issues.apache.org/jira/browse/SPARK-42539
[3] https://issues.apache.org/jira/browse/HIVE-27560
[4] https://github.com/apache/spark/pull/33989#issuecomment-926277286

Thanks,
Cheng Pan


> On Aug 8, 2023, at 05:47, Sean Owen  wrote:
> 
> Are there old Hive/Hadoop version combos we should just stop supporting?



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org