Re: [VOTE] Release Apache Celeborn 0.4.1-rc1

2024-05-20 Thread Mridul Muralidharan
+1

Signatures, digests, etc. check out fine.
Checked out tag and built/tested with "-Pspark-3.1".

Regards,
Mridul


On Sun, May 19, 2024 at 10:19 PM rexxiong  wrote:

> +1 (binding)
> I checked
> - Download links are valid.
> - git commit hash is correct
> - Checksums and signatures are valid.
> - No binary files in the source release
> - Successfully built the binary from the source on macOS with the command:
> ./build/make-distribution.sh -Pspark-3.3
>
> I also tested compatibility with version 0.4.0 by upgrading the
> master/worker from 0.4.0 to 0.4.1. Using a 0.4.0 client to access the 0.4.1
> master/worker, everything worked well.
>
> Thanks,
> Jiashu Xiong
>
> > On Fri, May 17, 2024 at 18:47 Yihe Li  wrote:
>
> > +1 (non-binding)
> > I checked the following things:
> > - git commit hash is correct.
> > - download links are valid.
> > - release files are in correct location.
> > - signatures and checksums are good.
> > - LICENSE and NOTICE files exist.
> > - build succeeds from source code (Ubuntu 16.04).
> > ```
> > ./build/make-distribution.sh --sbt-enabled -Pspark-3.3
> > ```
> >
> > Thanks,
> > Yihe Li
> >
> > On 2024/05/17 01:53:48 angers zhu wrote:
> > > +1
> > >
> > > - Checked license
> > > - checked doc
> > > - checked build from source with spark-3.2
> > >
> > > On Tue, May 14, 2024 at 12:13 Nicholas Jiang  wrote:
> > >
> > > > Hi Celeborn community,
> > > >
> > > > This is a call for a vote to release Apache Celeborn
> > > >
> > > > 0.4.1-rc1
> > > >
> > > >
> > > > The git tag to be voted upon:
> > > >
> > > > https://github.com/apache/celeborn/releases/tag/v0.4.1-rc1
> > > >
> > > > The git commit hash:
> > > > 641180142c5ef36430a6afcd702c9487a6007458
> > > >
> > > > Source and binary artifacts can be found at:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/celeborn/v0.4.1-rc1
> > > >
> > > > The staging repo:
> > > >
> > > >
> >
> https://repository.apache.org/content/repositories/orgapacheceleborn-1055
> > > >
> > > >
> > > > Fingerprint of the PGP key the release artifacts are signed with:
> > > > D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> > > >
> > > > My public key to verify signatures can be found in:
> > > >
> > > > https://dist.apache.org/repos/dist/release/celeborn/KEYS
> > > >
> > > > The vote will be open for at least 72 hours or until the necessary
> > > > number of votes is reached.
> > > >
> > > > Please vote accordingly:
> > > >
> > > > [ ] +1 approve
> > > > [ ] +0 no opinion
> > > > [ ] -1 disapprove (and the reason)
> > > >
> > > > Steps to validate the release:
> > > >
> > > > https://www.apache.org/info/verification.html
> > > >
> > > > * Download links, checksums and PGP signatures are valid.
> > > > * Source code distributions have correct names matching the current
> > > > release.
> > > > * LICENSE and NOTICE files are correct.
> > > > * All files have license headers if necessary.
> > > > * No unlicensed compiled archives bundled in source archive.
> > > > * The source tarball matches the git tag.
> > > > * Build from source is successful.
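
For illustration, one of the checklist items above - re-computing a checksum -
can be done with any SHA-512 tool; below is a minimal Java sketch (the class
name and the artifact-path argument are hypothetical). Signatures still need to
be verified separately, e.g. with gpg against the KEYS file.

```
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Illustrative only: print the SHA-512 of the file given as the first
// argument, for comparison against the published .sha512 value.
public final class ChecksumCheck {
  public static void main(String[] args) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-512");
    try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex); // must match the .sha512 file contents
  }
}
```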
> > > >
> > > > Regards,
> > > > Nicholas Jiang
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-04 Thread Mridul Muralidharan
Ah ! Then it makes sense to not include it :-)
Thanks for clarifying !

Regards,
Mridul


On Sat, May 4, 2024 at 4:15 AM Keyong Zhou  wrote:

> Actually it's the second one. For the first one, I didn't send the draft
> to the dev mailing list for discussion because of lack of experience...
>
> Regards,
> Keyong Zhou
>
> On Fri, May 3, 2024 at 23:38 Mridul Muralidharan  wrote:
>
> > Hi,
> >
> >   I meant calling it out as part of the board report, so that it is
> > captured in our updates to the board.
> >
> > This is the first update post TLP, right ?
> >
> > Regards,
> > Mridul
> >
> > On Fri, May 3, 2024 at 1:41 AM Keyong Zhou  wrote:
> >
> > > Hi Mridul,
> > >
> > > The news is posted in the following links:
> > >
> > > Apache.org:
> > >
> > >
> >
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
> > >
> > > Newswire:
> > >
> > >
> >
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
> > >
> > > X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
> > >
> > > LinkedIn:
> > >
> https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
> > >
> > > Besides, we also posted a blog here (in Chinese :D) :
> > > https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
> > >
> > > It'd be great if we can call it out louder. Do you have any ideas? : )
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > On Fri, May 3, 2024 at 07:40 Mridul Muralidharan  wrote:
> > >
> > > > Hi,
> > > >
> > > >   Do we want to call out graduation to TLP ?
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou 
> wrote:
> > > >
> > > > > Hi community,
> > > > >
> > > > > The board report is due on May 8th; following is the draft I made.
> > > > > Any comments will be appreciated, thanks!
> > > > >
> > > > > ## Description:
> > > > > The mission of Apache Celeborn is the creation and maintenance of
> > > > > software related to an intermediate data service for big data
> > > > > computing engines to boost performance, stability, and flexibility.
> > > > >
> > > > > ## Project Status:
> > > > > Current project status: New
> > > > > Issues for the board: None
> > > > >
> > > > > ## Membership Data:
> > > > > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > > > > There are currently 21 committers and 13 PMC members in this
> project.
> > > > > The Committer-to-PMC ratio is roughly 3:2.
> > > > >
> > > > > Community changes, past quarter:
> > > > >
> > > > > - No new PMC members (project graduated recently).
> > > > > - Chandni Singh was added as committer on 2024-03-21.
> > > > > - Mridul Muralidharan was added as committer on 2024-04-29.
> > > > >
> > > > > ## Project Activity:
> > > > > Software development activity:
> > > > >
> > > > >  - We are preparing to release 0.4.1 in May.
> > > > >  - We are preparing to release 0.5.0 in May.
> > > > >  - Security support (authentication and SSL) has been merged.
> > > > >  - Memory storage is close to being merged.
> > > > >
> > > > > Meetups and Conferences:
> > > > >
> > > > >  - An online meetup was held on April 16th with some developers.
> > > > >  - An online meetup was held on April 25th with some users.
> > > > >
> > > > > Recent releases:
> > > > >
> > > > > - 0.4.0-incubating was released on 2024-02-06.
> > > > > - 0.3.2-incubating was released on 2024-01-08.
> > > > >
> > > > > ## Community Health:
> > > > > Overall community health is good. In the past quarter, the
> > > > > dev/issues/users mailing lists saw 6%/11%/1100% increases in traffic,
> > > > > respectively. We expect issues traffic to be steady, but there may be
> > > > > fluctuation in dev/users traffic because many discussions happen in
> > > > > Slack/WeChat/DingTalk. We are encouraging more discussion to happen on
> > > > > the mailing lists.
> > > > >
> > > > > We have been performing extensive outreach to our users and
> > > > > encouraging them to contribute back to the project. We are also
> > > > > speaking actively at various conferences to attract more users.
> > > > >
> > > > > Regards,
> > > > > Keyong Zhou
> > > > >
> > > >
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-03 Thread Mridul Muralidharan
Hi,

  I meant calling it out as part of the board report, so that it is captured
in our updates to the board.

This is the first update post TLP, right ?

Regards,
Mridul

On Fri, May 3, 2024 at 1:41 AM Keyong Zhou  wrote:

> Hi Mridul,
>
> The news is posted in the following links:
>
> Apache.org:
>
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
>
> Newswire:
>
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
>
> X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
>
> LinkedIn:
> https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
>
> Besides, we also posted a blog here (in Chinese :D) :
> https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
>
> It'd be great if we can call it out louder. Do you have any ideas? : )
>
> Regards,
> Keyong Zhou
>
> On Fri, May 3, 2024 at 07:40 Mridul Muralidharan  wrote:
>
> > Hi,
> >
> >   Do we want to call out graduation to TLP ?
> >
> > Regards,
> > Mridul
> >
> > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:
> >
> > > Hi community,
> > >
> > > The board report is due on May 8th; following is the draft I made.
> > > Any comments will be appreciated, thanks!
> > >
> > > ## Description:
> > > The mission of Apache Celeborn is the creation and maintenance of
> > > software related to an intermediate data service for big data computing
> > > engines to boost performance, stability, and flexibility.
> > >
> > > ## Project Status:
> > > Current project status: New
> > > Issues for the board: None
> > >
> > > ## Membership Data:
> > > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > > There are currently 21 committers and 13 PMC members in this project.
> > > The Committer-to-PMC ratio is roughly 3:2.
> > >
> > > Community changes, past quarter:
> > >
> > > - No new PMC members (project graduated recently).
> > > - Chandni Singh was added as committer on 2024-03-21.
> > > - Mridul Muralidharan was added as committer on 2024-04-29.
> > >
> > > ## Project Activity:
> > > Software development activity:
> > >
> > >  - We are preparing to release 0.4.1 in May.
> > >  - We are preparing to release 0.5.0 in May.
> > >  - Security support (authentication and SSL) has been merged.
> > >  - Memory storage is close to being merged.
> > >
> > > Meetups and Conferences:
> > >
> > >  - An online meetup was held on April 16th with some developers.
> > >  - An online meetup was held on April 25th with some users.
> > >
> > > Recent releases:
> > >
> > > - 0.4.0-incubating was released on 2024-02-06.
> > > - 0.3.2-incubating was released on 2024-01-08.
> > >
> > > ## Community Health:
> > > Overall community health is good. In the past quarter, the
> > > dev/issues/users mailing lists saw 6%/11%/1100% increases in traffic,
> > > respectively. We expect issues traffic to be steady, but there may be
> > > fluctuation in dev/users traffic because many discussions happen in
> > > Slack/WeChat/DingTalk. We are encouraging more discussion to happen on
> > > the mailing lists.
> > >
> > > We have been performing extensive outreach to our users and encouraging
> > > them to contribute back to the project. We are also speaking actively at
> > > various conferences to attract more users.
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-02 Thread Mridul Muralidharan
Hi,

  Do we want to call out graduation to TLP ?

Regards,
Mridul

On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:

> Hi community,
>
> The board report is due on May 8th; following is the draft I made.
> Any comments will be appreciated, thanks!
>
> ## Description:
> The mission of Apache Celeborn is the creation and maintenance of software
> related to an intermediate data service for big data computing engines to
> boost performance, stability, and flexibility.
>
> ## Project Status:
> Current project status: New
> Issues for the board: None
>
> ## Membership Data:
> Apache Celeborn was founded 2024-03-20 (2 months ago).
> There are currently 21 committers and 13 PMC members in this project.
> The Committer-to-PMC ratio is roughly 3:2.
>
> Community changes, past quarter:
>
> - No new PMC members (project graduated recently).
> - Chandni Singh was added as committer on 2024-03-21.
> - Mridul Muralidharan was added as committer on 2024-04-29.
>
> ## Project Activity:
> Software development activity:
>
>  - We are preparing to release 0.4.1 in May.
>  - We are preparing to release 0.5.0 in May.
>  - Security support (authentication and SSL) has been merged.
>  - Memory storage is close to being merged.
>
> Meetups and Conferences:
>
>  - An online meetup was held on April 16th with some developers.
>  - An online meetup was held on April 25th with some users.
>
> Recent releases:
>
> - 0.4.0-incubating was released on 2024-02-06.
> - 0.3.2-incubating was released on 2024-01-08.
>
> ## Community Health:
> Overall community health is good. In the past quarter, the dev/issues/users
> mailing lists saw 6%/11%/1100% increases in traffic, respectively.
> We expect issues traffic to be steady, but there may be fluctuation in
> dev/users traffic because many discussions happen in Slack/WeChat/DingTalk.
> We are encouraging more discussion to happen on the mailing lists.
>
> We have been performing extensive outreach to our users and encouraging
> them to contribute back to the project. We are also speaking actively at
> various conferences to attract more users.
>
> Regards,
> Keyong Zhou
>


Re: [ANNOUNCE] Add Mridul Muralidharan as new committer

2024-04-29 Thread Mridul Muralidharan
Thank you everyone :-)
It has been a pleasure working with the Celeborn community, and I look
forward to continuing to learn from and contribute to the project !

Regards,
Mridul

On Mon, Apr 29, 2024 at 12:38 AM Fu Chen  wrote:

> Congratulations and thank you to Mridul for the contributions to the
> community!
>
> Regards,
> Fu Chen
>
> On Mon, Apr 29, 2024 at 12:10 Cheng Pan  wrote:
> >
> > Congrats Mridul, your expertise in Spark kernel, and the Security area
> are impressive.
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Apr 29, 2024, at 09:21, Keyong Zhou  wrote:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Project Management Committee (PMC) for Apache Celeborn
> > > has invited Mridul Muralidharan to become a committer and we are
> pleased
> > > to announce that he has accepted.
> > >
> > > Being a committer enables easier contribution to the
> > > project since there is no need to go via the patch
> > > submission process. This should enable better productivity.
> > > A PMC member helps manage and guide the direction of the project.
> > >
> > > Please join me in congratulating Mridul Muralidharan!
> > >
> > > Regards,
> > > Keyong Zhou
> >
>


[jira] [Updated] (SPARK-47919) credulousTrustStoreManagers should return empty array for accepted issuers

2024-04-19 Thread Mridul Muralidharan (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-47919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan updated SPARK-47919:

Priority: Minor  (was: Major)

> credulousTrustStoreManagers should return empty array for accepted issuers
> --
>
> Key: SPARK-47919
> URL: https://issues.apache.org/jira/browse/SPARK-47919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>    Reporter: Mridul Muralidharan
>Priority: Minor
>
> The {{TrustManager}} in {{SSLFactory.credulousTrustStoreManagers}} currently 
> returns {{null}}, but should return an empty array instead - as per the 
> javadoc and expectation.
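
For illustration, a minimal sketch of the contract in question - a hypothetical
trust-all manager, not Spark's actual class: the X509TrustManager javadoc
requires getAcceptedIssuers() to return a non-null (possibly empty) array.

{code:java}
import java.security.cert.X509Certificate;
import javax.net.ssl.X509TrustManager;

// Hypothetical "trust everything" manager, for illustration only.
final class CredulousTrustManager implements X509TrustManager {
  @Override
  public void checkClientTrusted(X509Certificate[] chain, String authType) {
    // accept any client certificate
  }

  @Override
  public void checkServerTrusted(X509Certificate[] chain, String authType) {
    // accept any server certificate
  }

  @Override
  public X509Certificate[] getAcceptedIssuers() {
    return new X509Certificate[0]; // empty array per the javadoc, never null
  }
}
{code}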



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47919) credulousTrustStoreManagers should return empty array for accepted issuers

2024-04-19 Thread Mridul Muralidharan (Jira)
Mridul Muralidharan created SPARK-47919:
---

 Summary: credulousTrustStoreManagers should return empty array for 
accepted issuers
 Key: SPARK-47919
 URL: https://issues.apache.org/jira/browse/SPARK-47919
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Mridul Muralidharan


The {{TrustManager}} in {{SSLFactory.credulousTrustStoreManagers}} currently 
returns {{null}}, but should return an empty array instead - as per the 
javadoc and expectation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Time for 0.4.1

2024-04-19 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Thu, Apr 18, 2024 at 11:50 PM Ethan Feng  wrote:

> +1
>
> Thanks,
> Ethan Feng
>
> On Tue, Apr 16, 2024 at 17:20 Yu Li  wrote:
> >
> > +1, thanks for driving this and volunteering as our RM, Nicholas!
> >
> > Best Regards,
> > Yu
> >
> > On Sat, 13 Apr 2024 at 10:31, Keyong Zhou  wrote:
> > >
> > > +1, thanks Nicholas for volunteering!
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > On Fri, Apr 12, 2024 at 22:03 Shaoyun Chen  wrote:
> > >
> > > > +1
> > > >
> > > > On Fri, Apr 12, 2024 at 20:04 Cheng Pan  wrote:
> > > > >
> > > > > +1, we do need a patch release for 0.4
> > > > >
> > > > > Thanks,
> > > > > Cheng Pan
> > > > >
> > > > >
> > > > > > On Apr 12, 2024, at 19:59, Nicholas Jiang <
> nicholasji...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > Hey, Celeborn community,
> > > > > >
> > > > > >
> > > > > > It has been a while since the 0.4.0 release, and some critical
> > > > > > fixes have landed in branch-0.4, for example, [CELEBORN-1252][FOLLOWUP]
> > > > > > Fix Worker#computeResourceConsumption NullPointerException for
> > > > > > userResourceConsumption that does not contain given userIdentifier.
> > > > > > From my perspective, it’s time to prepare for releasing 0.4.1.
> > > > > >
> > > > > >
> > > > > > WDYT? And I’m volunteering to be the release manager if no one
> has
> > > > applied.
> > > > > >
> > > > > > Regards,
> > > > > > Nicholas Jiang
> > > > >
> > > > >
> > > >
>


[jira] [Comment Edited] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-17 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838474#comment-17838474 ]

Mridul Muralidharan edited comment on SPARK-47172 at 4/18/24 5:47 AM:
--

We do not backport features to released versions - so TLS will be in 4.x, not 
3.x.
Given the security implications for SPARK-47318, it was backported to 3.4 and 
3.5 - as it was fixing a security issue in existing functionality.

This proposal reads like new feature development, which would typically be 
out of scope for 3.x.
Given TLS, it would not be very useful for 4.x either ?


was (Author: mridulm80):
We do not backport features to released versions - so TLS will be in 4.x, not 
3.x
Given the security implications for  SPARK-47318, it was backported to 3.4 and 
3.5 - as it was fixing a security issue in existing functionality.

This proposal reads like a new feature development, while would be out of scope 
for 3.x
Given TLS, not very useful for 4.x either ?

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 
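
For illustration, a minimal JCE sketch of the difference - generic cipher
usage, not Spark's transport code: AES/GCM authenticates the ciphertext, so
any in-transit modification fails at decrypt time, whereas AES/CTR provides
confidentiality but no integrity check.

{code:java}
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative only: encrypt with AES/GCM, prepending the 12-byte nonce.
final class GcmSketch {
  static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
    byte[] iv = new byte[12]; // 96-bit nonce, the recommended size for GCM
    new SecureRandom().nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] ct = cipher.doFinal(plaintext); // ciphertext || 16-byte auth tag
    byte[] out = new byte[iv.length + ct.length];
    System.arraycopy(iv, 0, out, 0, iv.length);
    System.arraycopy(ct, 0, out, iv.length, ct.length);
    return out;
  }
}
{code}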



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-17 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838474#comment-17838474 ]

Mridul Muralidharan commented on SPARK-47172:
-

We do not backport features to released versions - so TLS will be in 4.x, not 
3.x.
Given the security implications for SPARK-47318, it was backported to 3.4 and 
3.5 - as it was fixing a security issue in existing functionality.

This proposal reads like new feature development, which would be out of scope 
for 3.x.
Given TLS, it would not be very useful for 4.x either ?

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47172) Upgrade Transport block cipher mode to GCM

2024-04-16 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837927#comment-17837927 ]

Mridul Muralidharan commented on SPARK-47172:
-

Given we have addressed SPARK-47318, and with Spark supporting TLS from 4.0 - 
is this a concern ?
Given it will be backward incompatible, I am hesitant to expand support for 
something which is expected to go away "soon".

> Upgrade Transport block cipher mode to GCM
> --
>
> Key: SPARK-47172
> URL: https://issues.apache.org/jira/browse/SPARK-47172
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Steve Weis
>Priority: Minor
>
> The cipher transformation currently used for encrypting RPC calls is an 
> unauthenticated mode (AES/CTR/NoPadding). This needs to be upgraded to an 
> authenticated mode (AES/GCM/NoPadding) to prevent ciphertext from being 
> modified in transit.
> The relevant line is here: 
> [https://github.com/apache/spark/blob/a939a7d0fd9c6b23c879cbee05275c6fbc939e38/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L220]
> GCM is relatively more computationally expensive than CTR and adds a 16-byte 
> block of authentication tag data to each payload. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Spark 3.4.3 (RC2)

2024-04-15 Thread Mridul Muralidharan
+1

Signatures, digests, etc. check out fine.
Checked out tag and built/tested with -Phive -Pyarn -Pkubernetes.

Regards,
Mridul


On Sun, Apr 14, 2024 at 11:31 PM Dongjoon Hyun  wrote:

> I'll start with my +1.
>
> - Checked checksum and signature
> - Checked Scala/Java/R/Python/SQL Document's Spark version
> - Checked published Maven artifacts
> - All CIs passed.
>
> Thanks,
> Dongjoon.
>
> On 2024/04/15 04:22:26 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.4.3.
> >
> > The vote is open until April 18th 1AM (PDT) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.4.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.4.3-rc2 (commit
> > 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f)
> > https://github.com/apache/spark/tree/v3.4.3-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1453/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.4.3-rc2-docs/
> >
> > The list of bug fixes going into 3.4.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353987
> >
> > This release is using the release script of the tag v3.4.3-rc2.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala,
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.4.3?
> > ===
> >
> > The current list of open tickets targeted at 3.4.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.4.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[jira] [Assigned] (SPARK-47253) Allow LiveEventBus to stop without the completely draining of event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-47253:
---

Assignee: TakawaAkirayo

> Allow LiveEventBus to stop without the completely draining of event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> # Problem statement:
> SparkContext.stop() hangs for a long time on LiveEventBus.stop() when 
> listeners are slow.
> # User scenarios:
> We have a centralized service with multiple instances to regularly execute 
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process with an account 
> defined in the task.
> 2. Instantiate listeners by class names and register them with the 
> SparkContext. The JARs containing the listener classes are uploaded to the 
> service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, it waits for the LiveEventBus to stop, which requires the 
> remaining events to be completely drained by each listener.
> Since the listeners are implemented by users and we cannot prevent heavy work 
> within a listener on each event, there are cases where a single heavy job has 
> over 30,000 tasks, and it could take 30 minutes for a listener to process all 
> the remaining events, because the listener takes a coarse-grained global lock 
> and updates its internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the 
> server-side perspective, we need a guarantee that the stop operation 
> finishes quickly.
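
For illustration, the general shape of a bounded stop - a hypothetical sketch,
not the actual LiveEventBus change: wait for the queue to drain only up to a
configured timeout instead of indefinitely.

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: stop() gives up after a deadline rather than waiting
// for slow user-supplied listeners to drain every remaining event.
final class BoundedDrainQueue {
  private final LinkedBlockingQueue<Runnable> events = new LinkedBlockingQueue<>();

  /** Returns true if fully drained, false if the timeout expired first. */
  boolean stop(long timeout, TimeUnit unit) throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!events.isEmpty()) {
      if (System.nanoTime() >= deadline) {
        return false; // undelivered events are abandoned
      }
      TimeUnit.MILLISECONDS.sleep(10); // poll until drained or deadline
    }
    return true;
  }
}
{code}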



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47253) Allow LiveEventBus to stop without the completely draining of event queue

2024-04-12 Thread Mridul Muralidharan (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-47253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-47253.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45367
[https://github.com/apache/spark/pull/45367]

> Allow LiveEventBus to stop without the completely draining of event queue
> -
>
> Key: SPARK-47253
> URL: https://issues.apache.org/jira/browse/SPARK-47253
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Assignee: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> # Problem statement:
> SparkContext.stop() hangs for a long time on LiveEventBus.stop() when 
> listeners are slow.
> # User scenarios:
> We have a centralized service with multiple instances to regularly execute 
> users' scheduled tasks.
> For each user task within one service instance, the process is as follows:
> 1. Create a Spark session directly within the service process with an account 
> defined in the task.
> 2. Instantiate listeners by class names and register them with the 
> SparkContext. The JARs containing the listener classes are uploaded to the 
> service by the user.
> 3. Prepare resources.
> 4. Run user logic (Spark SQL).
> 5. Stop the Spark session by invoking SparkSession.stop().
> In step 5, it waits for the LiveEventBus to stop, which requires the 
> remaining events to be completely drained by each listener.
> Since the listeners are implemented by users and we cannot prevent heavy work 
> within a listener on each event, there are cases where a single heavy job has 
> over 30,000 tasks, and it could take 30 minutes for a listener to process all 
> the remaining events, because the listener takes a coarse-grained global lock 
> and updates its internal status in a remote database.
> This kind of delay affects other user tasks in the queue. Therefore, from the 
> server-side perspective, we need a guarantee that the stop operation 
> finishes quickly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan updated SPARK-47318:

Affects Version/s: 3.5.0
   3.4.0

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 3.4.0, 3.5.0, 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto] 
> based on the NNpsk0 Noise pattern and using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.
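
For illustration, an HKDF-style round (RFC 5869) over the raw shared secret,
built on HmacSHA256 - a hypothetical sketch, not Spark's actual AuthEngine
code: the raw X coordinate feeds a KDF and only the KDF output is used as key
material.

{code:java}
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: HKDF extract + first expand block via HmacSHA256.
final class KdfSketch {
  private static byte[] hmac(byte[] key, byte[]... parts) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(key, "HmacSHA256"));
    for (byte[] p : parts) {
      mac.update(p);
    }
    return mac.doFinal();
  }

  /** Derives a 32-byte key from the raw ECDH shared secret. */
  static byte[] deriveKey(byte[] rawSecret, byte[] salt, byte[] info) throws Exception {
    byte[] prk = hmac(salt, rawSecret);            // HKDF-Extract
    return hmac(prk, info, new byte[] {(byte) 1}); // HKDF-Expand, block T(1)
  }
}
{code}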



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47318) AuthEngine key exchange needs additional KDF round

2024-04-11 Thread Mridul Muralidharan (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-47318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-47318.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45425
[https://github.com/apache/spark/pull/45425]

>  AuthEngine key exchange needs additional KDF round
> ---
>
> Key: SPARK-47318
> URL: https://issues.apache.org/jira/browse/SPARK-47318
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 4.0.0
>Reporter: Steve Weis
>Assignee: Steve Weis
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> AuthEngine implements a bespoke [key exchange 
> protocol|https://github.com/apache/spark/tree/master/common/network-common/src/main/java/org/apache/spark/network/crypto] 
> based on the NNpsk0 Noise pattern and using X25519.
> The Spark code improperly uses the derived shared secret directly, which is 
> an encoded X coordinate. This should be passed into a KDF rather than used 
> directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587 ]

Mridul Muralidharan edited comment on SPARK-47759 at 4/10/24 4:08 AM:
--

In order to validate, I would suggest two things.
Wrap the input str (or lower), within quotes, in the exception message ... for 
example, "120s\u00A0" will look like 120s in the exception message, as it ends 
in a Unicode non-breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two changes in place, it 
should help us debug this better.


was (Author: mridulm80):
In order to validate, I would suggest two things.
Wrap the input str, within quote, in the exception message ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.f
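
For illustration, a minimal sketch of the two diagnostics suggested in the
comment above - a hypothetical method, not the actual JavaUtils code: quoting
the raw input makes an invisible trailing character such as \u00A0 visible in
the message, and initCause preserves the underlying error.

{code:java}
final class TimeParseSketch {
  // Illustrative only; the real parser handles all supported time suffixes.
  static long timeStringAsSeconds(String str) {
    try {
      return Long.parseLong(str.trim().replaceAll("[a-z]+$", ""));
    } catch (NumberFormatException e) {
      NumberFormatException wrapped = new NumberFormatException(
          "Failed to parse time string: \"" + str + "\""); // quotes expose \u00A0
      wrapped.initCause(e); // keep the original cause for debugging
      throw wrapped;
    }
  }
}
{code}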

[jira] [Comment Edited] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587 ]

Mridul Muralidharan edited comment on SPARK-47759 at 4/10/24 4:08 AM:
--

In order to validate, I would suggest two things.
Wrap the input str, within quote, in the exception message ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.


was (Author: mridulm80):
In order to validate, I would suggest two things.
Wrap the input str, within quote, to the exception ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.f

[jira] [Commented] (SPARK-47759) Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate time string

2024-04-09 Thread Mridul Muralidharan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835587#comment-17835587 ]

Mridul Muralidharan commented on SPARK-47759:
-

In order to validate, I would suggest two things.
Wrap the input str, within quote, to the exception ... for example, 
"120s\u00A0" will look like 120s in the exception message as it is a unicode 
non breaking space.
The other would be to include the NumberFormatException 'e' as the cause in the 
exception being thrown.

Once you are able to get a stack trace with these two change in place, it 
should help us debug this better.

> Apps being stuck after JavaUtils.timeStringAs fails to parse a legitimate 
> time string
> -
>
> Key: SPARK-47759
> URL: https://issues.apache.org/jira/browse/SPARK-47759
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Bo Xiong
>Assignee: Bo Xiong
>Priority: Critical
>  Labels: hang, pull-request-available, stuck, threadsafe
> Fix For: 3.5.0, 4.0.0, 3.5.1, 3.5.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h2. Symptom
> It's observed that our Spark apps occasionally got stuck with an unexpected 
> stack trace when reading/parsing a legitimate time string. Note that we 
> manually killed the stuck app instances and the retry goes thru on the same 
> cluster (without requiring any app code change).
>  
> *[Stack Trace 1]* The stack trace doesn't make sense since *120s* is a 
> legitimate time string, where the app runs on emr-7.0.0 with Spark 3.5.0 
> runtime.
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.NumberFormatException: Time 
> must be specified as seconds (s), milliseconds (ms), microseconds (us), 
> minutes (m or min), hour (h), or day (d). E.g. 50s, 100ms, or 250us.
> Failed to parse time string: 120s
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAs(JavaUtils.java:258)
> at 
> org.apache.spark.network.util.JavaUtils.timeStringAsSec(JavaUtils.java:275)
> at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1166)
> at org.apache.spark.rpc.RpcTimeout$.apply(RpcTimeout.scala:131)
> at org.apache.spark.util.RpcUtils$.askRpcTimeout(RpcUtils.scala:41)
> at org.apache.spark.rpc.RpcEndpointRef.<init>(RpcEndpointRef.scala:33)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.<init>(NettyRpcEnv.scala:533)
> at org.apache.spark.rpc.netty.RequestMessage$.apply(NettyRpcEnv.scala:640)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:697)
> at 
> org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:682)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChann

Re: Versioning of Spark Operator

2024-04-09 Thread Mridul Muralidharan
  I am trying to understand if we can simply align with Spark's version for
this ?
It makes release and JIRA management much simpler for developers and more
intuitive for users.

Regards,
Mridul


On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  wrote:

> Hi, Liang-Chi.
>
> Thank you for leading Apache Spark K8s operator as a shepherd.
>
> I took a look at `Apache Spark Connect Go` repo mentioned in the thread.
> Sadly, there is no release at all and no activity since last 6 months. It
> seems to be the first time for Apache Spark community to consider these
> sister repositories (Go and K8s Operator).
>
> https://github.com/apache/spark-connect-go/commits/master/
>
> Dongjoon.
>
> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > Hi all,
> >
> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > and the first PR is created.
> > Thank you for the review from the community so far.
> >
> > About the versioning of Spark Operator, there are questions.
> >
> > As we are using Spark JIRA, when we are going to merge PRs, we need to
> > choose a Spark version. However, the Spark Operator is versioned
> > differently than Spark. I'm wondering how we should deal with this?
> >
> > Not sure if Connect also has versioning different from Spark? If so,
> > maybe we can follow how Connect does it.
> >
> > Can someone who is familiar with Connect versioning give some
> suggestions?
> >
> > Thank you.
> >
> > Liang-Chi
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.4.3 (?)

2024-04-06 Thread Mridul Muralidharan
Hi Dongjoon,

  Thanks for volunteering !
I would suggest waiting for SPARK-47318 to be merged as well for 3.4.

Regards,
Mridul

On Sat, Apr 6, 2024 at 6:49 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Mon, Apr 1, 2024 at 11:26 PM Holden Karau  wrote:

> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Mon, Apr 1, 2024 at 5:44 PM Xinrong Meng  wrote:
>
>> +1
>>
>> Thank you @Hyukjin Kwon 
>>
>> On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
>> wrote:
>>
>>> +1
>>> --
>>> *From:* Denny Lee 
>>> *Sent:* Monday, April 1, 2024 10:06:14 AM
>>> *To:* Hussein Awala 
>>> *Cc:* Chao Sun ; Hyukjin Kwon ;
>>> Mridul Muralidharan ; dev 
>>> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>>>
>>> +1 (non-binding)
>>>
>>>
>>> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>>>
>>> +1 (non-binding). Adding to the difference it will make: it will also
>>> simplify package maintenance and allow easily releasing a bug fix/new
>>> feature without needing to wait for a PySpark release.
>>>
>>> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Oh I didn't send the discussion thread out as it's pretty simple,
>>> non-invasive and the discussion was sort of done as part of the Spark
>>> Connect initial discussion ..
>>>
>>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>>> wrote:
>>>
>>>
>>> Can you point me to the SPIP’s discussion thread please ?
>>> I was not able to find it, but I was on vacation, and so might have
>>> missed this …
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>>  wrote:
>>>
>>> +1
>>>
>>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>> Connect)
>>>
>>> JIRA <https://issues.apache.org/jira/browse/SPARK-47540>
>>> Prototype <https://github.com/apache/spark/pull/45053>
>>> SPIP doc
>>> <https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.
>>>
>>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Mridul Muralidharan
Can you point me to the SPIP’s discussion thread please ?
I was not able to find it, but I was on vacation, and so might have missed
this …


Regards,
Mridul

On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
 wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [ANNOUNCE] Apache Celeborn is graduated to Top Level Project

2024-03-26 Thread Mridul Muralidharan
Congratulations !!

Regards,
Mridul


On Tue, Mar 26, 2024 at 11:54 PM Nicholas Jiang 
wrote:

> Congratulations! Witness the continuous development of the community.
>
> Regards,
> Nicholas Jiang
> At 2024-03-25 20:49:36, "Ethan Feng"  wrote:
> >Hello Celeborn community,
> >
> >I am glad to share that the ASF board has approved a resolution to
> >graduate Celeborn into a full Top Level Project. Thank you all for
> >your help in reaching this milestone.
> >
> >To transition from the Apache Incubator to a new TLP, there are a few
> >action items[1] we need to complete. I have opened an
> >Umbrella Issue[2] to track the tasks, and you are welcome to take on
> >the sub-tasks and leave comments if I have missed anything.
> >
> >Additionally, the GitHub repository migration is already complete[3].
> >Please update your local git repository to track the new repo[4]. If
> >you named the upstream as "apache", you can run the following command
> >to complete the remote repo tracking migration.
> >
> >` git remote set-url apache git@github.com:apache/celeborn.git `
> >
> >Please find the relevant URLs below:
> >[1]
> https://incubator.apache.org/guides/transferring.html#life_after_graduation
> >[2] https://github.com/apache/celeborn/issues/2415
> >[3] https://issues.apache.org/jira/browse/INFRA-25635
> >[4] https://github.com/apache/celeborn
> >
> >Thanks,
> >Ethan Feng
>
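
For anyone retargeting a local clone: the archive has obfuscated the address in the command quoted above; the usual SSH form is git@github.com. A quick sanity check after retargeting (a sketch, not part of the original announcement):

$ git remote set-url apache git@github.com:apache/celeborn.git
$ git remote -v     # fetch and push should now point at apache/celeborn
$ git fetch apache  # confirms the new remote is reachable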


Re: Maven 'stuck' in service test compilation ?

2024-03-21 Thread Mridul Muralidharan
Hi Ethan,

  Thanks for checking !
It appears that my desktop java version had been upgraded, which resulted
in the failures ...
Reverting it back to java 8 fixed the issues seen.

Regards,
Mridul


On Thu, Mar 21, 2024 at 7:27 AM Ethan Feng  wrote:

> Hi Mridul,
>
> I've tried your scripts in my local environment (JDK8) and the problem
> is not reproduced. Both Maven 3.8.8 and 3.9.6 were tested.
> Changing  might help as I've encountered some maven bugs
> before.
>
> I think we need more information about how to reproduce this
> problem, like the environment information, JDK version, etc.
>
> Regards,
> Ethan Feng
>
> Mridul Muralidharan  wrote on Thu, Mar 21, 2024 at 15:25:
> >
> > Hi,
> >
> >
> >   I am observing that a maven build gets 'stuck' when compiling
> "services"
> > for running tests.
> > Without tests, this goes through:
> >
> > $ ARGS="-Pspark-3.1"; ./build/mvn $ARGS clean 2>&1 | tee clean_output.txt
> > && ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt
> >
> > This gets stuck indefinitely:
> >
> > $ ARGS="-Pspark-3.1"; ./build/mvn  $ARGS package 2>&1 | tee
> test_output.txt
> > See [1] for output snippet.
> >
> > Strangely, running with -X seemed to be fine (the one time I tried it).
> >
> > I have made some dependency changes to pom.xml, but no changes to
> > service module.
> >
> > Anything I am missing here ? Any hints would be greatly appreciated :-)
> >
> > Thanks !
> > Mridul
> >
> > [1]
> >
> > [INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @
> > celeborn-service_2.12 ---
> > [INFO] Using 'UTF-8' encoding to copy filtered resources.
> > [INFO] Using 'UTF-8' encoding to copy filtered properties files.
> > [INFO] Copying 1 resource
> > [INFO] Copying 3 resources
> > [INFO]
> > [INFO] --- scala-maven-plugin:4.7.2:compile (scala-compile-first) @
> > celeborn-service_2.12 ---
> > [INFO] Using incremental compilation using Mixed compile order
> > [INFO] Compiler bridge file:
> >
> /home/mridul/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.7.1-bin_2.12.10__61.0-1.7.1_20220712T022208.jar
> > [INFO] compiler plugin:
> > BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
> > [INFO] compiling 4 Scala sources and 9 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 2 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 1 Scala source and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubato
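
Since the hang above turned out to be caused by an unintended JDK upgrade, one guard is to pin the build shell to JDK 8 explicitly before invoking Maven. A sketch; the JDK path is hypothetical and should be adjusted to the local install:

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk   # hypothetical path
$ export PATH="$JAVA_HOME/bin:$PATH"
$ java -version   # expect 1.8.x before building
$ ARGS="-Pspark-3.1"; ./build/mvn $ARGS package 2>&1 | tee test_output.txt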

Re: [ANNOUNCE] Add Chandni Singh as new committer

2024-03-21 Thread Mridul Muralidharan
Congratulations Chandni ! Great job :-)

Regards,
Mridul


On Thu, Mar 21, 2024 at 3:30 AM Keyong Zhou  wrote:

> Hi Celeborn Community,
>
> The Podling Project Management Committee (PPMC) for Apache Celeborn
> has invited Chandni Singh to become a committer and we are pleased
> to announce that she has accepted.
>
> Being a committer enables easier contribution to the
> project since there is no need to go via the patch
> submission process. This should enable better productivity.
> A (P)PMC member helps manage and guide the direction of the project.
>
> Please join me in congratulating Chandni Singh!
>
> Thanks,
> Keyong Zhou
>



Maven 'stuck' in service test compilation ?

2024-03-21 Thread Mridul Muralidharan
Hi,


  I am observing that a maven build gets 'stuck' when compiling "services"
for running tests.
Without tests, this goes through:

$ ARGS="-Pspark-3.1"; ./build/mvn $ARGS clean 2>&1 | tee clean_output.txt
&& ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt

This gets stuck indefinitely:

$ ARGS="-Pspark-3.1"; ./build/mvn  $ARGS package 2>&1 | tee test_output.txt
See [1] for output snippet.

Strangely, running with -X seemed to be fine (the one time I tried it).

I have made some dependency changes to pom.xml, but no changes to
service module.

Anything I am missing here ? Any hints would be greatly appreciated :-)

Thanks !
Mridul

[1]

[INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @
celeborn-service_2.12 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Using 'UTF-8' encoding to copy filtered properties files.
[INFO] Copying 1 resource
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:4.7.2:compile (scala-compile-first) @
celeborn-service_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file:
/home/mridul/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.7.1-bin_2.12.10__61.0-1.7.1_20220712T022208.jar
[INFO] compiler plugin:
BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] compiling 4 Scala sources and 9 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 2 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 1 Scala source and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling

And then it keeps repeating this indefinitely.


[jira] [Assigned] (SPARK-45375) [CORE] Mark connection as timedOut in TransportClient.close

2024-03-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45375:
---

Assignee: Hasnain Lakhani

> [CORE] Mark connection as timedOut in TransportClient.close
> ---
>
> Key: SPARK-45375
> URL: https://issues.apache.org/jira/browse/SPARK-45375
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Avoid a race condition where a connection which is in the process of being 
> closed could be returned by the TransportClientFactory only to be immediately 
> closed and cause errors upon use
>  
> This doesn't happen much in practice but is observed more frequently as part 
> of efforts to add SSL support






[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:28 PM:
--

Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

I did it for this PR, please let me know if you are unable to do it for the
others


was (Author: mridulm80):
Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

I did it for this PR

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Assigned] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45374:
---

Assignee: Hasnain Lakhani

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Comment Edited] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan edited comment on SPARK-45374 at 3/18/24 8:05 PM:
--

Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

I did it for this PR


was (Author: mridulm80):
Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






[jira] [Commented] (SPARK-45374) [CORE] Add test keys for SSL functionality

2024-03-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828087#comment-17828087
 ] 

Mridul Muralidharan commented on SPARK-45374:
-

Missed your query, you can link by:
"more' -> link -> Web Link ->
* URL  == pr url
* Link test == "GitHub Pull Request #"

> [CORE] Add test keys for SSL functionality
> --
>
> Key: SPARK-45374
> URL: https://issues.apache.org/jira/browse/SPARK-45374
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>
> Add test SSL keys which will be used for unit and integration tests of the 
> new SSL RPC functionality






Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-03-18 Thread Mridul Muralidharan
Hi Ashish,

  This is something we are still actively working on internally, but it is
unfortunately not yet in a state to share widely.

Regards,
Mridul

On Mon, Mar 11, 2024 at 6:23 PM Ashish Singh  wrote:

> Hi Kalyan,
>
> Is this something you are still interested in pursuing? There are some
> open discussion threads on the doc you shared.
>
> @Mridul Muralidharan  In what state are your efforts
> along this? Is it something that your team is actively pursuing/ building
> or are mostly planning right now? Asking so that we can align efforts on
> this.
>
> On Sun, Feb 18, 2024 at 10:32 PM xiaoping.huang <1754789...@qq.com> wrote:
>
>> Hi all,
>> Any updates on this project? This will be a very useful feature.
>>
>> xiaoping.huang
>> 1754789...@qq.com
>>
>> ---- Replied Message ----
>> From kalyan 
>> Date 02/6/2024 10:08
>> To Jay Han 
>> Cc Ashish Singh ,
>>  Mridul Muralidharan ,
>>  dev ,
>>  
>> 
>> Subject Re: [Spark-Core] Improving Reliability of spark when Executors
>> OOM
>> Hey,
>> Disk space not enough is also a reliability concern, but might need a
>> diff strategy to handle it.
>> As suggested by Mridul, I am working on making things more configurable
>> in another (new) module… with that, we can plug in new rules for each type
>> of error.
>>
>> Regards
>> Kalyan.
>>
>> On Mon, 5 Feb 2024 at 1:10 PM, Jay Han  wrote:
>>
>>> Hi,
>>> what about support for solving the disk space problem of "device
>>> space isn't enough"? I think it's the same as an OOM exception.
>>>
>>> kalyan  wrote on Sat, Jan 27, 2024 at 13:00:
>>>
>>>> Hi all,
>>>>
>>>
>>>> Sorry for the delay in getting the first draft of (my first) SPIP out.
>>>>
>>>> https://docs.google.com/document/d/1hxEPUirf3eYwNfMOmUHpuI5dIt_HJErCdo7_yr9htQc/edit?pli=1
>>>>
>>>> Let me know what you think.
>>>>
>>>> Regards
>>>> kalyan.
>>>>
>>>> On Sat, Jan 20, 2024 at 8:19 AM Ashish Singh  wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Thanks for this discussion, the timing of this couldn't be better!
>>>>>
>>>>> At Pinterest, we recently started to look into reducing OOM failures
>>>>> while also reducing memory consumption of spark applications. We 
>>>>> considered
>>>>> the following options.
>>>>> 1. Changing core count on executor to change memory available per task
>>>>> in the executor.
>>>>> 2. Changing resource profile based on task failures and gc metrics to
>>>>> grow or shrink executor memory size. We do this at application level based
>>>>> on the app's past runs today.
>>>>> 3. K8s vertical pod autoscaler
>>>>> <https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler>
>>>>>
>>>>> Internally, we are mostly getting aligned on option 2. We would love
>>>>> to make this happen and are looking forward to the SPIP.
>>>>>
>>>>>
>>>>> On Wed, Jan 17, 2024 at 9:34 AM Mridul Muralidharan 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>   We are internally exploring adding support for dynamically changing
>>>>>> the resource profile of a stage based on runtime characteristics.
>>>>>> This includes failures due to OOM and the like, slowness due to
>>>>>> excessive GC, resource wastage due to excessive overprovisioning, etc.
>>>>>> Essentially handles scale up and scale down of resources.
>>>>>> Instead of baking these into the scheduler directly (which is already
>>>>>> complex), we are modeling it as a plugin - so that the 'business logic' 
>>>>>> of
>>>>>> how to handle task events and mutate state is pluggable.
>>>>>>
>>>>>> The main limitation I find with mutating only the cores is the limits
>>>>>> it places on what kind of problems can be solved with it - and mutating
>>>>>> resource profiles is a much more natural way to handle this
>>>>>> (spark.task.cpus predates RP).
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>> On Wed, Jan 17, 2024 at 9:18 AM Tom 
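
Background for readers new to the thread: since Spark 3.1, stage-level scheduling lets a job attach a static resource profile to an RDD; the proposals above are about mutating such profiles at runtime. A minimal sketch of the existing static API (the Spark classes are real; the application itself is invented, and running it requires a cluster manager that supports stage-level scheduling):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
import org.apache.spark.sql.SparkSession

object ResourceProfileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rp-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Declare what executors/tasks of one stage need. Today this is fixed
    // up front, which is exactly what the thread proposes to relax.
    val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
    val taskReqs = new TaskResourceRequests().cpus(1)
    val rprof = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

    val doubled = sc.parallelize(1 to 1000, 10)
      .withResources(rprof) // this stage's tasks run under the profile
      .map(_ * 2)
    println(doubled.count())
    spark.stop()
  }
}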

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Mridul Muralidharan
  I am supportive of the proposal - this is a step in the right direction !
Additional metadata (explicit and inferred) for log records, and exposing
them for indexing is extremely useful.

The specifics of the API still need some work IMO and do not need to be
this disruptive, but I consider that orthogonal to this vote itself -
and something we need to iterate upon during PR reviews.

+1

Regards,
Mridul


On Mon, Mar 11, 2024 at 11:09 AM Mich Talebzadeh 
wrote:

> +1
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 11 Mar 2024 at 09:27, Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Mon, 11 Mar 2024 at 18:11, yangjie01 
>> wrote:
>>
>>> +1
>>>
>>>
>>>
>>> Jie Yang
>>>
>>>
>>>
>>> *From:* Haejoon Lee 
>>> *Date:* Monday, March 11, 2024, 17:09
>>> *To:* Gengliang Wang 
>>> *Cc:* dev 
>>> *Subject:* Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Structured Logging Framework for
>>> Apache Spark
>>>
>>>
>>> References:
>>>
>>>- JIRA ticket
>>>
>>> 
>>>- SPIP doc
>>>
>>> 
>>>- Discussion thread
>>>
>>> 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>>
>>> Gengliang Wang
>>>
>>>


[jira] [Updated] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-47146:

Fix Version/s: 3.5.2
   3.4.3

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot of 
> threads, resulting in no threads available on the server. Querying thread 
> details via jstack, there are tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class is initialized to create a single-threaded thread pool
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
> data in the stream is read through.
> {code:java}
> @Override
>
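
The failure mode above can be reproduced in miniature: an object that owns a single-threaded executor strands one "read-ahead" daemon thread for every instance that is never closed. A hypothetical sketch (not Spark's actual classes), showing why guaranteeing close() is the essential part of any fix:

import java.util.concurrent.{Executors, ThreadFactory}

// Stand-in for a stream that owns its own executor, the way
// ReadAheadInputStream does: the pool only dies in close().
class OwningStream extends AutoCloseable {
  private val executor = Executors.newSingleThreadExecutor(new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "read-ahead")
      t.setDaemon(true)
      t
    }
  })

  def readAsync(): Unit = executor.submit(new Runnable {
    override def run(): Unit = () // pretend to fill a read-ahead buffer
  })

  override def close(): Unit = executor.shutdownNow() // frees the thread
}

object LeakDemo {
  def main(args: Array[String]): Unit = {
    val s = new OwningStream()
    try s.readAsync()
    finally s.close() // without this, one "read-ahead" thread leaks per instance
  }
}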

[jira] [Commented] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-05 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823832#comment-17823832
 ] 

Mridul Muralidharan commented on SPARK-47146:
-

Backported to 3.5 and 3.4 in PR: https://github.com/apache/spark/pull/45390

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot of 
> threads, resulting in no threads available on the server. Querying thread 
> details via jstack, there are tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class is initialized to create a single-threaded thread pool
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
>

[jira] [Assigned] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-47146:
---

Assignee: JacobZheng

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot of 
> threads, resulting in no threads available on the server. Querying thread 
> details via jstack, there are tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class is initialized to create a single-threaded thread pool
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
> data in the stream is read through.
> {code:java}
> @Override
> public void loadNext() throws IOException {
>   // Kill the task 

[jira] [Resolved] (SPARK-47146) Possible thread leak when doing sort merge join

2024-03-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-47146.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45327
[https://github.com/apache/spark/pull/45327]

> Possible thread leak when doing sort merge join
> ---
>
> Key: SPARK-47146
> URL: https://issues.apache.org/jira/browse/SPARK-47146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: JacobZheng
>Assignee: JacobZheng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> I have a long-running Spark job. I stumbled upon an executor taking up a lot of 
> threads, resulting in no threads available on the server. Querying thread 
> details via jstack, there are tons of threads named read-ahead. Checking the 
> code confirms that these threads are created by ReadAheadInputStream. This 
> class is initialized to create a single-threaded thread pool
> {code:java}
> private final ExecutorService executorService =
> ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code}
> This thread pool is closed by ReadAheadInputStream#close(). 
> The call stack for the normal case close() method is
> {code:java}
> ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 
> in stage 71.0 (TID 
> 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230
>     @org.apache.spark.io.ReadAheadInputStream.close()
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121)
>         at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87)
>         at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187)
>         at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779)
>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>         at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>         at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>         at org.apache.spark.scheduler.Task.run(Task.scala:139)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829) {code}
> As shown in UnsafeSorterSpillReader#close, the stream is only closed when the 
>

Re: [DISCUSS] SPIP: Structured Spark Logging

2024-03-02 Thread Mridul Muralidharan
Hi Gengling,

  Thanks for sharing this !
I added a few queries to the proposal doc, and we can continue discussing
there, but overall I am in favor of this.

Regards,
Mridul


On Fri, Mar 1, 2024 at 1:35 AM Gengliang Wang  wrote:

> Hi All,
>
> I propose to enhance our logging system by transitioning to structured
> logs. This initiative is designed to tackle the challenges of analyzing
> distributed logs from drivers, workers, and executors by allowing them to
> be queried using a fixed schema. The goal is to improve the informativeness
> and accessibility of logs, making it significantly easier to diagnose
> issues.
>
> Key benefits include:
>
>- Clarity and queryability of distributed log files.
>- Continued support for log4j, allowing users to switch back to
>traditional text logging if preferred.
>
> The improvement will simplify debugging and enhance productivity without
> disrupting existing logging practices. The implementation is estimated to
> take around 3 months.
>
> *SPIP*:
> https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing
> *JIRA*: SPARK-47240 
>
> Your comments and feedback would be greatly appreciated.
>
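
To make the proposal concrete: instead of one free-form message string, each event becomes a record with fixed, queryable fields. A toy sketch; the field names are invented for illustration and are not the SPIP's schema:

case class LogEvent(ts: String, level: String, component: String,
                    executorId: String, msg: String) {
  def toJson: String =
    s"""{"ts":"$ts","level":"$level","component":"$component",""" +
    s""""executorId":"$executorId","msg":"$msg"}"""
}

object StructuredLogSketch {
  def main(args: Array[String]): Unit = {
    val e = LogEvent("2024-03-01T12:00:00Z", "ERROR", "shuffle", "exec-7", "fetch failed")
    // One JSON object per line: any field can be filtered or indexed,
    // unlike a free-form "ERROR shuffle exec-7: fetch failed" string.
    println(e.toJson)
  }
}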


Re: [VOTE] Graduate Apache Celeborn (incubating) as a TLP - Community

2024-03-01 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Fri, Mar 1, 2024 at 4:35 AM Nicholas  wrote:

>
> +1.
>
>
> Regards,
> Nicholas Jiang
>
>
>
>
> --
> Sent from my NetEase Mail mobile client
> 
>
>
> - Original Message -
> From: "Yu Li" 
> To: dev@celeborn.apache.org
> Sent: Fri, 1 Mar 2024 16:52:10 +0800
> Subject: [VOTE] Graduate Apache Celeborn (incubating) as a TLP - Community
>
> Hi All,
>
> After a thorough discussion [1], I'd like to call a formal vote to
> graduate Apache Celeborn (incubating) as a TLP. Below are some facts
> and project highlights carried from [1] as well as the draft
> resolution:
>
> - Currently, our community consists of 19 committers (including
> mentors) from more than 10 companies, with 12 serving as PPMC members.
> - So far, we have boasted 81 contributors.
> - Throughout the incubation period, we've made 6 releases in 16
> months, at a stable pace.
> - We've had 6 different release managers to date.
> - Our software is used in production by 10+ well known entities.
> - As yet, we have opened 1,286 issues with 1,176 successfully resolved.
> - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> merged or closed.
> - Through self-assessment [2], we have met all maturity criteria as
> outlined in [3].
>
> We've resolved all branding issues which include Logo, GitHub repo,
> document, website, and others [4] [5].
>
> --
> Establish the Apache Celeborn Project
>
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to an intermediate data service for big data computing engines
> to boost performance, stability, and flexibility.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
>
> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> for the creation and maintenance of software related to an intermediate
> data service for big data computing engines to boost performance,
> stability, and flexibility; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Celeborn
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Celeborn
> Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Celeborn
> Project:
>
>  * Becket Qin
>  * Cheng Pan 
>  * Duo Zhang 
>  * Ethan Feng
>  * Fu Chen   
>  * Jiashu Xiong  
>  * Kerwin Zhang  
>  * Keyong Zhou   
>  * Lidong Dai
>  * Willem Ning Jiang 
>  * Wu Wei
>  * Yi Zhu
>  * Yu Li 
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Keyong Zhou be appointed to
> the office of Vice President, Apache Celeborn, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
>
> RESOLVED, that the Apache Celeborn Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Celeborn
> podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Celeborn podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
> --
>
> Best Regards,
> Yu
>
> [1] https://lists.apache.org/thread/z17rs0mw4nyv0s112dklmv7s3j053mby
> [2]
> https://cwiki.apache.org/confluence/display/CELEBORN/Apache+Maturity+Model+Assessment+for+Celeborn
> [3]
> https://community.apache.org/apache-way/apache-project-maturity-model.html
> [4] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> [5] https://whimsy.apache.org/pods/project/celeborn
>


Re: [DISCUSS] Graduate Celeborn as TLP

2024-02-28 Thread Mridul Muralidharan
+1
Looking forward to Celeborn as a TLP !

Best wishes to the community :-)

Regards,
Mridul


On Tue, Feb 27, 2024 at 5:23 AM Willem Jiang  wrote:

> Thanks for the clarification. Now we are good to go.
>
> Willem Jiang
>
>
>
> On Tue, Feb 27, 2024 at 7:15 PM Keyong Zhou  wrote:
> >
> > Thanks Willian for the information, as Cheng said, we didn't start the
> > registration process before :)
> >
> > Best,
> > Keyong Zhou
> >
> > > Willem Jiang  wrote on Tue, Feb 27, 2024 at 18:48:
> >
> > > It's OK if we don't register any trademark of Celeborn.
> > > If we already registered the trademark of Celeborn, we need to have
> > > the approval of the trademark VP.
> > >
> > > Willem Jiang
> > >
> > >
> > > On Tue, Feb 27, 2024 at 6:19 PM Cheng Pan  wrote:
> > > >
> > > > Hi Willem,
> > > >
> > > > For trademark concerns, the "Apache Celeborn” gets approval by
> ASF[1],
> > > do we need any additional work?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > > >
> > > > Thanks,
> > > > Cheng Pan
> > > >
> > > >
> > > > > On Feb 27, 2024, at 17:45, Willem Jiang 
> > > wrote:
> > > > >
> > > > > +1, it's good to see Celeborn is ready for graduation.
> > > > >
> > > > > I have a quick question about Celeborn's trademark. Did we start
> the
> > > > > registration process before?
> > > > >
> > > > > BTW  the podling name search is approved by trademark VP [1]
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > > > >
> > > > > Willem Jiang
> > > > >
> > > > > On Tue, Feb 27, 2024 at 9:40 AM Yu Li  wrote:
> > > > >>
> > > > >> Dear Celeborn Devs,
> > > > >>
> > > > >> We, the Celeborn community, began our incubation journey on
> October
> > > > >> 18, 2022. Since then, with the continuous efforts of you all, our
> > > > >> community has steadily developed and gradually matured,
> approaching
> > > > >> the graduation criteria [1]. Therefore, I'd like to call a
> discussion
> > > > >> to graduate Celeborn as TLP. Below are some statistics I
> collected,
> > > > >> please check it and let me know your thoughts.
> > > > >>
> > > > >> - Currently, our community consists of 19 committers (including
> > > > >> mentors) from more than 10 companies, with 12 serving as PPMC
> members
> > > > >> [2].
> > > > >> - So far, we have boasted 81 contributors.
> > > > >> - Throughout the incubation period, we've made 6 releases [3] in
> 16
> > > > >> months, at a stable pace.
> > > > >> - We've had 6 different release managers to date.
> > > > >> - Our software is used in production by 10+ well known entities
> [4].
> > > > >> - As yet, we have opened 1,286 issues with 1,176 successfully
> > > resolved [5].
> > > > >> - We have submitted a total of 1,816 PRs, out of which 1,805 have
> been
> > > > >> merged or closed [6].
> > > > >> - Through self-assessment [7], we have met all maturity criteria
> as
> > > > >> outlined in [1].
> > > > >>
> > > > >> And below is the drafted graduation resolution, JFYI:
> > > > >> --
> > > > >> Establish the Apache Celeborn Project
> > > > >>
> > > > >> WHEREAS, the Board of Directors deems it to be in the best
> interests
> > > of
> > > > >> the Foundation and consistent with the Foundation's purpose to
> > > establish
> > > > >> a Project Management Committee charged with the creation and
> > > maintenance
> > > > >> of open-source software, for distribution at no charge to the
> public,
> > > > >> related to an intermediate data service for big data computing
> engines
> > > > >> to boost performance, stability, and flexibility.
> > > > >>
> > > > >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee
> > > > >> (PMC), to be known as the "Apache Celeborn Project", be and
> hereby is
> > > > >> established pursuant to Bylaws of the Foundation; and be it
> further
> > > > >>
> > > > >> RESOLVED, that the Apache Celeborn Project be and hereby is
> > > responsible
> > > > >> for the creation and maintenance of software related to an
> > > intermediate
> > > > >> data service for big data computing engines to boost performance,
> > > > >> stability, and flexibility; and be it further
> > > > >>
> > > > >> RESOLVED, that the office of "Vice President, Apache Celeborn" be
> and
> > > > >> hereby is created, the person holding such office to serve at the
> > > > >> direction of the Board of Directors as the chair of the Apache
> > > Celeborn
> > > > >> Project, and to have primary responsibility for management of the
> > > > >> projects within the scope of responsibility of the Apache Celeborn
> > > > >> Project; and be it further
> > > > >>
> > > > >> RESOLVED, that the persons listed immediately below be and hereby
> are
> > > > >> appointed to serve as the initial members of the Apache Celeborn
> > > > >> Project:
> > > > >>
> > > > >> * Becket Qin
> > > > >> * Cheng Pan 
> > > > >> * Duo Zhang 
> > > > >> * Ethan Feng
> > > > >> * Fu Chen   
> > > > >> * Jiashu Xiong  
> > > > 

Re: [ANNONCE] New PPMC member: Fu Chen

2024-02-19 Thread Mridul Muralidharan
Congratulations !

Regards,
Mridul

On Mon, Feb 19, 2024 at 7:46 PM Cheng Pan  wrote:

> Congrats!
>
> Thanks,
> Cheng Pan
>
>
> > On Feb 20, 2024, at 08:22, Nicholas  wrote:
> >
> > Congratulations to Fu Chen!
> > Regards,
> > Nicholas Jiang
> >
> >
> >
> >
> > At 2024-02-20 00:23:06, "Shaoyun Chen"  wrote:
> >> Congratulations!
> >>
> >> Keyong Zhou  wrote on Mon, Feb 19, 2024 at 21:16:
> >>>
> >>> Hi Celeborn Community,
> >>>
> >>> The Podling Project Management Committee (PPMC) for Apache Celeborn
> >>> has invited Fu Chen to become our PPMC member and
> >>> we are pleased to announce that he has accepted.
> >>>
> >>> Fu Chen has been actively contributing to the Celeborn community for more
> >>> than one year[1], including SBT build,
> >>> performance improvement, code refactor, bug fixes, code reviews, design
> >>> discussion, docs, etc.
> >>>
> >>> Please join me in congratulating Fu Chen!
> >>>
> >>> Being a committer enables easier contribution to the
> >>> project since there is no need to go via the patch
> >>> submission process. This should enable better productivity.
> >>> A PPMC member helps manage and guide the direction of the project.
> >>>
> >>> [1]
> https://github.com/apache/incubator-celeborn/commits?author=cfmcgrady
> >>>
> >>> Thanks,
> >>> On behalf of the Apache Celeborn PPMC
>
>


Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
Hi,

  I am fine with either actually - though more used to jira personally :-)
(github issues have a nice integration with PRs which has been useful
though).
The main reason why I asked is what Nicholas clarified - I saw a
nontrivial number of github issue-related mails, and was not sure if we
were moving to using that !

Thanks,
Mridul


On Wed, Feb 7, 2024 at 12:52 AM Keyong Zhou  wrote:

> Hi Mridul,
>
> Thanks for asking. In fact at the time when donating Celeborn to ASF
> incubator we had a discussion whether to use JIRA or
> Github for issue tracking and we decided to choose JIRA at last. Seems
> different projects have different preferences. Maybe
> newer projects tend to use Github.
>
> To me, I'm actually fine with both. JIRA works well so far, will using
> Github be more beneficial? Glad to hear about your opinion.
>
> Thanks,
> Keyong Zhou
>
> Mridul Muralidharan  wrote on Wed, Feb 7, 2024 at 14:03:
>
> >   Looks like I am wrong, github issues can be used [1].
> > Is Celeborn planning to use github issues going forward ?
> >
> > Regards,
> > Mridul
> >
> >
> > [1] https://www.apache.org/dev/#issues
> >
> >
> > On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
> > wrote:
> >
> > > Hi,
> > >
> > >   I received a fairly large number of emails to
> > > incubator-celeb...@noreply.github.com, which typically are for PR's.
> > > They appear to be github issues - are we trying to move to github
> issues
> > > instead of Apache jira ? IIRC there is a policy to use jira for
> tracking
> > > bugs/improvements, right ?
> > >
> > > Regards,
> > > Mridul
> > >
> >
>


Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
  Looks like I am wrong, github issues can be used [1].
Is Celeborn planning to use github issues going forward ?

Regards,
Mridul


[1] https://www.apache.org/dev/#issues


On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
wrote:

> Hi,
>
>   I received a fairly large number of emails to
> incubator-celeb...@noreply.github.com, which typically are for PR's.
> They appear to be github issues - are we trying to move to github issues
> instead of Apache jira ? IIRC there is a policy to use jira for tracking
> bugs/improvements, right ?
>
> Regards,
> Mridul
>


Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
Hi,

  I received a fairly large number of emails to
incubator-celeb...@noreply.github.com, which typically are for PR's.
They appear to be github issues - are we trying to move to github issues
instead of Apache jira ? IIRC there is a policy to use jira for tracking
bugs/improvements, right ?

Regards,
Mridul


[jira] [Resolved] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46512.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44512
[https://github.com/apache/spark/pull/44512]

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct api to construct the shuffle which both sort and 
> combine is used. But I can do like following code, here is a wordcount, and 
> the output words is sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}






[jira] [Assigned] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46512:
---

Assignee: Chenyu Zheng

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
>
> After the shuffle reader obtains the blocks, it first performs a combine 
> operation and then a sort operation. Both combine and sort may generate 
> temporary files, so performance may be poor when both are used. In fact, 
> the combine can be performed during the sort process, which avoids the 
> combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort 
> and combine are used, but it can be done as in the following code: a word 
> count whose output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46733) Simplify the ContextCleaner|BlockManager by the exit operation only depend on interrupt thread

2024-01-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46733:
---

Assignee: Jiaan Geng

> Simplify the ContextCleaner|BlockManager by the exit operation only depend on 
> interrupt thread
> --
>
> Key: SPARK-46733
> URL: https://issues.apache.org/jira/browse/SPARK-46733
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46733) Simplify the ContextCleaner|BlockManager by the exit operation only depend on interrupt thread

2024-01-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46733.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44732
[https://github.com/apache/spark/pull/44732]

> Simplify the ContextCleaner|BlockManager by the exit operation only depend on 
> interrupt thread
> --
>
> Key: SPARK-46733
> URL: https://issues.apache.org/jira/browse/SPARK-46733
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808080#comment-17808080
 ] 

Mridul Muralidharan commented on SPARK-46623:
-

Issue resolved by pull request 44616
https://github.com/apache/spark/pull/44616

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46623) Replace SimpleDateFormat with DateTimeFormatter

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46623.
-
Fix Version/s: 4.0.0
 Assignee: Jiaan Geng
   Resolution: Fixed

> Replace SimpleDateFormat with DateTimeFormatter
> ---
>
> Key: SPARK-46623
> URL: https://issues.apache.org/jira/browse/SPARK-46623
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46696.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44705
[https://github.com/apache/spark/pull/44705]

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
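
The evaluation-order problem described above can be reproduced outside
Spark. A minimal sketch (class and member names are hypothetical, not the
actual ResourceProfileManager code): a method invoked during construction,
before the vals it reads are initialized, observes their JVM defaults
(false for Boolean, null for references):

```
class ProfileManager {
  // Runs during construction, BEFORE `dynamicEnabled` below is assigned,
  // so isSupported() reads the JVM default `false`, not the intended `true`.
  val supported: Boolean = isSupported()
  val dynamicEnabled: Boolean = true

  private def isSupported(): Boolean = dynamicEnabled
}

object InitOrderDemo extends App {
  println(new ProfileManager().supported) // prints false - the "abnormal" evaluation
}
```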



[jira] [Assigned] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46696:
---

Assignee: liangyongyuan

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
>
> As the title suggests, in *ResourceProfileManager*, function calls should be 
> made after variable declarations. When determining *isSupport*, all variables 
> are uninitialized, with booleans defaulting to false and objects to null. 
> While the end result is correct, the evaluation process is abnormal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Mridul Muralidharan
Hi,

  We are internally exploring adding support for dynamically changing the
resource profile of a stage based on runtime characteristics.
This includes failures due to OOM and the like, slowness due to excessive
GC, resource wastage due to excessive overprovisioning, etc.
Essentially, it handles scale-up and scale-down of resources.
Instead of baking these into the scheduler directly (which is already
complex), we are modeling it as a plugin - so that the 'business logic' of
how to handle task events and mutate state is pluggable.

The main limitation I find with mutating only the cores is the limit it
places on what kinds of problems can be solved with it - and mutating
resource profiles is a much more natural way to handle this
(spark.task.cpus predates RP).

Regards,
Mridul

On Wed, Jan 17, 2024 at 9:18 AM Tom Graves 
wrote:

> It is interesting. I think there are definitely some discussion points
> around this. Reliability vs. performance is always a trade-off, and it's
> great that it doesn't fail, but if it no longer meets someone's SLA, that
> could be just as bad if it's hard to figure out why. I think if something
> like this kicks in, it needs to be very obvious to the user so they can see
> that it occurred. Do you have something in place in the UI or elsewhere that
> indicates this? The nice thing is also that you aren't wasting memory by
> increasing it for all tasks when maybe you only need it for one or two. The
> downside is you are only finding out after failure.
>
> I do also worry a little bit that in your blog post, the error you pointed
> out isn't a Java OOM but an off-heap memory issue (overhead + heap usage).
> You don't really address heap memory vs. off-heap in that article. The only
> thing I see mentioned is spark.executor.memory, which is heap memory.
> Obviously, adjusting to run only one task is going to give that task more
> overall memory, but the reasons it's running out in the first place could be
> different. If it were on-heap memory, for instance, with more tasks I would
> expect to see more GC and not executor OOM. If you are getting executor
> OOM you are likely using more off-heap memory/stack space, etc., than you
> allocated. Ultimately it would be nice to know why that is happening and
> see if we can address it to not fail in the first place. That could be
> extremely difficult though, especially if using software outside Spark that
> is using that memory.
>
> As Holden said,  we need to make sure this would play nice with the
> resource profiles, or potentially if we can use the resource profile
> functionality.  Theoretically you could extend this to try to get new
> executor if using dynamic allocation for instance.
>
> I agree doing a SPIP would be a good place to start to have more
> discussions.
>
> Tom
>
> On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan <
> justfors...@gmail.com> wrote:
>
>
> Hello All,
>
> At Uber, we recently did some work on improving the reliability of
> Spark applications in scenarios where fatter executors go out of memory,
> leading to application failure. Fatter executors are those that have more
> than one task running on them concurrently at a given time. This has
> significantly improved the reliability of many Spark applications for us at
> Uber. We recently wrote a blog post about this. Link:
> https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
>
> At a high level, we have done the below changes:
>
>1. When a task fails with an executor OOM, we update the core
>requirements of the task to the max executor cores.
>2. When the task is picked for rescheduling, the new attempt of the
>task is placed on an executor where no other task can run concurrently;
>all cores get allocated to this task itself.
>3. This way we ensure that the configured memory is completely at the
>disposal of a single task, thus eliminating memory contention.
>
> The best part of this solution is that it's reactive. It kicks in only
> when the executors fail with the OOM exception.
>
> We understand that the problem statement is very common and we expect our
> solution to be effective in many cases.
>
> There could be more cases that can be covered. An executor failing with OOM
> is a hard signal. The framework (making the driver aware of
> what's happening with the executor) can be extended to handle scenarios of
> other forms of memory pressure, like excessive spilling to disk, etc.
>
> While we had developed this on Spark 2.4.3 in-house, we would like to
> collaborate and contribute this work to the latest versions of Spark.
>
> What is the best way forward here? Will an SPIP proposal to detail the
> changes help?
>
> Regards,
> Kalyan.
> Uber India.
>
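
A minimal driver-side sketch of the reactive path described in the numbered
steps above. The core-resizing hook is hypothetical: stock Spark exposes no
public API for mutating a task's core requirement, which is what the
described patch adds internally; only the listener plumbing below is real
Spark API:

```
import org.apache.spark.ExceptionFailure
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class OomRescheduleListener(maxExecutorCores: Int) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case e: ExceptionFailure if e.description.contains("OutOfMemoryError") =>
        // Hypothetical hook (not a real Spark API): ask the scheduler to
        // give the retry of this task all executor cores so it runs alone.
        // resizeTaskCores(taskEnd.taskInfo.taskId, maxExecutorCores)
        println(s"task ${taskEnd.taskInfo.taskId} failed with OOM; resize requested")
      case _ => () // successes and other failure reasons: nothing to do
    }
  }
}
```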


[jira] [Assigned] (SPARK-46399) Add exit status to the Application End event for the use of Spark Listener

2023-12-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46399:
---

Assignee: Reza Safi

> Add exit status to the Application End event for the use of Spark Listener
> --
>
> Key: SPARK-46399
> URL: https://issues.apache.org/jira/browse/SPARK-46399
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
>  Labels: pull-request-available
>
> Currently SparkListenerApplicationEnd only has a timestamp value and there is 
> no exit status recorded with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
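
To make the gap concrete, a minimal listener sketch: before this change,
SparkListenerApplicationEnd carries only the end timestamp. The exit-code
accessor sketched in the comment is assumed from the issue title, not
verified against the merged patch:

```
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

class AppEndListener extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    println(s"application ended at ${end.time}") // the timestamp is all we get pre-4.0.0
    // Post-SPARK-46399 (assumed shape): end.exitCode.foreach(c => println(s"exit=$c"))
  }
}
```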



[jira] [Resolved] (SPARK-46399) Add exit status to the Application End event for the use of Spark Listener

2023-12-20 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46399.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44340
[https://github.com/apache/spark/pull/44340]

> Add exit status to the Application End event for the use of Spark Listener
> --
>
> Key: SPARK-46399
> URL: https://issues.apache.org/jira/browse/SPARK-46399
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently SparkListenerApplicationEnd only has a timestamp value and there is 
> no exit status recorded with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.2-incubating-rc0

2023-12-19 Thread Mridul Muralidharan
+1

Signatures, digests, license, etc check out fine.
Checked out tag and build/tested with -Pspark3.1 and -Pflink-1.17

Regards,
Mridul


On Tue, Dec 19, 2023 at 8:06 PM rexxiong  wrote:

> +1 (binding)
> I checked
> - Download links are valid.
> - git commit hash is correct
> - Checksums and signatures are valid.
> - No binary files in the source release
> - Files have the word incubating in their name.
> - DISCLAIMER,LICENSE and NOTICE files exist.
> - Successfully built the binary from the source on macOS with command:
> ./build/make-distribution.sh --release
>
> Thanks,
> Jiashu Xiong
>
> > Fu Chen  wrote on Wed, Dec 20, 2023 at 00:18:
>
> > +1
> >
> > I checked
> > - download links are valid.
> > - git commit hash is correct.
> > - no binary files in the source release.
> > - signatures are good.
> > ```
> > gpg --import KEYS
> > gpg --verify apache-celeborn-0.3.2-incubating-source.tgz.asc
> > gpg --verify apache-celeborn-0.3.2-incubating-bin.tgz.asc
> > ```
> > - checksums are good.
> > ```
> > sha512sum --check apache-celeborn-0.3.2-incubating-source.tgz.sha512
> > sha512sum --check apache-celeborn-0.3.2-incubating-bin.tgz.sha512
> > ```
> > - build success from source code (Pop!_OS 22.04 LTS).
> > ```
> > ./build/mvn clean package -DskipTests -Pspark-3.4
> > ```
> >
> > > Shaoyun Chen  wrote on Tue, Dec 19, 2023 at 22:45:
> >
> > > +1 (non-binding)
> > >
> > > I checked the following things:
> > >
> > > - signatures are good.
> > > ```
> > > gpg --import KEYS
> > > gpg --verify apache-celeborn-0.3.2-incubating-source.tgz.asc
> > > gpg --verify apache-celeborn-0.3.2-incubating-bin.tgz.asc
> > > ```
> > > - checksums are good.
> > > ```
> > > sha512sum --check apache-celeborn-0.3.2-incubating-source.tgz.sha512
> > > sha512sum --check apache-celeborn-0.3.2-incubating-bin.tgz.sha512
> > > ```
> > > - build success from source code.
> > > ```
> > > ./build/make-distribution.sh -Pspark-3.2
> > > ./build/make-distribution.sh --release
> > > ```
> > >
> > > Yihe Li  wrote on Tue, Dec 19, 2023 at 20:56:
> > > >
> > > > +1 (non-binding)
> > > > I checked the following things:
> > > > - git commit hash is correct.
> > > > - download links are valid.
> > > > - release files are in correct location.
> > > > - release files have the word incubating in their name.
> > > > - signatures and checksums are good.
> > > > - DISCLAIMER, LICENSE and NOTICE files exist.
> > > > - build success from source code (Ubuntu 16.04).
> > > > ```
> > > > ./build/make-distribution.sh --release
> > > > ./build/make-distribution.sh -Pspark-3.3
> > > > ```
> > > >
> > > > Thanks,
> > > > Yihe Li
> > > >
> > > > On 2023/12/19 06:37:40 Ethan Feng wrote:
> > > > > +1(binding)
> > > > >
> > > > > I checked:
> > > > > √ release files in the correct location
> > > > > √ release files have the word incubating
> > > > > √ digital signature and hashes correct
> > > > > √ DISCLAIMER file exist
> > > > > √ LICENSE and NOTICE files exist and are correct
> > > > > √ the contents of the release match the tag in VCS
> > > > > √ can build the release from the source
> > > > > √ maven artifacts looks correct
> > > > >
> > > > > Thanks,
> > > > > Ethan Feng
> > > > >
> > > > > Nicholas Jiang  wrote on Tue, Dec 19, 2023 at 12:32:
> > > > > >
> > > > > > Hi Celeborn community,
> > > > > >
> > > > > >
> > > > > > This is a call for a vote to release Apache Celeborn (Incubating)
> > > > > > 0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The git tag to be voted upon:
> > > > > >
> > >
> >
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The git commit hash:
> > > > > > d43411b22adf24679c27004a08e813ab278eaaa3 source and binary
> > artifacts
> > > can be
> > > > > > found at:
> > > > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The staging repo:
> > > > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapacheceleborn-1041
> > > > > >
> > > > > >
> > > > > > Fingerprint of the PGP key release artifacts are signed with:
> > > > > > D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> > > > > >
> > > > > >
> > > > > > My public key to verify signatures can be found in:
> > > > > >
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> > > > > >
> > > > > >
> > > > > > The vote will be open for at least 72 hours or until the
> necessary
> > > > > > number of votes are reached.
> > > > > >
> > > > > >
> > > > > > Please vote accordingly:
> > > > > >
> > > > > >
> > > > > > [ ] +1 approve
> > > > > > [ ] +0 no opinion
> > > > > > [ ] -1 disapprove (and the reason)
> > > > > >
> > > > > >
> > > > > > Checklist for release:
> > > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> > > > > >
> > > > > >
> > > > > > Steps to validate the release:
> > > > > > https://www.apache.org/info/verification.html
> > > > > >
> > > > > >
> > > > > > * Download links, checksums and PGP 

[jira] [Resolved] (SPARK-46132) [CORE] Support key password for JKS keys for RPC SSL

2023-12-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46132.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44264
[https://github.com/apache/spark/pull/44264]

> [CORE] Support key password for JKS keys for RPC SSL
> 
>
> Key: SPARK-46132
> URL: https://issues.apache.org/jira/browse/SPARK-46132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> See thread at 
> https://github.com/apache/spark/pull/43998#discussion_r1406993411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46132) [CORE] Support key password for JKS keys for RPC SSL

2023-12-12 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46132:
---

Assignee: Hasnain Lakhani

> [CORE] Support key password for JKS keys for RPC SSL
> 
>
> Key: SPARK-46132
> URL: https://issues.apache.org/jira/browse/SPARK-46132
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> See thread at 
> https://github.com/apache/spark/pull/43998#discussion_r1406993411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-11 Thread Mridul Muralidharan
I am seeing a bunch of Python-related (43) failures in the sql module (for
example [1]) ... I am currently on Python 3.11.6, Java 8.
Not sure if Ubuntu modified anything out from under me; thoughts?

I am currently testing this against an older branch to make sure it is not
an issue with my desktop.

Regards,
Mridul


[1]


org.apache.spark.sql.IntegratedUDFTestUtils.shouldTestGroupedAggPandasUDFs
was false (QueryCompilationErrorsSuite.scala:112)
Traceback (most recent call last):
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 458, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 73, in dumps
cp.dump(obj)
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 602, in dump
return Pickler.dump(self, obj)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 692, in reducer_override
return self._function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 565, in _function_reduce
return self._dynamic_function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 546, in _dynamic_function_reduce
state = _function_getstate(func)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 157, in _function_getstate
f_globals_ref = _extract_code_globals(func.__code__)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in _extract_code_globals
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in <dictcomp>
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
 ~^^^
IndexError: tuple index out of range
Traceback (most recent call last):
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 458, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 73, in dumps
cp.dump(obj)
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 602, in dump
return Pickler.dump(self, obj)
   ^^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 692, in reducer_override
return self._function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 565, in _function_reduce
return self._dynamic_function_reduce(obj)
   ^^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 546, in _dynamic_function_reduce
state = _function_getstate(func)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle_fast.py",
line 157, in _function_getstate
f_globals_ref = _extract_code_globals(func.__code__)

  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in _extract_code_globals
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
^
  File
"/home/mridul/work/apache/vote/spark/python/pyspark/cloudpickle/cloudpickle.py",
line 334, in <dictcomp>
out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
 ~^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/home/mridul/work/apache/vote/spark/python/pyspark/serializers.py",
line 468, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index
out of range
- UNSUPPORTED_FEATURE: Using Python UDF with unsupported join condition ***
FAILED ***



On Sun, Dec 10, 2023 at 9:05 PM L. C. Hsieh  wrote:

> +1
>
> On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
> >
> > +1(non-binding
> >
> > Kent Yao
> >
> > > Yuming Wang  wrote on Mon, Dec 11, 2023 at 09:33:
> > >
> > > +1
> > >
> > > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun 
> wrote:
> > >>
> 

Re: [DISCUSS] Time for 0.3.2

2023-12-06 Thread Mridul Muralidharan
+1 on 0.3.2, thanks Nicholas !

Regards,
Mridul


On Thu, Dec 7, 2023 at 12:51 AM Cheng Pan  wrote:

> +1, thanks for volunteering.
>
> Feel free to ping me if you encounter permission issues during the release
> phase.
>
> Thanks,
> Cheng Pan
>
>
> > On Dec 7, 2023, at 14:31, Nicholas  wrote:
> >
> > Hey, Celeborn community,
> >
> > It has been a while since the 0.3.1 release, and there are some critical
> fixes land branch-0.3, for example, [CELEBORN-1037] Incorrect output for
> metrics of Prometheus. From my perspective, it’s time to prepare for
> releasing 0.3.2.
> >
> > WDYT? And I’m volunteering to be the release manager if no one has
> applied.
> >
> > Regards,
> > Nicholas Jiang
>
>
>


[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-46058:

Labels: pull-request-available  (was: pull-request-available releasenotes)

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-46058:

Labels: pull-request-available releasenotes  (was: pull-request-available)

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46058.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43998
[https://github.com/apache/spark/pull/43998]

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46058) [CORE] Add separate flag for privateKeyPassword

2023-12-06 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46058:
---

Assignee: Hasnain Lakhani

> [CORE] Add separate flag for privateKeyPassword
> ---
>
> Key: SPARK-46058
> URL: https://issues.apache.org/jira/browse/SPARK-46058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Right now with config inheritance we support:
>  * JKS with password A, PEM with password B
>  * JKS with no password, PEM with password A
>  * JKS and PEM with no password
>  
> But we do not support the case where JKS has a password and PEM does not. If 
> we set keyPassword we will attempt to use it, and cannot set 
> `spark.ssl.rpc.keyPassword` to null. So let's make it a separate flag as the 
> easiest workaround.
>  
> This was noticed while migrating some existing deployments to the RPC SSL 
> support where we use openssl support for RPC and use a key with no password



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Apache Spark 3.3.4 EOL Release?

2023-12-04 Thread Mridul Muralidharan
+1

Regards,
Mridul

On Mon, Dec 4, 2023 at 11:40 AM L. C. Hsieh  wrote:

> +1
>
> Thanks Dongjoon!
>
> On Mon, Dec 4, 2023 at 9:26 AM Yang Jie  wrote:
> >
> > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> >
> > Jie Yang
> >
> > On 2023/12/04 15:08:25 Tom Graves wrote:
> > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> > > Tom
> > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> > >
> > >  Hi, All.
> > >
> > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
> branch-3.3 has been maintained and served well until now.
> > >
> > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on Jun
> 9th, 2022)
> > > - https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
> (vote result on June 14th, 2022)
> > >
> > > As of today, branch-3.3 has 56 additional patches after v3.3.3 (tagged
> on Aug 3rd about 4 month ago) and reaches the end-of-life this month
> according to the Apache Spark release cadence,
> https://spark.apache.org/versioning-policy.html .
> > >
> > > $ git log --oneline v3.3.3..HEAD | wc -l
> > > 56
> > >
> > > Along with the recent Apache Spark 3.4.2 release, I hope the users can
> get a chance to have these last bits of Apache Spark 3.3.x, and I'd like to
> propose to have Apache Spark 3.3.4 EOL Release vote on December 11th and
> volunteer as the release manager.
> > >
> > > WDTY?
> > >
> > > Please let us know if you need more patches on branch-3.3.
> > >
> > > Thanks,
> > > Dongjoon.
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.4.2 (RC1)

2023-11-29 Thread Mridul Muralidharan
+1

Signatures, digests, etc check out fine.
Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes

Regards,
Mridul

On Wed, Nov 29, 2023 at 5:08 AM Yang Jie  wrote:

> +1(non-binding)
>
> Jie Yang
>
> On 2023/11/29 02:08:04 Kent Yao wrote:
> > +1(non-binding)
> >
> > Kent Yao
> >
> > On 2023/11/27 01:12:53 Dongjoon Hyun wrote:
> > > Hi, Marc.
> > >
> > > Given that it exists in 3.4.0 and 3.4.1, I don't think it's a release
> > > blocker for Apache Spark 3.4.2.
> > >
> > > When the patch is ready, we can consider it for 3.4.3.
> > >
> > > In addition, note that we categorized release-blocker-level issues by
> > > marking 'Blocker' priority with `Target Version` before the vote.
> > >
> > > Best,
> > > Dongjoon.
> > >
> > >
> > > On Sat, Nov 25, 2023 at 12:01 PM Marc Le Bihan 
> wrote:
> > >
> > > > -1, if you can wait until the last remaining problem with generics
> > > > (?) is entirely solved, which causes this exception to be thrown:
> > > >
> > > > java.lang.ClassCastException: class [Ljava.lang.Object; cannot be
> cast to class [Ljava.lang.reflect.TypeVariable; ([Ljava.lang.Object; and
> [Ljava.lang.reflect.TypeVariable; are in module java.base of loader
> 'bootstrap')
> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:116)
> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
> > > > at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:929)
> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
> > > > at
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
> > > > at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
> > > > at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
> > > > at org.apache.spark.sql.Encoders.bean(Encoders.scala)
> > > >
> > > >
> > > > https://issues.apache.org/jira/browse/SPARK-45311
> > > >
> > > > Thanks !
> > > >
> > > > Marc Le Bihan
> > > >
> > > >
> > > > On 25/11/2023 11:48, Dongjoon Hyun wrote:
> > > >
> > > > Please vote on releasing the following candidate as Apache Spark
> version
> > > > 3.4.2.
> > > >
> > > > The vote is open until November 30th 1AM (PST) and passes if a
> majority +1
> > > > PMC votes are cast, with a minimum of 3 +1 votes.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 3.4.2
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> https://spark.apache.org/
> > > >
> > > > The tag to be voted on is v3.4.2-rc1 (commit
> > > > 0c0e7d4087c64efca259b4fb656b8be643be5686)
> > > > https://github.com/apache/spark/tree/v3.4.2-rc1
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-bin/
> > > >
> > > > Signatures used for Spark RCs can be found in this file:
> > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> https://repository.apache.org/content/repositories/orgapachespark-1450/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > https://dist.apache.org/repos/dist/dev/spark/v3.4.2-rc1-docs/
> > > >
> > > > The list of bug fixes going into 3.4.2 can be found at the following
> URL:
> > > > https://issues.apache.org/jira/projects/SPARK/versions/12353368
> > > >
> > > > This release is using the release script of the tag v3.4.2-rc1.
> > > >
> > > > FAQ
> > > >
> > > > =
> > > > How can I help test this release?
> > > > =
> > > >
> > > > If you are a Spark user, you can help us test this release by taking
> > > > an existing Spark workload and running on this release candidate,
> then
> > > > reporting any regressions.
> > > >
> > > > If you're working in PySpark you can set up a virtual env and install
> > > > the current RC and see if anything important breaks, in the
> Java/Scala
> > > > you can add the staging repository to your projects resolvers and
> test
> > > > with the RC (make sure to clean up the artifact cache before/after so
> > > > you don't end up building with a out of date RC going forward).
> > > >
> > > > ===
> > > > What should happen to JIRA tickets still targeting 3.4.2?
> > > > ===
> > > >
> > > > The current list of open tickets targeted at 3.4.2 can be found at:
> > > > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > > > Version/s" = 3.4.2
> > > >
> > > > Committers should look at those and triage. Extremely important bug
> > > > fixes, documentation, and API tweaks that impact 

Re: [VOTE] SPIP: Testing Framework for Spark UI Javascript files

2023-11-24 Thread Mridul Muralidharan
+1

Regards,
Mridul

On Fri, Nov 24, 2023 at 8:21 AM Kent Yao  wrote:

> Hi Spark Dev,
>
> Following the discussion [1], I'd like to start the vote for the SPIP [2].
>
> The SPIP aims to improve the test coverage and develop experience for
> Spark UI-related javascript codes.
>
> This thread will be open for at least the next 72 hours.  Please vote
> accordingly,
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
> Kent Yao
>
> [1] https://lists.apache.org/thread/5rqrho4ldgmqlc173y2229pfll5sgkff
> [2]
> https://docs.google.com/document/d/1hWl5Q2CNNOjN5Ubyoa28XmpJtDyD9BtGtiEG2TT94rg/edit?usp=sharing
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Add Yihe Li as new committer

2023-11-22 Thread Mridul Muralidharan
Congratulations Yihe Li !

Regards,
Mridul

On Wed, Nov 22, 2023 at 2:08 AM Yu Li  wrote:

> Congratulations, Yihe!
>
> Best Regards,
> Yu
>
>
> On Fri, 17 Nov 2023 at 15:32, Shaoyun Chen  wrote:
>
> > Congrats!
> >
> > Keyong Zhou  wrote on Thu, Nov 16, 2023 at 20:25:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Podling Project Management Committee (PPMC) for Apache Celeborn
> > > has invited Yihe Li to become a committer and we are pleased
> > > to announce that he has accepted.
> > >
> > > Being a committer enables easier contribution to the
> > > project since there is no need to go via the patch
> > > submission process. This should enable better productivity.
> > > A (P)PMC member helps manage and guide the direction of the project.
> > >
> > > Please join me in congratulating Yihe Li!
> > >
> > > Thanks,
> > > Keyong Zhou
> >
>


Re: [DISCUSS] SPIP: Testing Framework for Spark UI Javascript files

2023-11-21 Thread Mridul Muralidharan
This should be a very good addition !

Regards,
Mridul

On Tue, Nov 21, 2023 at 7:46 PM Dongjoon Hyun 
wrote:

> Thank you for proposing a new UI test framework for Apache Spark 4.0.
>
> It looks very useful.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Nov 21, 2023 at 1:51 AM Kent Yao  wrote:
>
>> Hi Spark Dev,
>>
>> This is a call to discuss a new SPIP: Testing Framework for
>> Spark UI Javascript files [1]. The SPIP aims to improve the test
>> coverage and develop experience for Spark UI-related javascript
>> codes.
>> Jest [2], a JavaScript testing framework licensed under MIT, will
>> be used to build this dev and test-only module.
>> There is also a W.I.P. pull request [3] to show what it would be like.
>>
>> This thread will be open for at least the next 72 hours. Suggestions
>> are welcome.If there is no veto found, I will close this thread after
>> 2023-11-24 18:00(+08:00) and raise a new thread for voting.
>>
>> Thanks,
>> Kent Yao
>>
>> [1]
>> https://docs.google.com/document/d/1hWl5Q2CNNOjN5Ubyoa28XmpJtDyD9BtGtiEG2TT94rg/edit?usp=sharing
>> [2] https://github.com/jestjs/jest
>> [3] https://github.com/apache/spark/pull/43903
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


[jira] [Assigned] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-16 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45762:
---

Assignee: Alessandro Bellina

> Shuffle managers defined in user jars are not available for some launch modes
> -
>
> Key: SPARK-45762
> URL: https://issues.apache.org/jira/browse/SPARK-45762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Starting a spark job in standalone mode with a custom `ShuffleManager` 
> provided in a jar via `--jars` does not work. This can also be experienced in 
> local-cluster mode.
> The approach that works consistently is to copy the jar containing the custom 
> `ShuffleManager` to a specific location in each node then add it to 
> `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we 
> would like to move away from setting extra configurations unnecessarily.
> Example:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> This yields `java.lang.ClassNotFoundException` in the executors.
> {code:java}
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.examples.TestShuffleManager
>   at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
>   at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:467)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
>   at 
> org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   ... 4 more
> {code}
> We can change our command to use `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --conf spark.driver.extraClassPath=user-code.jar \
>  --conf spark.executor.extraClassPath=user-code.jar
> {code}
> Success after adding the jar to `extraClassPath`:
> {code:java}
> 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created 
> connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
> 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
> 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
> {code}
> We would like to change startup order such that the original command 
> succeeds, without specifying `extraClassPath`:
> {code:java}

[jira] [Resolved] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-16 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45762.
-
Resolution: Fixed

Issue resolved by pull request 43627
[https://github.com/apache/spark/pull/43627]

> Shuffle managers defined in user jars are not available for some launch modes
> -
>
> Key: SPARK-45762
> URL: https://issues.apache.org/jira/browse/SPARK-45762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Starting a spark job in standalone mode with a custom `ShuffleManager` 
> provided in a jar via `--jars` does not work. This can also be experienced in 
> local-cluster mode.
> The approach that works consistently is to copy the jar containing the custom 
> `ShuffleManager` to a specific location in each node then add it to 
> `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we 
> would like to move away from setting extra configurations unnecessarily.
> Example:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> This yields `java.lang.ClassNotFoundException` in the executors.
> {code:java}
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.examples.TestShuffleManager
>   at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
>   at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:467)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
>   at 
> org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   ... 4 more
> {code}
> We can change our command to use `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --conf spark.driver.extraClassPath=user-code.jar \
>  --conf spark.executor.extraClassPath=user-code.jar
> {code}
> Success after adding the jar to `extraClassPath`:
> {code:java}
> 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created 
> connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
> 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
> 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
> {code}
> We would like to change startup order such that the original command 
> 
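
A minimal sketch of the direction the fix takes (not the actual patch):
resolve the configured class through a classloader that already contains
the `--jars` entries, rather than the executor's application classloader.
Here `userJarLoader` and the SparkConf-argument constructor convention are
assumptions for illustration:

```
import org.apache.spark.SparkConf

def instantiateShuffleManager(conf: SparkConf, userJarLoader: ClassLoader): AnyRef = {
  val className = conf.get("spark.shuffle.manager")
  // Resolve through the loader that has user-code.jar registered, so
  // org.apache.spark.examples.TestShuffleManager is visible without
  // spark.{driver,executor}.extraClassPath.
  val clazz = Class.forName(className, /* initialize = */ true, userJarLoader)
  clazz.getConstructor(classOf[SparkConf]).newInstance(conf).asInstanceOf[AnyRef]
}
```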

[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786077#comment-17786077
 ] 

Mridul Muralidharan edited comment on SPARK-30602 at 11/14/23 8:50 PM:
---

[~MasterDDT], currently there are no active plans to add support for it.


was (Author: mridulm80):
@MasterDDT, currently there are no active plans to add support for it.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
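
A back-of-envelope illustration of the block-size trend described above
(the numbers are illustrative, not from the ticket): with blocks =
mappers x reducers, doubling the shuffled data doubles mappers and
reducers, quadruples the block count, and therefore halves the average
block size:

```
val dataGB = Seq(100, 200, 400)              // shuffled data, growing 2x each step
for (d <- dataGB) {
  val mappers  = d                           // grows linearly with data
  val reducers = d                           // grows linearly with data
  val blocks   = mappers.toLong * reducers   // grows quadratically
  val avgBlockKB = d.toLong * 1024 * 1024 / blocks
  println(s"data=${d}GB blocks=$blocks avgBlockKB=$avgBlockKB")
}
// data=100GB blocks=10000  avgBlockKB=10485
// data=200GB blocks=40000  avgBlockKB=5242
// data=400GB blocks=160000 avgBlockKB=2621
```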



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786077#comment-17786077
 ] 

Mridul Muralidharan commented on SPARK-30602:
-

@MasterDDT, currently there are no active plans to add support for it.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]
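
To make the quadratic growth described above concrete, here is a
back-of-the-envelope sketch in Scala (the data size and mapper/reducer counts
are illustrative assumptions, not numbers from this ticket):

-- snip --
object ShuffleBlockMath {
  def main(args: Array[String]): Unit = {
    val shuffleBytes = 1L << 40                   // assume 1 TiB of shuffled data
    val mappers      = 5000                       // assume # mappers grows linearly with data
    val reducers     = 5000                       // likewise for # reducers
    val blocks       = mappers.toLong * reducers  // # blocks = # mappers * # reducers
    val avgBlockSize = shuffleBytes / blocks
    println(s"blocks = $blocks, avg block size = ${avgBlockSize / 1024} KiB")
    // prints: blocks = 25000000, avg block size = 42 KiB - i.e. "10s of KBs"
  }
}
-- snip --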



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Mridul Muralidharan
+1

Regards,
Mridul

On Tue, Nov 14, 2023 at 12:45 PM Holden Karau  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>
>> +1
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>>
>> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>>
>>> +1
>>>
>>> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
>>> >
>>> > +1
>>> >
>>> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
>>> > >
>>> > > +1(Non-binding)
>>> > >
>>> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
>>> wrote:
>>> > >>
>>> > >> Hi all,
>>> > >>
>>> > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator
>>> for
>>> > >> Apache Spark.
>>> > >>
>>> > >> The proposal is to develop an official Java-based Kubernetes
>>> operator
>>> > >> for Apache Spark to automate the deployment and simplify the
>>> lifecycle
>>> > >> management and orchestration of Spark applications and Spark
>>> clusters
>>> > >> on k8s at prod scale.
>>> > >>
>>> > >> This aims to reduce the learning curve and operation overhead for
>>> > >> Spark users so they can concentrate on core Spark logic.
>>> > >>
>>> > >> Please also refer to:
>>> > >>
>>> > >>- Discussion thread:
>>> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>>> > >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>>> > >>- SPIP doc:
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>> > >>
>>> > >>
>>> > >> Please vote on the SPIP for the next 72 hours:
>>> > >>
>>> > >> [ ] +1: Accept the proposal as an official SPIP
>>> > >> [ ] +0
>>> > >> [ ] -1: I don’t think this is a good idea because …
>>> > >>
>>> > >>
>>> > >> Thank you!
>>> > >>
>>> > >> Liang-Chi Hsieh
>>> > >>
>>> > >>
>>> -
>>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>
>>> > >
>>> > >
>>> > > --
>>> > >
>>> > > Zhou, Ye  周晔
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


[jira] [Commented] (SPARK-44937) Add SSL/TLS support for RPC and Shuffle communications

2023-11-14 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786076#comment-17786076
 ] 

Mridul Muralidharan commented on SPARK-44937:
-

Thanks [~dongjoon] ! Missed out on that :-)

> Add SSL/TLS support for RPC and Shuffle communications
> --
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: releasenotes
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373
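
For illustration, a minimal sketch of turning the feature on from application
code. The key names below are assumptions following Spark's
spark.ssl.<namespace> convention; consult the security documentation linked
above for the authoritative list, and the paths and passwords are placeholders:

-- snip --
import org.apache.spark.SparkConf

object SslRpcConfSketch {
  def sslRpcConf(): SparkConf = new SparkConf()
    .set("spark.ssl.rpc.enabled", "true")
    .set("spark.ssl.rpc.keyStore", "/path/to/keystore.jks")         // placeholder path
    .set("spark.ssl.rpc.keyStorePassword", "keystore-password")     // placeholder secret
    .set("spark.ssl.rpc.trustStore", "/path/to/truststore.jks")     // placeholder path
    .set("spark.ssl.rpc.trustStorePassword", "truststore-password") // placeholder secret
}
-- snip --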



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45911) [CORE] Make TLS1.3 the default for RPC SSL support

2023-11-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45911:
---

Assignee: Hasnain Lakhani

> [CORE] Make TLS1.3 the default for RPC SSL support
> --
>
> Key: SPARK-45911
> URL: https://issues.apache.org/jira/browse/SPARK-45911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Make TLS1.3 the default to improve security



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45911) [CORE] Make TLS1.3 the default for RPC SSL support

2023-11-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45911.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43803
[https://github.com/apache/spark/pull/43803]

> [CORE] Make TLS1.3 the default for RPC SSL support
> --
>
> Key: SPARK-45911
> URL: https://issues.apache.org/jira/browse/SPARK-45911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make TLS1.3 the default to improve security



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44937) [umbrella] Add SSL/TLS support for RPC and Shuffle communications

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-44937:

Epic Name: ssl/tls for rpc and shuffle

> [umbrella] Add SSL/TLS support for RPC and Shuffle communications
> -
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44937) [umbrella] Add SSL/TLS support for RPC and Shuffle communications

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-44937.
-
Fix Version/s: (was: 3.4.2)
   (was: 3.5.1)
   (was: 3.3.4)
   Resolution: Fixed

The SSL feature itself landed in 4.0 - so marking the feature as available for 
4.0 and removing 3.4.2, 3.5.1, and 3.3.4 from the fix versions (some of the 
linked PRs were merged to older versions - and unfortunately this umbrella 
JIRA got reused for them).


> [umbrella] Add SSL/TLS support for RPC and Shuffle communications
> -
>
> Key: SPARK-44937
> URL: https://issues.apache.org/jira/browse/SPARK-44937
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Security, Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for SSL/TLS based communication for Spark RPCs and block 
> transfers - providing an alternative to the existing encryption / 
> authentication implementation documented at 
> [https://spark.apache.org/docs/latest/security.html#spark-rpc-communication-protocol-between-spark-processes]
> This is a superset of the functionality discussed in 
> https://issues.apache.org/jira/browse/SPARK-6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45431) [DOCS] Document SSL RPC feature

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45431.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43240
[https://github.com/apache/spark/pull/43240]

> [DOCS] Document SSL RPC feature
> ---
>
> Key: SPARK-45431
> URL: https://issues.apache.org/jira/browse/SPARK-45431
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add documentation for users



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45431) [DOCS] Document SSL RPC feature

2023-11-07 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45431:
---

Assignee: Hasnain Lakhani

> [DOCS] Document SSL RPC feature
> ---
>
> Key: SPARK-45431
> URL: https://issues.apache.org/jira/browse/SPARK-45431
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Add documentation for users



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-11-03 Thread Mridul Muralidharan
Yes, the DAGScheduler deals with this at the stage level - each individual
RDD's DeterministicLevel is taken into account when determining the
stage's level.

Regards,
Mridul
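
To illustrate the rule described above - a toy model only, not the actual
Spark implementation (which lives in RDD.getOutputDeterministicLevel and the
DAGScheduler):

-- snip --
object DeterminismSketch {
  object Level extends Enumeration {
    // Declared in increasing order of "severity".
    val DETERMINATE, UNORDERED, INDETERMINATE = Value
  }

  case class Node(own: Level.Value, parents: Seq[Node] = Nil) {
    // The worst level among this node and all of its ancestors wins,
    // so an INDETERMINATE upstream taints everything downstream.
    def effective: Level.Value =
      (own +: parents.map(_.effective)).maxBy(_.id)
  }

  def main(args: Array[String]): Unit = {
    val upstream   = Node(Level.INDETERMINATE)
    val downstream = Node(Level.DETERMINATE, Seq(upstream))
    println(downstream.effective)  // prints INDETERMINATE
  }
}
-- snip --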


On Fri, Nov 3, 2023 at 9:45 AM Keyong Zhou  wrote:

> I checked RDD#getOutputDeterministicLevel and found that if an RDD's
> upstream is INDETERMINATE, then it is also INDETERMINATE.
>
> Thanks,
> Keyong Zhou
>
> Keyong Zhou  于2023年11月3日周五 19:57写道:
>
> > Hi Mridul,
> >
> > I still have a question. DAGScheduler#submitMissingTasks will
> > only unregisterAllMapAndMergeOutput
> > if the current ShuffleMapStage is Indeterminate. What if the current
> stage
> > is determinate, but its
> > upstream stage is Indeterminate, and its upstream stage is rerun?
> >
> > Thanks,
> > Keyong Zhou
> >
> > Mridul Muralidharan  于2023年10月20日周五 11:15写道:
> >
> >> To add my response - what I described (w.r.t failing job) applies only
> to
> >> ResultStage.
> >> It walks the lineage DAG to identify all indeterminate parents to
> >> rollback.
> >> If there are only ShuffleMapStages in the set of stages to rollback, it
> >> will simply discard their output, rollback all of them, and then retry
> >> these stages (same shuffle-id, a new stage attempt)
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
> >> wrote:
> >>
> >> >
> >> > Good question, and ResultStage is actually special cased in Spark as its
> >> > output could have already been consumed (for example collect() to driver,
> >> > etc) - and so if it is one of the stages which needs to be rolled back, the
> >> > job is aborted.
> >> >
> >> > To illustrate, see the following:
> >> > -- snip --
> >> >
> >> > package org.apache.spark
> >> >
> >> > import scala.reflect.ClassTag
> >> >
> >> > import org.apache.spark._
> >> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
> >> >
> >> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
> >> >
> >> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >> >     delegate.compute(split, context)
> >> >   }
> >> >
> >> >   override protected def getPartitions: Array[Partition] =
> >> >     delegate.partitions
> >> > }
> >> >
> >> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
> >> >     DeterministicLevel.INDETERMINATE
> >> > }
> >> >
> >> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >> >     val tc = TaskContext.get
> >> >     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
> >> >       // Wait for all tasks to be done, then call exit
> >> >       Thread.sleep(5000)
> >> >       System.exit(-1)
> >> >     }
> >> >     delegate.compute(split, context)
> >> >   }
> >> > }
> >> >
> >> > // Make sure test_output directory is deleted before running this.
> >> > object Test {
> >> >
> >> >   def main(args: Array[String]): Unit = {
> >> >     val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
> >> >     val sc = new SparkContext(conf)
> >> >
> >> >     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
> >> >     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> >> >     resultRdd.saveAsTextFile("test_output")
> >> >   }
> >> > }
> >> >
> >> > -- snip --
> >> >
> >> >
> >> >
> >> > Here, the mapper stage has been forced to be INDETERMINATE.
> >> > In the reducer stage, the first attempt to compute partition 0 will
> >> > wait for a bit and then exit - since the master is a local-cluster, this
> >

[jira] [Assigned] (SPARK-45730) [CORE] Make ReloadingX509TrustManagerSuite less flaky

2023-11-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45730:
---

Assignee: Hasnain Lakhani

> [CORE] Make ReloadingX509TrustManagerSuite less flaky
> -
>
> Key: SPARK-45730
> URL: https://issues.apache.org/jira/browse/SPARK-45730
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45730) [CORE] Make ReloadingX509TrustManagerSuite less flaky

2023-11-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45730.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43596
[https://github.com/apache/spark/pull/43596]

> [CORE] Make ReloadingX509TrustManagerSuite less flaky
> -
>
> Key: SPARK-45730
> URL: https://issues.apache.org/jira/browse/SPARK-45730
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45544) [CORE] Integrate SSL support into TransportContext

2023-10-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45544.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43541
[https://github.com/apache/spark/pull/43541]

> [CORE] Integrate SSL support into TransportContext
> --
>
> Key: SPARK-45544
> URL: https://issues.apache.org/jira/browse/SPARK-45544
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Integrate the SSL support into TransportContext so that Spark can use RPC SSL 
> support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45544) [CORE] Integrate SSL support into TransportContext

2023-10-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45544:
---

Assignee: Hasnain Lakhani

> [CORE] Integrate SSL support into TransportContext
> --
>
> Key: SPARK-45544
> URL: https://issues.apache.org/jira/browse/SPARK-45544
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> Integrate the SSL support into TransportContext so that Spark can use RPC SSL 
> support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45545) [CORE] Pass SSLOptions wherever we create a SparkTransportConf

2023-10-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45545:
---

Assignee: Hasnain Lakhani

> [CORE] Pass SSLOptions wherever we create a SparkTransportConf
> --
>
> Key: SPARK-45545
> URL: https://issues.apache.org/jira/browse/SPARK-45545
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> This ensures that we can properly inherit SSL options and use them 
> everywhere, along with tests to ensure things work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45545) [CORE] Pass SSLOptions wherever we create a SparkTransportConf

2023-10-25 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45545.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43387
[https://github.com/apache/spark/pull/43387]

> [CORE] Pass SSLOptions wherever we create a SparkTransportConf
> --
>
> Key: SPARK-45545
> URL: https://issues.apache.org/jira/browse/SPARK-45545
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This ensures that we can properly inherit SSL options and use them 
> everywhere, along with tests to ensure things work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45541) [CORE] Add SSLFactory

2023-10-22 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45541.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43386
[https://github.com/apache/spark/pull/43386]

> [CORE] Add SSLFactory
> -
>
> Key: SPARK-45541
> URL: https://issues.apache.org/jira/browse/SPARK-45541
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We need to add a factory to support creating SSL engines which will be used 
> to create client/server connections



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
To add my response - what I described (w.r.t failing job) applies only to
ResultStage.
It walks the lineage DAG to identify all indeterminate parents to rollback.
If there are only ShuffleMapStages in the set of stages to rollback, it
will simply discard their output, rollback all of them, and then retry
these stages (same shuffle-id, a new stage attempt)


Regards,
Mridul



On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
wrote:

>
> Good question, and ResultStage is actually special cased in Spark as its
> output could have already been consumed (for example collect() to driver,
> etc) - and so if it is one of the stages which needs to be rolled back, the
> job is aborted.
>
> To illustrate, see the following:
> -- snip --
>
> package org.apache.spark
>
> import scala.reflect.ClassTag
>
> import org.apache.spark._
> import org.apache.spark.rdd.{DeterministicLevel, RDD}
>
> class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
>
>   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
>     delegate.compute(split, context)
>   }
>
>   override protected def getPartitions: Array[Partition] =
>     delegate.partitions
> }
>
> class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
>   override def getOutputDeterministicLevel: DeterministicLevel.Value =
>     DeterministicLevel.INDETERMINATE
> }
>
> class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
>   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
>     val tc = TaskContext.get
>     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
>       // Wait for all tasks to be done, then call exit
>       Thread.sleep(5000)
>       System.exit(-1)
>     }
>     delegate.compute(split, context)
>   }
> }
>
> // Make sure test_output directory is deleted before running this.
> object Test {
>
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
>     val sc = new SparkContext(conf)
>
>     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
>     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
>     resultRdd.saveAsTextFile("test_output")
>   }
> }
>
> -- snip --
>
>
>
> Here, the mapper stage has been forced to be INDETERMINATE.
> In the reducer stage, the first attempt to compute partition 0 will wait for 
> a bit and then exit - since the master is a local-cluster, this results in 
> FetchFailure when the second attempt of partition 0 tries to fetch shuffle 
> data.
> When Spark tries to regenerate the parent shuffle output, it sees that the 
> parent is INDETERMINATE - and so fails the entire job with the message:
> "
> org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle 
> map stage with indeterminate output was failed and retried. However, Spark 
> cannot rollback the ResultStage 1 to re-process the input data, and has to 
> fail this job. Please eliminate the indeterminacy by checkpointing the RDD 
> before repartition and try again.
> "
>
> This is coming from here 
> <https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
>  - when rolling back stages, if Spark determines that a ResultStage needs to 
> be rolled back due to loss of INDETERMINATE output, it will fail the job.
>
> Hope this clarifies.
> Regards,
> Mridul
>
>
> On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:
>
>> In fact, I'm wondering if Spark will rerun the whole reduce
>> ShuffleMapStage
>> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
>>
>> Keyong Zhou  于2023年10月19日周四 23:00写道:
>>
>> > Thanks Erik for bringing up this question, I'm also curious about the
>> > answer, any feedback is appreciated.
>> >
>> > Thanks,
>> > Keyong Zhou
>> >
>> > Erik fang  于2023年10月19日周四 22:16写道:
>> >
>> >> Mridul,
>> >>
>> >> sure, I totally agree SPARK-25299 is a much better solution, as long
>> as we
>> >> can get it from spark community
>> >> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
>> >> celeborn already has spark-integration code with  [spark] scope)
>> >>
>> >> I also have a question about INDETERMINATE stage recompute, and may
>> need
>> >> your help
>> >> 

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
Good question, and ResultStage is actually special cased in Spark as its
output could have already been consumed (for example collect() to driver,
etc) - and so if it is one of the stages which needs to be rolled back, the
job is aborted.

To illustrate, see the following:
-- snip --

package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark._
import org.apache.spark.rdd.{DeterministicLevel, RDD}

class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {

  override def compute(split: Partition, context: TaskContext): Iterator[E] = {
    delegate.compute(split, context)
  }

  override protected def getPartitions: Array[Partition] =
    delegate.partitions
}

class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
  override def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}

class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
  override def compute(split: Partition, context: TaskContext): Iterator[E] = {
    val tc = TaskContext.get
    if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
      // Wait for all tasks to be done, then call exit
      Thread.sleep(5000)
      System.exit(-1)
    }
    delegate.compute(split, context)
  }
}

// Make sure test_output directory is deleted before running this.
object Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
    val sc = new SparkContext(conf)

    val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
    val resultRdd = new FailingRDD(mapperRdd.groupByKey())
    resultRdd.saveAsTextFile("test_output")
  }
}

-- snip --
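
For reference, one way to run the snippet above, assuming it has been compiled
against Spark into a jar (test.jar is a placeholder name; the master URL is
already set in the code):

-- snip --
$ ./bin/spark-submit --class org.apache.spark.Test test.jar
-- snip --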



Here, the mapper stage has been forced to be INDETERMINATE.
In the reducer stage, the first attempt to compute partition 0 will
wait for a bit and then exit - since the master is a local-cluster,
this results in FetchFailure when the second attempt of partition 0
tries to fetch shuffle data.
When Spark tries to regenerate the parent shuffle output, it sees that the
parent is INDETERMINATE - and so fails the entire job with the message:
"
org.apache.spark.SparkException: Job aborted due to stage failure: A
shuffle map stage with indeterminate output was failed and retried.
However, Spark cannot rollback the ResultStage 1 to re-process the
input data, and has to fail this job. Please eliminate the
indeterminacy by checkpointing the RDD before repartition and try
again.
"

This is coming from here
<https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
- when rolling back stages, if Spark determines that a ResultStage
needs to be rolled back due to loss of INDETERMINATE output, it will
fail the job.

Hope this clarifies.
Regards,
Mridul


On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:

> In fact, I'm wondering if Spark will rerun the whole reduce ShuffleMapStage
> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
>
> Keyong Zhou  于2023年10月19日周四 23:00写道:
>
> > Thanks Erik for bringing up this question, I'm also curious about the
> > answer, any feedback is appreciated.
> >
> > Thanks,
> > Keyong Zhou
> >
> > Erik fang  于2023年10月19日周四 22:16写道:
> >
> >> Mridul,
> >>
> >> sure, I totally agree SPARK-25299 is a much better solution, as long as
> we
> >> can get it from spark community
> >> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
> >> celeborn already has spark-integration code with  [spark] scope)
> >>
> >> I also have a question about INDETERMINATE stage recompute, and may need
> >> your help
> >> The rule for INDETERMINATE ShuffleMapStage rerun is reasonable,
> however, I
> >> don't find related logic for INDETERMINATE ResultStage rerun in
> >> DAGScheduler
> >> If INDETERMINATE ShuffleMapStage got entirely recomputed, the
> >> corresponding ResultStage should be entirely recomputed as well, per my
> >> understanding
> >>
> >> I found https://issues.apache.org/jira/browse/SPARK-25342 to rollback a
> >> ResultStage but it was not merged
> >> Do you know any context or related ticket for INDETERMINATE ResultStage
> >> rerun?
> >>
> >> Thanks in advance!
> >>
> >> Regards,
> >> Erik
> >>
> >> On Tue, Oct 17, 2023 at 4:23 AM Mridul Muralidharan 
> >> wrote:
> >>
> >> >
> >> >
> >> > On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:
> >> >
> >> >> Hi Mridul,
> >>

[jira] [Created] (SPARK-45613) Expose DeterministicLevel as a DeveloperApi

2023-10-19 Thread Mridul Muralidharan (Jira)
Mridul Muralidharan created SPARK-45613:
---

 Summary: Expose DeterministicLevel as a DeveloperApi
 Key: SPARK-45613
 URL: https://issues.apache.org/jira/browse/SPARK-45613
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0, 3.4.0, 4.0.0
Reporter: Mridul Muralidharan


{{RDD.getOutputDeterministicLevel}} is a {{DeveloperApi}} which users can 
override to specify the {{DeterministicLevel}} of the {{RDD}}.
Unfortunately, {{DeterministicLevel}} itself is {{private[spark]}}.

Expose {{DeterministicLevel}} to allow users to use this method.
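
As a sketch of the intended usage once the enum is public (the class below is
hypothetical, mirroring the delegating-RDD pattern from the stage-resubmission
thread elsewhere in this archive):

-- snip --
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Marks its output as indeterminate so the scheduler knows a retry may
// produce different data and recomputes the whole stage on fetch failure.
class MyIndeterminateRDD[T: ClassTag](delegate: RDD[T]) extends RDD[T](delegate) {
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    delegate.compute(split, context)

  override protected def getPartitions: Array[Partition] = delegate.partitions

  override def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}
-- snip --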



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45534.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43371
[https://github.com/apache/spark/pull/43371]

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45534) Use `java.lang.ref.Cleaner` instead of `finalize` for `RemoteBlockPushResolver`

2023-10-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45534:
---

Assignee: Min Zhao

> Use `java.lang.ref.Cleaner` instead of `finalize` for 
> `RemoteBlockPushResolver`
> ---
>
> Key: SPARK-45534
> URL: https://issues.apache.org/jira/browse/SPARK-45534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Min Zhao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45576) [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite

2023-10-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-45576:
---

Assignee: Hasnain Lakhani

> [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
> --
>
> Key: SPARK-45576
> URL: https://issues.apache.org/jira/browse/SPARK-45576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> These were added accidentally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45576) [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite

2023-10-17 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-45576.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43404
[https://github.com/apache/spark/pull/43404]

> [CORE] Remove unnecessary debug logs in ReloadingX509TrustManagerSuite
> --
>
> Key: SPARK-45576
> URL: https://issues.apache.org/jira/browse/SPARK-45576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> These were added accidentally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Mridul Muralidharan
On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:

> Hi Mridul,
>
> For a),
> DAGScheduler uses Stage.isIndeterminate() and RDD.isBarrier()
> <https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1975>
> to decide whether the whole stage needs to be recomputed.
> I think we can pass the same information to Celeborn in
> ShuffleManager.registerShuffle()
> <https://github.com/apache/spark/blob/721ea9bbb2ff77b6d2f575fdca0aeda84990cc3b/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala#L39>,
> since ShuffleDependency contains the RDD object.
> It seems Stage.isIndeterminate() is not accessible from ShuffleDependency, but
> luckily the rdd is used internally:
>
> def isIndeterminate: Boolean = {
>   rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
> }
>
> Relying on the internal implementation is not good, but doable.
> I don't expect the Spark RDD/Stage implementation to change frequently, and we
> can discuss an RDD isIndeterminate API with the Spark community if they
> change it in the future.
>


Only RDD.getOutputDeterministicLevel is publicly exposed,
RDD.outputDeterministicLevel is not and it is private[spark].
While I don't expect changes to this, it is inherently unstable to depend on
it.

Btw, please see the discussion with Sungwoo Park: if Celeborn is
maintaining a reducer-oriented view, you will need to recompute all the
mappers anyway - what you might save is the subset of reducer partitions
which can be skipped if the stage is DETERMINATE.
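
A toy model of the two bookkeeping views being contrasted here - illustrative
types only, not Celeborn's or Spark's actual data structures:

-- snip --
object ShuffleViews {
  // Mapper-oriented: output indexed by (mapId, reduceId). Losing one block
  // requires recomputing only the mapper that produced it.
  final case class MapperOriented(blocks: Map[(Int, Int), Array[Byte]])

  // Reducer-oriented: each reducer partition is one merged blob containing
  // contributions from every mapper. Losing one partition file means every
  // mapper must be rerun to rebuild it.
  final case class ReducerOriented(partitions: Map[Int, Array[Byte]])
}
-- snip --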




>
> For c),
> I also considered a similar solution in Celeborn.
> Celeborn (LifecycleManager) can get the full picture of the remaining shuffle
> data from the previous stage attempt and reuse it in the stage recompute,
> and the whole process will be transparent to Spark/DAGScheduler.
>

Celeborn does not have visibility into this - and this is potentially
subject to invasive changes in Apache Spark as it evolves.
For example, I recently merged a couple of changes which would make this
different in master compared to previous versions.
Until the remote shuffle service SPIP is implemented and these are
abstracted out & made pluggable, it will continue to be quite volatile.

Note that the behavior for 3.5 and older is known - since Spark versions
have been released - it is the behavior in master and future versions of
Spark which is subject to change.
So delivering on SPARK-25299 would future-proof all remote shuffle
implementations.


Regards,
Mridul



>
> From my perspective, leveraging partial stage recompute and
> remaining shuffle data needs a lot of work in Celeborn.
> I prefer to implement a simple whole-stage recompute first, with the interface
> defined with a recomputeAll = true flag, and explore partial stage recompute
> in a separate ticket as a future optimization.
> What do you think?
>
> Regards,
> Erik
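
A minimal sketch of the kind of interface Erik describes above - the trait
name, method, and flag are placeholders, not an actual Celeborn API:

-- snip --
trait StageRecomputeSupport {
  /** Invoked on shuffle fetch failure. With recomputeAll = true the whole
   *  upstream stage is rerun; partial stage recompute is left as a future
   *  optimization. */
  def onFetchFailure(shuffleId: Int, recomputeAll: Boolean = true): Unit
}
-- snip --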
>
>
> On Sat, Oct 14, 2023 at 4:50 PM Mridul Muralidharan 
> wrote:
>
>>
>>
>> On Sat, Oct 14, 2023 at 3:49 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> A reducer oriented view of shuffle, especially without replication,
>>> could indeed be susceptible to this issue you described (a single fetch
>>> failure would require all mappers to need to be recomputed) - note, not
>>> necessarily all reducers to be recomputed though.
>>>
>>> Note that I have not looked much into Celeborn specifically on this
>>> aspect yet, so my comments are *fairly* centric to Spark internals :-)
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Sat, Oct 14, 2023 at 3:36 AM Sungwoo Park  wrote:
>>>
>>>> Hello,
>>>>
>>>> (Sorry for sending the same message again.)
>>>>
>>>> From my understanding, the current implementation of Celeborn makes it
>>>> hard to find out which mapper should be re-executed when a partition cannot
>>>> be read, and we should re-execute all the mappers in the upstream stage. If
>>>> we can find out which mapper/partition should be re-executed, the current
>>>> logic of stage recomputation could be (partially or totally) reused.
>>>>
>>>> Regards,
>>>>
>>>> --- Sungwoo
>>>>
>>>> On Sat, Oct 14, 2023 at 5:24 PM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>   Spark will try to minimize the recomputation cost as much as
>>>>> possible.
>>>>> For example, if parent stage was DETERMINATE, it simply needs to
>>>>> recompute the missin
