Re: [ANNOUNCE] New Celeborn PMC Member: Mridul Muralidharan

2024-09-20 Thread Mridul Muralidharan
Hi,

  Thank you everyone, it has been a pleasure to work with our Apache
Celeborn community !

Looking forward to continuing to engage with, and help contribute to the
project :-)

Regards,
Mridul

On Fri, Sep 20, 2024 at 5:55 AM Yihe Li  wrote:

> Congratulations!
>
> On 2024/09/20 04:08:10 Fei Wang wrote:
> > Congratulations!
> >
> >
> > On Thu, Sep 19, 2024 at 9:07 PM Shaoyun Chen wrote:
> >
> > > Congratulations!
> > >
> > > On Fri, Sep 20, 2024 at 11:25 Nicholas wrote:
> > > >
> > > > Congrats and welcome, Mridul!
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Nicholas Jiang
> > > >
> > > >
> > > >
> > > >
> > > > At 2024-09-20 09:30:19, "rexxiong"  wrote:
> > > > >Hi Celeborn Community,
> > > > >
> > > > >The Project Management Committee (PMC) for Apache Celeborn
> > > > >has invited Mridul Muralidharan to become a PMC member and we are
> > > pleased
> > > > >to announce that he has accepted.
> > > > >
> > > > >A PMC member helps manage and guide the direction of the project.
> > > > > >We look forward to more of his interactions with the community in the
> > > future.
> > > > >
> > > > >Please join me in congratulating Mridul!
> > > > >
> > > > >Best,
> > > > >Jiashu Xiong
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-08-11 Thread Mridul Muralidharan
Looks good to me, thanks Keyong !

Regards,
Mridul

On Sun, Aug 11, 2024 at 9:59 PM Keyong Zhou  wrote:

> Hi community,
>
> The board report is due on August 14th, following is the draft I made, any
> comments
> will be appreciated, thanks!
>
> ## Description:
> The mission of Apache Celeborn is the creation and maintenance of software
> related to an intermediate data service for big data computing engines to
> boost
> performance, stability, and flexibility
>
> ## Project Status:
> Current project status: Ongoing
> Issues for the board: None
>
> ## Membership Data:
> There are currently 22 committers and 14 PMC members in this project.
> The Committer-to-PMC ratio is roughly 3:2.
>
> Community changes, past quarter:
>
> - Nicholas Jiang was added to the PMC on 2024-07-23.
> - Fei Wang was added as committer on 2024-07-23.
>
> ## Project Activity:
> Software development activity:
>
>  - We released 0.5.1 on July 29th.
>  - We released 0.4.2 on July 26th.
>  - We released 0.5.0 on June 24th.
>  - Support for Apache Flink 1.20 is merged.
>  - Support for Apache Tez is under development.
>  - Several CIPs have been discussed and voted on, including Support Flink
> hybrid shuffle, Celeborn CLI, Chaos Testing Framework, etc.
>
> Meetups and Conferences:
>
>  - 4 related talks were given in Apache CoC Asia 2024.
>
> Recent releases:
>
> - 0.5.1 was released on July 29th, 2024.
> - 0.4.2 was released on July 26th, 2024.
> - 0.5.0 was released on June 24th, 2024.
>
> ## Community Health:
> Overall community health is good. Traffic on the dev mailing list increased
> by 73% in the past quarter. We have been performing extensive outreach to
> our users and encouraging them to contribute back to the project. We also
> speak at various conferences to attract more users.
>
> Regards,
> Keyong Zhou
>


Re: [VOTE] CIP-10: Introduce Celeborn Chaos Testing Framework

2024-08-02 Thread Mridul Muralidharan
+1
This will be very useful for hardening Celeborn as we evolve it !

Regards,
Mridul



On Fri, Aug 2, 2024 at 9:37 AM Nicholas Jiang 
wrote:

> Hi all,
>
> Thanks for all the feedback about the CIP-10: Introduce Celeborn Chaos
> Testing Framework[1]. The discussion thread is here [2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or insufficient votes.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> [1]
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> [2] https://lists.apache.org/thread/670qw80wwfflgv3djqg4304xqy9y8l19
>
> Regards,
> Nicholas Jiang


Re: [ANNOUNCE] New Celeborn PMC Member: Nicholas Jiang

2024-07-23 Thread Mridul Muralidharan
Congratulations !

Regards,
Mridul


On Tue, Jul 23, 2024 at 12:37 PM Fei Wang  wrote:

> Congrats!
>
> Regards,
> Fei Wang
>
> On 2024/07/23 10:19:37 Keyong Zhou wrote:
> > Congrats!
> >
> > Regards,
> > Keyong Zhou
> >
> > On Tue, Jul 23, 2024 at 18:09 angers zhu wrote:
> >
> > > Congrats!
> > >
> > >
> > > Thanks
> > > Angerszh
> > >
> > > On Tue, Jul 23, 2024 at 18:01 Cheng Pan wrote:
> > >
> > > > Congrats!
> > > >
> > > > Thanks,
> > > > Cheng Pan
> > > >
> > > > On Tue, Jul 23, 2024 at 5:34 PM rexxiong 
> wrote:
> > > > >
> > > > > Hi Celeborn Community,
> > > > >
> > > > > The Project Management Committee (PMC) for Apache Celeborn
> > > > > has invited Nicholas Jiang to become a PMC member and we are
> pleased
> > > > > to announce that he has accepted.
> > > > >
> > > > > A PMC member helps manage and guide the direction of the project.
> > > > > > We look forward to more of his interactions with the community in the
> > > > future.
> > > > >
> > > > > Please join me in congratulating Nicholas!
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Jiashu Xiong
> > > >
> > >
> >
>


Re: Re: [ANNOUNCE] New Celeborn Committer: Fei Wang

2024-07-23 Thread Mridul Muralidharan
Congratulations !

Regards,
Mridul

On Tue, Jul 23, 2024 at 12:54 AM rexxiong  wrote:

> Congratulations!
>
> Regards,
> Jiashu Xiong
>
> On Tue, Jul 23, 2024 at 13:02 Nicholas Jiang wrote:
>
> > Congratulations!
> >
> > Regards,
> >
> > Nicholas Jiang
> >
> >
> > At 2024-07-23 12:21:19, "Yihe Li" wrote:
> > >Congratulations!
> > >
> > >Regards,
> > >Yihe Li
> > >
> > >On 2024/07/23 04:16:20 Keyong Zhou wrote:
> > >> Congratulations!
> > >>
> > >> Regards,
> > >> Keyong Zhou
> > >>
> > >> On Tue, Jul 23, 2024 at 12:07 angers zhu wrote:
> > >>
> > >> > Congratulations!
> > >> >
> > >> > On Tue, Jul 23, 2024 at 11:15 Shaoyun Chen wrote:
> > >> >
> > >> > > Congratulations!
> > >> > >
> > >> > > On Tue, Jul 23, 2024 at 11:05 Cheng Pan wrote:
> > >> > > >
> > >> > > > Hi Celeborn Community,
> > >> > > >
> > >> > > > The Project Management Committee (PMC) for Apache Celeborn
> > >> > > > has invited Fei Wang to become a committer and we are pleased
> > >> > > > to announce that he has accepted.
> > >> > > >
> > >> > > > Being a committer enables easier contribution to the
> > >> > > > project since there is no need to go via the patch
> > >> > > > submission process. This should enable better productivity.
> > >> > > > A PMC member helps manage and guide the direction of the
> project.
> > >> > > >
> > >> > > > Please join me in congratulating Fei!
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Cheng Pan
> > >> > >
> > >> >
> > >>
> >
>


Re: Question regarding TLS support in Celeborn

2024-07-11 Thread Mridul Muralidharan
Hi,

  Yes, it is supported.
Note that in addition to TLS communications specifically within Celeborn,
the Apache Ratis message exchange (for Raft HA) requires use of grpc - TLS
is not supported with raft rpc type = netty.
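
The constraint above lends itself to a small pre-flight check. The following is a minimal sketch, not part of Celeborn: the helper `check_ratis_tls` is hypothetical, and the property name `celeborn.master.ha.ratis.raft.rpc.type` and its `netty` default are assumptions that should be verified against the Celeborn configuration documentation.

```python
# Hypothetical pre-flight check. The property name and its default value
# are assumptions, not confirmed by this thread.
RATIS_RPC_TYPE_KEY = "celeborn.master.ha.ratis.raft.rpc.type"


def check_ratis_tls(conf: dict, tls_required: bool) -> list:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    # Fall back to the assumed default when the key is absent.
    rpc_type = conf.get(RATIS_RPC_TYPE_KEY, "netty").strip().lower()
    if rpc_type not in ("netty", "grpc"):
        problems.append("unknown Ratis RPC type: " + repr(rpc_type))
    elif tls_required and rpc_type != "grpc":
        # Per the note above, TLS is not supported with the netty RPC type.
        problems.append("TLS for Ratis requires rpc type 'grpc', found " + repr(rpc_type))
    return problems
```

Calling `check_ratis_tls(conf, tls_required=True)` on a parsed properties map would flag a master that expects TLS on the Raft channel while still using the netty RPC type.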

Regards,
Mridul

On Thu, Jul 11, 2024 at 8:57 PM lohit  wrote:

> Hello Celeborn Devs,
>
> We see that in the recent release of 0.5.0 there is now support for TLS.
> Looking at the documentation
> https://celeborn.apache.org/docs/latest/security/ it looks like TLS
> between
> celeborn workers is also supported. Is this accurate? Is there any
> communication between servers in celeborn which is not TLS compliant?
>
> Thank you
> Lohit
>


Re: [DISCUSS] CIP-10: Introduce Celeborn Chaos Testing Framework

2024-07-10 Thread Mridul Muralidharan
Hi,

  This is a great idea - and would go a long way in flushing out bugs and
issues - and improving the overall robustness of Celeborn !
It would also be good to have:
a) Capture a (replay) log of all events which were triggered.
b) Ability to 'replay' the log and deterministically reach the same state.

This will allow us to identify failure cases with the testing framework -
while allowing developers to deterministically reproduce the identified
state.
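
To make (a) and (b) concrete, here is a minimal sketch of the idea in Python. It is an illustration only and not taken from the CIP: `ChaosEvent`, `run_and_record`, `replay`, and the toy state model are all hypothetical names, and a real framework would apply events to actual nodes rather than to a dictionary.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChaosEvent:
    seq: int     # position in the trigger order
    kind: str    # e.g. "node_down", "disk_full" (illustrative names)
    target: str  # node the event applies to


def apply_event(state: dict, event: ChaosEvent) -> None:
    """Toy state transition: track which anomalies were applied to each node."""
    state.setdefault(event.target, []).append(event.kind)


def run_and_record(seed: int, nodes: list, kinds: list, steps: int):
    """Trigger pseudo-random events and return (final_state, replay_log)."""
    rng = random.Random(seed)
    state, log = {}, []
    for seq in range(steps):
        event = ChaosEvent(seq, rng.choice(kinds), rng.choice(nodes))
        apply_event(state, event)
        log.append(json.dumps(asdict(event)))  # one JSON line per event
    return state, log


def replay(log: list) -> dict:
    """Re-apply the logged events in sequence order, with no randomness involved."""
    events = sorted((json.loads(line) for line in log), key=lambda e: e["seq"])
    state = {}
    for fields in events:
        apply_event(state, ChaosEvent(**fields))
    return state
```

Because `replay` depends only on the log, a failure found during a randomized run can be handed to a developer as a log file and reproduced without the original seed or scheduler.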

(Hopefully I did not miss this in the proposal).

Regards,
Mridul


On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang 
wrote:

> Hello community,
>
> It's been a while since the discussion on the Celeborn chaos testing
> framework. The main process of Celeborn chaos testing includes:
>
> 1. Defining a test plan to describe the types of events, the order in
> which events are triggered, and their duration. Event types include node
> anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> 2. The client submits the plan to the scheduler.
> 3. The scheduler sends operations to each node's runner according to the
> plan description.
> 4. The runner is responsible for executing the operations and reporting
> the current status of the node.
> 5. Before triggering an operation, the scheduler deduces the result of
> this event. If it leads to the inability to meet the minimum runnable
> environment for RSS, the event is rejected.
>
> Do you have any thoughts or questions about this chaos testing framework?
> Welcome feedback to further ensure the reliability of Celeborn through
> chaos testing.
>
> Regards,
> Nicholas Jiang
>
> At 2024-07-03 05:20:57, "Nicholas Jiang"  wrote:
> >Hi all,
> >
> >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> Testing Framework[1].
> >
> >A chaos testing framework is designed to simulate unpredictable and
> adverse conditions in distributed systems to validate their robustness and
> resilience. This proposal aims to simulate various anomalies and test the
> stability of Celeborn in distributed environments via chaos testing.
> >
> >Looking forward to everyone's feedback and suggestions. Thank you!
> >
> >[1]
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> >
> >Regards,
> >Nicholas Jiang
>


Re: Jira version update

2024-07-09 Thread Mridul Muralidharan
Thanks Keyong !

Regards,
Mridul

On Thu, Jul 4, 2024 at 9:22 PM Keyong Zhou  wrote:

> Thanks Mridul for pointing this out, I just modified 0.5.0 as released :)
>
> Regards,
> Keyong Zhou
>
> On Fri, Jul 5, 2024 at 03:13 Mridul Muralidharan wrote:
>
> > Hi,
> >
> >   While updating an issue manually, I noticed that 0.5.0 is still
> mentioned
> > as an unreleased version in jira.
> > Given 0.5 release, we should be getting it updated ?
> >
> > Regards,
> > Mridul
> >
>


Jira version update

2024-07-04 Thread Mridul Muralidharan
Hi,

  While updating an issue manually, I noticed that 0.5.0 is still mentioned
as an unreleased version in jira.
Given the 0.5 release, should we get it updated?

Regards,
Mridul


Re: [VOTE] Release Apache Celeborn 0.5.0-rc3

2024-06-24 Thread Mridul Muralidharan
Forgot to update here.

Signatures, digests, etc check out fine.
Checked out the tag and built/tested with "-Pspark3.1"

I keep getting the following error:

- metrics/prometheus *** FAILED ***
  200 did not equal 404 (ApiBaseResourceSuite.scala:90)
- metrics/json *** FAILED ***
  200 did not equal 404 (ApiBaseResourceSuite.scala:96)

In unit-tests.log, I have this [1]

After explicitly ignoring these two tests, I then ran into lint issues with
"web" submodule - and could not get around to debugging the issue (I
installed pnpm - which does not seem to be called out in readme).

Thanks,
Mridul

[1]
24/06/24 01:57:33,545 ERROR [ScalaTest-main-running-DiscoverySuite] MetricsConfig: Error loading configuration file file:/home/mridul/work/apache/vote/celeborn/service/target/celeborn-service_2.12-0.5.0-tests.jar!/metrics-api.properties
java.io.FileNotFoundException: file:/home/mridul/work/apache/vote/celeborn/service/target/celeborn-service_2.12-0.5.0-tests.jar!/metrics-api.properties (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at org.apache.celeborn.common.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:95)
    at org.apache.celeborn.common.metrics.MetricsConfig.initialize(MetricsConfig.scala:50)
    at org.apache.celeborn.common.metrics.MetricsSystem.<init>(MetricsSystem.scala:53)
    at org.apache.celeborn.common.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:197)
    at org.apache.celeborn.service.deploy.master.Master.<init>(Master.scala:66)
    at org.apache.celeborn.service.deploy.master.http.api.ApiMasterResourceSuite.beforeAll(ApiMasterResourceSuite.scala:54)
    at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
    at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
    at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
    at org.apache.celeborn.server.common.http.ApiBaseResourceSuite.run(ApiBaseResourceSuite.scala:23)
    at org.scalatest.Suite.callExecuteOnSuite$1(Suite.scala:1178)
    at org.scalatest.Suite.$anonfun$runNestedSuites$1(Suite.scala:1225)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at org.scalatest.Suite.runNestedSuites(Suite.scala:1223)
    at org.scalatest.Suite.runNestedSuites$(Suite.scala:1156)
    at org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:30)
    at org.scalatest.Suite.run(Suite.scala:)
    at org.scalatest.Suite.run$(Suite.scala:1096)
    at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:30)
    at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
    at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
    at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
    at org.scalatest.tools.Runner$.main(Runner.scala:775)
    at org.scalatest.tools.Runner.main(Runner.scala)


On Mon, Jun 24, 2024 at 2:22 AM Ethan Feng  wrote:

> Thanks for your feedback, I will close this vote
> thread and announce the results soon since 72 hours have passed.
>
>
> Ethan Feng.
>
> On Mon, Jun 24, 2024 at 15:18 kerwin zhang wrote:
> >
> > +1 (binding)
> >
> > I checked
> > - git commit hash is correct.
> > - links are valid.
> > - signatures are good.
> > ```
> > gpg --verify apache-celeborn-0.5.0-bin.tgz.asc
> apache-celeborn-0.5.0-bin.tgz
> > gpg --verify apache-celeborn-0.5.0-source.tgz.asc
> > apache-celeborn-0.5.0-source.tgz
> > ```
> > - checksums are good.
> > ```
> > shasum -a 512 apache-celeborn-0.5.0-bin.tgz
> > shasum -a 512 apache-celeborn-0.5.0-source.tgz
> > ```
> >
> > - build success from source code (macOS).
> >
> > ```
> > ./build/make-distribution.sh -Pspark-3.4
> > ```
> >
> > Thanks,
> > Kerwin Zhang
> >
> > On Mon, Jun 24, 2024 at 14:46 Fu Chen wrote:
> > >
> > > +1
> > >
> > > I checked
> > > - download links are valid.
> > > - git commit hash is correct.
> > > - no binary files in the source release.
> > > - build success from source code (JD

Re: Re: [VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Wed, Jun 12, 2024 at 1:08 AM Shaoyun Chen  wrote:

> +1
>
> On Wed, Jun 12, 2024 at 13:47 Keyong Zhou wrote:
> >
> > +1
> >
> > Thanks for the proposal!
> >
> > Regards,
> > Keyong Zhou
> >
> > On Wed, Jun 12, 2024 at 13:02 Nicholas Jiang wrote:
> >
> > > +1. Looking forward to Celeborn CLI.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Nicholas Jiang
> > >
> > >
> > > At 2024-06-12 12:26:34, "Aravind Patnam"  wrote:
> > > >Hi all,
> > > >
> > > >Sorry, this is the correct link to the Celeborn CLI CIP
> > > ><
> > >
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>
> > > >.
> > > >
> > > >Thanks,
> > > >Aravind
> > > >
> > > >On Tue, Jun 11, 2024 at 9:24 PM Aravind Patnam 
> > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> This is a call to vote to contribute the Celeborn CLI CIP
> > > >> <
> > >
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> >
> > > to
> > > >> Apache Celeborn.
> > > >>
> > > >> Please do vote accordingly:
> > > >> [ ] +1 approve
> > > >> [ ] +0 no opinion
> > > >> [ ] -1 disapprove (and the reason)
> > > >>
> > > >> Thanks once again!!
> > > >>
> > > >> Aravind
> > > >>
> > > >
> > > >
> > > >--
> > > >Aravind K. Patnam
> > >
>


Re: Re: [Discussion] Proposal Management in Celeborn Community

2024-06-11 Thread Mridul Muralidharan
  This sounds great, thanks !
We should definitely archive all proposals which were voted on - and track
them for posterity.

Regards,
Mridul


On Tue, Jun 11, 2024 at 5:04 AM rexxiong  wrote:

> Thank you, everyone. From our discussions, it appears that there is a
> general consensus on centralizing the archiving of CIPs within Confluence
> for efficient management.
> However, opinions diverge on the approach to commenting and discussing
> these CIPs. Xintong has shared valuable insights from existing Apache
> projects, which integrate Confluence usage with email lists for efficient
> tracking of discussions. I believe this approach suits us as well.
>
> However, a problem emerged when someone without a Confluence account tried
> to create a CIP within the Celeborn namespace on Confluence (thanks to
> Aravind for pointing out this problem).
> The issue stems from cwiki.apache.org's current policy that restricts new
> user registrations and permits access solely to Apache committers.
> Consequently, relying on Confluence for CIP drafting proves inconvenient
> for all contributors.
>
> In light of this, after discussions among PMC members, we have adjusted our CIP
> process. Our new plan involves utilizing alternative documentation tools,
> such as Google Docs, for drafting CIPs.
> Subsequently, discussions relevant to these CIPs will take place via our
> mailing lists.
> Finally, the responsibility falls to Celeborn's PMC members and Committers
> to ensure the appropriate archiving of the finalized CIPs within
> Confluence. More details about the CIP process can be found in CIP[1].
>
> [1]
>
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
>
>
> Thanks,
> Jiashu Xiong
>
> On Thu, May 30, 2024 at 13:07 Xintong Song wrote:
>
> > In fact, Confluence does support inline comments.
> >
> > However, AFAIK communities that adopt Confluence-based proposal
> management
> > (e.g., Flink[1] / Paimon[2] / Kafka[3]) usually encourage discussions to
> > happen on the mailing list.
> >
> > IMHO, discussions in mailing lists are easier to track compared to inline
> > comments. People don't need to subscribe to notifications of individual
> > documents in order to receive updates on changes. For people who joined
> the
> > discussion late or revisit the discussion later, the mailing thread also
> > makes it easy to understand how the entire conversation has taken place.
> > Most importantly, discussions are better kept in one place rather than
> > separated in multiple places.
> >
> > Best,
> >
> > Xintong
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/PAIMON/Paimon+Improvement+Proposals
> >
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, May 30, 2024 at 12:33 PM Nicholas  wrote:
> >
> > > Hi Jiashu,
> > >
> > >
> > >
> > >
> > > +1 for me. In my experience with the Flink community, discussion of a
> > > proposal happens on the dev mailing list rather than in Confluence
> > > comments.
> > >
> > >
> > >
> > >
> > > Anyway, a CIP is required to introduce a new feature or major changes.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Nicholas Jiang
> > >
> > >
> > >
> > >
> > > At 2024-05-30 01:29:58, "Mridul Muralidharan" 
> wrote:
> > > >  Inline comments, discussions are invaluable for design docs - this
> is
> > > not
> > > >yet supported in confluence right ?
> > > >Another option would be to iterate and discuss through other means
> (like
> > > >google docs), and before vote, move it to the wiki - so that the
> > community
> > > >is deciding/voting on artifacts which are on the wiki.
> > > >This would also help in case proposals do not end up making it to the
> > vote
> > > >stage, but go through brainstorming/discussion - and evolve into
> > something
> > > >new (or get merged with others).
> > > >
> > > >Regards,
> > > >Mridul
> > > >
> > > >
> > > >On Wed, May 29, 2024 at 10:42 AM Keyong Zhou 
> wrote:
> > > >
> > > >> +1 for m

Re: [DISCUSS] Celeborn CLI Proposal

2024-06-10 Thread Mridul Muralidharan
Hi,

  Looks good to me as well, I had reviewed this proposal internally already
:-)

Regards,
Mridul


On Fri, Jun 7, 2024 at 11:32 PM Keyong Zhou  wrote:

> Hi Aravind,
>
> Thanks for the proposal! The proposal LGTM, I think it's very valuable.
>
> Regards,
> Keyong Zhou
>
> On Fri, Jun 7, 2024 at 12:47 Aravind Patnam wrote:
>
> > Hi,
> >
> > Thanks Nicholas for the comments!
> >
> > I now got access to put the proposal in Confluence in the form of CIP,
> here
> > <
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI
> > >
> > it is.
> >
> > Regarding your questions:
> >
> > > 1. From a user's perspective, the CLI is more used for some maintenance
> > operations such as online and offline of server, rescaling of cluster
> etc,
> > not only based on the REST API. What CLI interfaces are there that the
> REST
> > API doesn’t have for maintenance?
> > This is highly dependent on what the user is leveraging to manage their
> > cluster. For example, in k8s, you would be using k8s APIs to achieve
> this.
> > We can probably add a generic interface API for it that provides basic
> > operations that users can implement themselves for their cluster
> management
> > logic based on what cluster managers they are using. Although, I think
> this
> > will likely be a later evolution of the CLI, once basic REST API
> operations
> > are implemented in the CLI. WDYT?
> >
> > > 2. There are same sub-commands between MASTER and WORKER. Why not these
> > sub-commands belong to BOTH?
> > Agreed - this was a formatting mistake. I fixed it now, thanks for
> pointing
> > that out.
> >
> > > 3. Does the implementation of CLI invoke the REST API? IMO, the CLI
> works
> > well no matter the server is alive.
> > Yes, I agree. I think for this we would have to talk to the cluster
> > manager, similar to my response to #1. We would have to query the
> specific
> > cluster manager to get details if the Celeborn servers are dead, since
> the
> > Celeborn REST API would not work then. We can add a generic API that
> users
> > can implement based on their own environment.
> >
> > Thanks,
> > Aravind
> >
> >
> >
> > On Wed, Jun 5, 2024 at 10:43 PM Nicholas Jiang  >
> > wrote:
> >
> > > Hi Aravind,
> > >
> > > Thanks for driving this CIP about Celeborn CLI. I have some comments on
> > > this CIP:
> > >
> > > 1. From a user's perspective, the CLI is more used for some maintenance
> > > operations such as online and offline of server, rescaling of cluster
> > etc,
> > > not only based on the REST API. What CLI interfaces are there that the
> > REST
> > > API doesn’t have for maintenance?
> > >
> > > 2. Some sub-commands are identical between MASTER and WORKER. Why don't
> > > these sub-commands belong to BOTH?
> > >
> > > 3. Does the CLI implementation invoke the REST API? IMO, the CLI should
> > > work well regardless of whether the server is alive.
> > >
> > > BTW, could this design doc of proposal follow the template of CIP[1]?
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> > >
> > > Regards,
> > > Nicholas Jiang
> > >
> > > On 2024/06/05 23:33:02 Aravind Patnam wrote:
> > > > Hi all,
> > > >
> > > > I have written up a proposal about introducing a CLI for Celeborn.
> You
> > > can
> > > > find the proposal
> > > > <
> > >
> >
> https://docs.google.com/document/d/1j9wKFSR_ychYDF0NU5YN67WCCtNAgYTbN5CN8V3SOnk/edit?usp=sharing
> > > >
> > > > here.
> > > > Please let me know if you have any comments or questions.
> > > >
> > > > TL;DR: introducing a CLI would complement the existing dashboard
> > and
> > > > would benefit us internally. We rely on CLI tools internally a lot
> for
> > > > automation and other operations.
> > > >
> > > > FYI, I was not able to access the cwiki page to put this proposal
> > there,
> > > > there seems to be some permissions issue. Hope it is okay to just
> share
> > > as
> > > > a google doc here for now.
> > > >
> > > > --
> > > > Aravind K. Patnam
> > > >
> > > >  Apache Celeborn CLI Proposal
> > > > <
> > >
> >
> https://docs.google.com/document/d/1j9wKFSR_ychYDF0NU5YN67WCCtNAgYTbN5CN8V3SOnk/edit?usp=drive_web
> > > >
> > > >
> > >
> >
> >
> > --
> > Aravind K. Patnam
> >
>


Re: [Discussion] Proposal Management in Celeborn Community

2024-05-29 Thread Mridul Muralidharan
  Inline comments, discussions are invaluable for design docs - this is not
yet supported in confluence right ?
Another option would be to iterate and discuss through other means (like
google docs), and before vote, move it to the wiki - so that the community
is deciding/voting on artifacts which are on the wiki.
This would also help in case proposals do not end up making it to the vote
stage, but go through brainstorming/discussion - and evolve into something
new (or get merged with others).

Regards,
Mridul


On Wed, May 29, 2024 at 10:42 AM Keyong Zhou  wrote:

> +1 for me.
>
> Regarding Cheng's comment, IMHO discussing on the mailing list is also acceptable
> (and even better)
>
> Regards,
> Keyong Zhou
>
> On Wed, May 29, 2024 at 14:32 Cheng Pan wrote:
>
> > +1 for archiving proposals on confluence.
> >
> > Does Confluence support inline comments like Google Docs does? I think
> > it’s a convincing functionality for the discussion period.
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On May 29, 2024, at 11:19, rexxiong  wrote:
> > >
> > > Hello, Celeborn community,
> > >
> > > In the past, when Celeborn introduced new major features or significant
> > changes, we typically used Google Docs to launch proposals. However, a
> > major issue with Google Docs is the difficulty in centrally managing
> these
> > proposals. Therefore, after referring to other communities and based on
> > discussions with several PMCs offline, it appears that Apache Confluence
> > could be a viable alternative for our needs. With that in mind, I would
> > like to invite all of you to share your thoughts, experiences, and
> > preferences regarding the use of Apache Confluence versus Google Docs for
> > our proposal management. Your feedback will be invaluable in helping us
> > make an informed decision that best meets the needs of our community.
> > >
> > > Meanwhile, I have archived previous proposals and written the Celeborn
> > Improvement Proposal (CIP) process on Confluence.
> > >
> > > What do you think? Looking forward to your thoughts on this proposal.
> > >
> > >
> > > Thanks,
> > > Jiashu Xiong
> >
> >
>


Re: [VOTE] Release Apache Celeborn 0.4.1-rc1

2024-05-20 Thread Mridul Muralidharan
+1

Signatures, digests, etc check out fine.
Checked out the tag and built/tested with "-Pspark3.1"
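
For readers who want to script the digest part of this check, a minimal sketch follows. It covers only the SHA-512 comparison (signature verification still goes through `gpg --verify`); the helper names `sha512_of` and `digest_matches` are hypothetical, and the expected value is assumed to be the bare hex digest, possibly upper-case or split by whitespace.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large release tarballs are not loaded at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path: str, expected: str) -> bool:
    """Compare against a published hex digest, ignoring case and whitespace.

    Assumes `expected` holds only the hex digest; strip any filename that a
    .sha512 file may carry before calling this.
    """
    expected_hex = "".join(expected.split()).lower()
    return sha512_of(path) == expected_hex
```

Usage would look like `digest_matches("apache-celeborn-0.4.1-source.tgz", published_hex)`, where `published_hex` is read from the corresponding `.sha512` file.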

Regards,
Mridul


On Sun, May 19, 2024 at 10:19 PM rexxiong  wrote:

> +1 (binding)
> I checked
> - Download links are valid.
> - git commit hash is correct
> - Checksums and signatures are valid.
> - No binary files in the source release
> - Successfully built the binary from the source on macOS with the command:
> ./build/make-distribution.sh -Pspark-3.3
>
> I also tested compatibility with version 0.4.0 by upgrading the
> master/worker from 0.4.0 to 0.4.1. Using a 0.4.0 client to access the 0.4.1
> master/worker, everything worked well.
>
> Thanks,
> Jiashu Xiong
>
> On Fri, May 17, 2024 at 18:47 Yihe Li wrote:
>
> > +1 (non-binding)
> > I checked the following things:
> > - git commit hash is correct.
> > - download links are valid.
> > - release files are in correct location.
> > - signatures and checksums are good.
> > - LICENSE and NOTICE files exist.
> > - build success from source code(ubuntu 16.04).
> > ```
> > ./build/make-distribution.sh --sbt-enabled -Pspark-3.3
> > ```
> >
> > Thanks,
> > Yihe Li
> >
> > On 2024/05/17 01:53:48 angers zhu wrote:
> > > +1
> > >
> > > - Checked license
> > > - checked doc
> > > - checked build from source with spark-32
> > >
> > > On Tue, May 14, 2024 at 12:13 Nicholas Jiang wrote:
> > >
> > > > Hi Celeborn community,
> > > >
> > > > This is a call for a vote to release Apache Celeborn
> > > >
> > > > 0.4.1-rc1
> > > >
> > > >
> > > > The git tag to be voted upon:
> > > >
> > > > https://github.com/apache/celeborn/releases/tag/v0.4.1-rc1
> > > >
> > > > The git commit hash:
> > > > 641180142c5ef36430a6afcd702c9487a6007458
> > > >
> > > > Source and binary artifacts can be found at:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/celeborn/v0.4.1-rc1
> > > >
> > > > The staging repo:
> > > >
> > > >
> >
> https://repository.apache.org/content/repositories/orgapacheceleborn-1055
> > > >
> > > >
> > > > Fingerprint of the PGP key release artifacts are signed with:
> > > > D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> > > >
> > > > My public key to verify signatures can be found in:
> > > >
> > > > https://dist.apache.org/repos/dist/release/celeborn/KEYS
> > > >
> > > > The vote will be open for at least 72 hours or until the necessary
> > > > number of votes are reached.
> > > >
> > > > Please vote accordingly:
> > > >
> > > > [ ] +1 approve
> > > > [ ] +0 no opinion
> > > > [ ] -1 disapprove (and the reason)
> > > >
> > > > Steps to validate the release:
> > > >
> > > > https://www.apache.org/info/verification.html
> > > >
> > > > * Download links, checksums and PGP signatures are valid.
> > > > * Source code distributions have correct names matching the current
> > > > release.
> > > > * LICENSE and NOTICE files are correct.
> > > > * All files have license headers if necessary.
> > > > * No unlicensed compiled archives bundled in source archive.
> > > > * The source tarball matches the git tag.
> > > > * Build from source is successful.
> > > >
> > > > Regards,
> > > > Nicholas Jiang
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-04 Thread Mridul Muralidharan
Ah ! Then it makes sense to not include it :-)
Thanks for clarifying !

Regards,
Mridul


On Sat, May 4, 2024 at 4:15 AM Keyong Zhou  wrote:

> Actually it's the second one. For the first one I didn't send the draft
> to dev maillist for discussion because of lack of experience...
>
> Regards,
> Keyong Zhou
>
> On Fri, May 3, 2024 at 23:38 Mridul Muralidharan wrote:
>
> > Hi,
> >
> >   I meant call it out as part of the board report, so that it is captured
> > in our updates to board.
> >
> > This is the first update post TLP, right ?
> >
> > Regards,
> > Mridul
> >
> > On Fri, May 3, 2024 at 1:41 AM Keyong Zhou  wrote:
> >
> > > Hi Mridul,
> > >
> > > The news is posted in the following links:
> > >
> > > Apache.org:
> > >
> > > https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
> > >
> > > Newswire:
> > >
> > > https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
> > >
> > > X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
> > >
> > > LinkedIn:
> > > https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
> > >
> > > Besides, we also posted a blog here (in Chinese :D) :
> > > https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
> > >
> > > It would be great if we could publicize this more widely. Do you have any ideas? : )
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > On Fri, May 3, 2024 at 07:40 Mridul Muralidharan wrote:
> > >
> > > > Hi,
> > > >
> > > >   Do we want to call out graduation to TLP ?
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > > > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou wrote:
> > > >
> > > > > Hi community,
> > > > >
> > > > > > The board report is due on May 8th. Below is the draft I prepared;
> > > > > > any comments would be appreciated, thanks!
> > > > > >
> > > > > > ## Description:
> > > > > > The mission of Apache Celeborn is the creation and maintenance of
> > > > > > software related to an intermediate data service for big data
> > > > > > computing engines to boost performance, stability, and flexibility.
> > > > >
> > > > > ## Project Status:
> > > > > Current project status: New
> > > > > Issues for the board: None
> > > > >
> > > > > ## Membership Data:
> > > > > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > > > > > There are currently 21 committers and 13 PMC members in this project.
> > > > > The Committer-to-PMC ratio is roughly 3:2.
> > > > >
> > > > > Community changes, past quarter:
> > > > >
> > > > > - No new PMC members (project graduated recently).
> > > > > - Chandni Singh was added as committer on 2024-03-21.
> > > > > - Mridul Muralidharan was added as committer on 2024-04-29.
> > > > >
> > > > > ## Project Activity:
> > > > > Software development activity:
> > > > >
> > > > >  - We are preparing to release 0.4.1 in May.
> > > > >  - We are preparing to release 0.5.0 in May.
> > > > >  - Security support (authentication and SSL) has been merged.
> > > > >  - Memory storage is close to being merged.
> > > > >
> > > > > Meetups and Conferences:
> > > > >
> > > > >  - An online meetup was held on April 16th with some developers.
> > > > >  - An online meetup was held on April 25th with some users.
> > > > >
> > > > > Recent releases:
> > > > >
> > > > > - 0.4.0-incubating was released on 2024-02-06.
> > > > > - 0.3.2-incubating was released on 2024-01-08.
> > > > >
> > > > > ## Community Health:
> > > > > > Overall community health is good. In the past quarter, traffic on
> > > > > > the dev/issues/users mailing lists increased by 6%/11%/1100%
> > > > > > respectively. We expect issues traffic to be steady, but dev/users
> > > > > > traffic may fluctuate because many of the discussions happen in
> > > > > > Slack/WeChat/DingTalk. We are encouraging more discussion to happen
> > > > > > on the mailing lists.
> > > > > >
> > > > > > We have been performing extensive outreach to our users and
> > > > > > encouraging them to contribute back to the project. We are also
> > > > > > active in making our voice heard at various conferences to attract
> > > > > > more users.
> > > > >
> > > > > Regards,
> > > > > Keyong Zhou
> > > > >
> > > >
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-03 Thread Mridul Muralidharan
Hi,

  I meant call it out as part of the board report, so that it is captured
in our updates to board.

This is the first update post TLP, right ?

Regards,
Mridul

On Fri, May 3, 2024 at 1:41 AM Keyong Zhou  wrote:

> Hi Mridul,
>
> The news is posted in the following links:
>
> Apache.org:
>
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
>
> Newswire:
>
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
>
> X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
>
> LinkedIn:
> https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
>
> Besides, we also posted a blog here (in Chinese :D) :
> https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
>
> It would be great if we could publicize this more widely. Do you have any ideas? : )
>
> Regards,
> Keyong Zhou
>
> On Fri, May 3, 2024 at 07:40 Mridul Muralidharan wrote:
>
> > Hi,
> >
> >   Do we want to call out graduation to TLP ?
> >
> > Regards,
> > Mridul
> >
> > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:
> >
> > > Hi community,
> > >
> > > > The board report is due on May 8th. Below is the draft I prepared; any
> > > > comments would be appreciated, thanks!
> > > >
> > > > ## Description:
> > > > The mission of Apache Celeborn is the creation and maintenance of
> > > > software related to an intermediate data service for big data computing
> > > > engines to boost performance, stability, and flexibility.
> > >
> > > ## Project Status:
> > > Current project status: New
> > > Issues for the board: None
> > >
> > > ## Membership Data:
> > > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > > There are currently 21 committers and 13 PMC members in this project.
> > > The Committer-to-PMC ratio is roughly 3:2.
> > >
> > > Community changes, past quarter:
> > >
> > > - No new PMC members (project graduated recently).
> > > - Chandni Singh was added as committer on 2024-03-21.
> > > - Mridul Muralidharan was added as committer on 2024-04-29.
> > >
> > > ## Project Activity:
> > > Software development activity:
> > >
> > >  - We are preparing to release 0.4.1 in May.
> > >  - We are preparing to release 0.5.0 in May.
> > >  - Security support (authentication and SSL) has been merged.
> > >  - Memory storage is close to being merged.
> > >
> > > Meetups and Conferences:
> > >
> > >  - An online meetup was held on April 16th with some developers.
> > >  - An online meetup was held on April 25th with some users.
> > >
> > > Recent releases:
> > >
> > > - 0.4.0-incubating was released on 2024-02-06.
> > > - 0.3.2-incubating was released on 2024-01-08.
> > >
> > > ## Community Health:
> > > > Overall community health is good. In the past quarter, traffic on the
> > > > dev/issues/users mailing lists increased by 6%/11%/1100% respectively.
> > > > We expect issues traffic to be steady, but dev/users traffic may
> > > > fluctuate because many of the discussions happen in
> > > > Slack/WeChat/DingTalk. We are encouraging more discussion to happen on
> > > > the mailing lists.
> > > >
> > > > We have been performing extensive outreach to our users and encouraging
> > > > them to contribute back to the project. We are also active in making
> > > > our voice heard at various conferences to attract more users.
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> >
>


Re: [DRAFT] Celeborn Board Report

2024-05-02 Thread Mridul Muralidharan
Hi,

  Do we want to call out graduation to TLP ?

Regards,
Mridul

On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:

> Hi community,
>
> The board report is due on May 8th. Below is the draft I prepared; any
> comments would be appreciated, thanks!
>
> ## Description:
> The mission of Apache Celeborn is the creation and maintenance of software
> related to an intermediate data service for big data computing engines to
> boost performance, stability, and flexibility.
>
> ## Project Status:
> Current project status: New
> Issues for the board: None
>
> ## Membership Data:
> Apache Celeborn was founded 2024-03-20 (2 months ago).
> There are currently 21 committers and 13 PMC members in this project.
> The Committer-to-PMC ratio is roughly 3:2.
>
> Community changes, past quarter:
>
> - No new PMC members (project graduated recently).
> - Chandni Singh was added as committer on 2024-03-21.
> - Mridul Muralidharan was added as committer on 2024-04-29.
>
> ## Project Activity:
> Software development activity:
>
>  - We are preparing to release 0.4.1 in May.
>  - We are preparing to release 0.5.0 in May.
>  - Security support (authentication and SSL) has been merged.
>  - Memory storage is close to being merged.
>
> Meetups and Conferences:
>
>  - An online meetup was held on April 16th with some developers.
>  - An online meetup was held on April 25th with some users.
>
> Recent releases:
>
> - 0.4.0-incubating was released on 2024-02-06.
> - 0.3.2-incubating was released on 2024-01-08.
>
> ## Community Health:
> Overall community health is good. In the past quarter, traffic on the
> dev/issues/users mailing lists increased by 6%/11%/1100% respectively.
> We expect issues traffic to be steady, but dev/users traffic may fluctuate
> because many of the discussions happen in Slack/WeChat/DingTalk. We are
> encouraging more discussion to happen on the mailing lists.
>
> We have been performing extensive outreach to our users and encouraging
> them to contribute back to the project. We are also active in making our
> voice heard at various conferences to attract more users.
>
> Regards,
> Keyong Zhou
>


Re: [ANNOUNCE] Add Mridul Muralidharan as new committer

2024-04-28 Thread Mridul Muralidharan
Thank you everyone :-)
It has been a pleasure working with the Celeborn community, and I look
forward to continuing to learn from and contribute to the project !

Regards,
Mridul

On Mon, Apr 29, 2024 at 12:38 AM Fu Chen  wrote:

> Congratulations and thank you to Mridul for the contributions to the
> community!
>
> Regards,
> Fu Chen
>
> On Mon, Apr 29, 2024 at 12:10 Cheng Pan wrote:
> >
> > Congrats Mridul, your expertise in the Spark kernel and the security
> > area is impressive.
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Apr 29, 2024, at 09:21, Keyong Zhou  wrote:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Project Management Committee (PMC) for Apache Celeborn
> > > has invited Mridul Muralidharan to become a committer and we are
> > > pleased to announce that he has accepted.
> > >
> > > Being a committer enables easier contribution to the
> > > project since there is no need to go via the patch
> > > submission process. This should enable better productivity.
> > > A PMC member helps manage and guide the direction of the project.
> > >
> > > Please join me in congratulating Mridul Muralidharan!
> > >
> > > Regards,
> > > Keyong Zhou
> >
>


Re: [DISCUSS] Time for 0.4.1

2024-04-19 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Thu, Apr 18, 2024 at 11:50 PM Ethan Feng  wrote:

> +1
>
> Thanks,
> Ethan Feng
>
> On Tue, Apr 16, 2024 at 17:20 Yu Li wrote:
> >
> > +1, thanks for driving this and volunteering as our RM, Nicholas!
> >
> > Best Regards,
> > Yu
> >
> > On Sat, 13 Apr 2024 at 10:31, Keyong Zhou  wrote:
> > >
> > > +1, thanks Nicholas for volunteering!
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > On Fri, Apr 12, 2024 at 22:03 Shaoyun Chen wrote:
> > >
> > > > +1
> > > >
> > > > On Fri, Apr 12, 2024 at 20:04 Cheng Pan wrote:
> > > > >
> > > > > +1, we do need a patch release for 0.4
> > > > >
> > > > > Thanks,
> > > > > Cheng Pan
> > > > >
> > > > >
> > > > > > > On Apr 12, 2024, at 19:59, Nicholas Jiang <nicholasji...@apache.org> wrote:
> > > > > >
> > > > > > Hey, Celeborn community,
> > > > > >
> > > > > >
> > > > > > > It has been a while since the 0.4.0 release, and some critical
> > > > > > > fixes have landed on branch-0.4, for example,
> > > > > > > [CELEBORN-1252][FOLLOWUP] Fix Worker#computeResourceConsumption
> > > > > > > NullPointerException for userResourceConsumption that does not
> > > > > > > contain given userIdentifier. From my perspective, it's time to
> > > > > > > prepare for releasing 0.4.1.
> > > > > >
> > > > > >
> > > > > > > WDYT? I'm volunteering to be the release manager if no one else
> > > > > > > has volunteered.
> > > > > >
> > > > > > Regards,
> > > > > > Nicholas Jiang
> > > > >
> > > > >
> > > >
>


Re: [ANNOUNCE] Apache Celeborn is graduated to Top Level Project

2024-03-26 Thread Mridul Muralidharan
Congratulations !!

Regards,
Mridul


On Tue, Mar 26, 2024 at 11:54 PM Nicholas Jiang wrote:

> Congratulations! It is great to witness the continuous development of
> the community.
>
> Regards,
> Nicholas Jiang
>
> At 2024-03-25 20:49:36, "Ethan Feng"  wrote:
> >Hello Celeborn community,
> >
> >I am glad to share that the ASF board has approved a resolution to
> >graduate Celeborn into a full Top Level Project. Thank you all for
> >your help in reaching this milestone.
> >
> >To transition from the Apache Incubator to a new TLP, there are a few
> >action items[1] we need to complete. I have opened an
> >Umbrella Issue[2] to track the tasks, and you are welcome to take on
> >the sub-tasks and leave comments if I have missed anything.
> >
> >Additionally, the GitHub repository migration is already complete[3].
> >Please update your local git repository to track the new repo[4]. If
> >you named the upstream as "apache", you can run the following command
> >to complete the remote repo tracking migration.
> >
> >` git remote set-url apache g...@github.com:apache/celeborn.git `
> >
> >Please find the relevant URLs below:
> >[1] https://incubator.apache.org/guides/transferring.html#life_after_graduation
> >[2] https://github.com/apache/celeborn/issues/2415
> >[3] https://issues.apache.org/jira/browse/INFRA-25635
> >[4] https://github.com/apache/celeborn
> >
> >Thanks,
> >Ethan Feng
>


Re: Maven 'stuck' in service test compilation ?

2024-03-21 Thread Mridul Muralidharan
Hi Ethan,

  Thanks for checking !
It appears that my desktop java version had been upgraded, which resulted
in the failures ...
Reverting it back to java 8 fixed the issues seen.
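
A minimal sketch for checking which JDK is on the PATH before starting the build, assuming a POSIX shell. The helper name `java_major` is hypothetical and not part of the Celeborn build scripts, and the expectation of Java 8 reflects only what resolved the issue in this thread:

```shell
#!/bin/sh
# Hypothetical helper: print the major Java version for a version string,
# e.g. "1.8.0_392" -> 8 and "17.0.10" -> 17.
java_major() {
  jm_v="$1"
  case "$jm_v" in
    1.*) jm_v="${jm_v#1.}"; echo "${jm_v%%.*}" ;;   # legacy "1.x" scheme (Java 8 and older)
    *)   echo "${jm_v%%[.+-]*}" ;;                  # modern scheme (Java 9 and newer)
  esac
}

# Report the JDK found on PATH, if any, and warn when it is not Java 8.
if command -v java >/dev/null 2>&1; then
  found="$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')"
  major="$(java_major "$found")"
  echo "java on PATH: $found (major $major)"
  if [ "$major" != "8" ]; then
    echo "warning: not Java 8; consider pointing JAVA_HOME at a JDK 8 install" >&2
  fi
fi
```

Running this before `./build/mvn` makes an unnoticed JDK upgrade visible up front instead of surfacing as a build that never finishes.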

Regards,
Mridul


On Thu, Mar 21, 2024 at 7:27 AM Ethan Feng  wrote:

> Hi Mridul,
>
> I've tried your scripts in my local environment (JDK 8) and could not
> reproduce the problem. I tested with both Maven 3.8.8 and 3.9.6.
> Changing  might help as I've encountered some maven bugs
> before.
>
> I think more information is needed to reproduce this problem, such as
> the environment details, JDK version, etc.
>
> Regards,
> Ethan Feng
>
> On Thu, Mar 21, 2024 at 15:25 Mridul Muralidharan wrote:
> >
> > Hi,
> >
> >
> >   I am observing that a maven build gets 'stuck' when compiling
> > "services" for running tests.
> > Without tests, this goes through:
> >
> > $ ARGS="-Pspark-3.1"; ./build/mvn $ARGS clean 2>&1 | tee clean_output.txt
> > && ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt
> >
> > This gets stuck indefinitely:
> >
> > $ ARGS="-Pspark-3.1"; ./build/mvn  $ARGS package 2>&1 | tee test_output.txt
> > See [1] for output snippet.
> >
> > Strangely, running with -X seemed to be fine (the one time I tried it).
> >
> > I have made some dependency changes to pom.xml, but no changes to
> > service module.
> >
> > Anything I am missing here ? Any hints would be greatly appreciated :-)
> >
> > Thanks !
> > Mridul
> >
> > [1]
> >
> > [INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @
> > celeborn-service_2.12 ---
> > [INFO] Using 'UTF-8' encoding to copy filtered resources.
> > [INFO] Using 'UTF-8' encoding to copy filtered properties files.
> > [INFO] Copying 1 resource
> > [INFO] Copying 3 resources
> > [INFO]
> > [INFO] --- scala-maven-plugin:4.7.2:compile (scala-compile-first) @
> > celeborn-service_2.12 ---
> > [INFO] Using incremental compilation using Mixed compile order
> > [INFO] Compiler bridge file:
> >
> /home/mridul/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.7.1-bin_2.12.10__61.0-1.7.1_20220712T022208.jar
> > [INFO] compiler plugin:
> > BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
> > [INFO] compiling 4 Scala sources and 9 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 2 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 1 Scala source and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] NoPosition: Note: Some input files use unchecked or unsafe
> > operations.
> > [INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 2 Java sources to
> >
> /home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
> > ...
> > [INFO] done compiling
> > [INFO] compiling 5 Scala sources and 5 Java sources to
> >
> /h

Re: [ANNOUNCE] Add Chandni Singh as new committer

2024-03-21 Thread Mridul Muralidharan
Congratulations Chandni ! Great job :-)

Regards,
Mridul


On Thu, Mar 21, 2024 at 3:30 AM Keyong Zhou  wrote:

> Hi Celeborn Community,
>
> The Podling Project Management Committee (PPMC) for Apache Celeborn
> has invited Chandni Singh to become a committer and we are pleased
> to announce that she has accepted.
>
> Being a committer enables easier contribution to the
> project since there is no need to go via the patch
> submission process. This should enable better productivity.
> A (P)PMC member helps manage and guide the direction of the project.
>
> Please join me in congratulating Chandni Singh!
>
> Thanks,
> Keyong Zhou
>


Maven 'stuck' in service test compilation ?

2024-03-21 Thread Mridul Muralidharan
Hi,


  I am observing that a maven build gets 'stuck' when compiling "services"
for running tests.
Without tests, this goes through:

$ ARGS="-Pspark-3.1"; ./build/mvn $ARGS clean 2>&1 | tee clean_output.txt
&& ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt

This gets stuck indefinitely:

$ ARGS="-Pspark-3.1"; ./build/mvn  $ARGS package 2>&1 | tee test_output.txt
See [1] for output snippet.

Strangely, running with -X seemed to be fine (the one time I tried it).

I have made some dependency changes to pom.xml, but no changes to
service module.

Anything I am missing here ? Any hints would be greatly appreciated :-)

Thanks !
Mridul

[1]

[INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @
celeborn-service_2.12 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Using 'UTF-8' encoding to copy filtered properties files.
[INFO] Copying 1 resource
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:4.7.2:compile (scala-compile-first) @
celeborn-service_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file:
/home/mridul/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.7.1-bin_2.12.10__61.0-1.7.1_20220712T022208.jar
[INFO] compiler plugin:
BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] compiling 4 Scala sources and 9 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 2 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 1 Scala source and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling
[INFO] compiling 5 Scala sources and 2 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] done compiling
[INFO] compiling 5 Scala sources and 5 Java sources to
/home/mridul/work/apache/celeborn/incubator-celeborn/service/target/classes
...
[INFO] NoPosition: Note: Some input files use unchecked or unsafe
operations.
[INFO] NoPosition: Note: Recompile with -Xlint:unchecked for details.
[INFO] done compiling

And then keeps indefinitely repeating this.


Re: [VOTE] Graduate Apache Celeborn (incubating) as a TLP - Community

2024-03-01 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Fri, Mar 1, 2024 at 4:35 AM Nicholas  wrote:

>
> +1.
>
>
> Regards,
> Nicholas Jiang
>
>
>
>
> --
> Sent from my NetEase Mail mobile app
> 
>
>
> - Original Message -
> From: "Yu Li" 
> To: dev@celeborn.apache.org
> Sent: Fri, 1 Mar 2024 16:52:10 +0800
> Subject: [VOTE] Graduate Apache Celeborn (incubating) as a TLP - Community
>
> Hi All,
>
> After a thorough discussion [1], I'd like to call a formal vote to
> graduate Apache Celeborn (incubating) as a TLP. Below are some facts
> and project highlights carried from [1] as well as the draft
> resolution:
>
> - Currently, our community consists of 19 committers (including
> mentors) from more than 10 companies, with 12 serving as PPMC members.
> - So far, we have boasted 81 contributors.
> - Throughout the incubation period, we've made 6 releases in 16
> months, at a stable pace.
> - We've had 6 different release managers to date.
> - Our software is used in production by 10+ well known entities.
> - As yet, we have opened 1,286 issues with 1,176 successfully resolved.
> - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> merged or closed.
> - Through self-assessment [2], we have met all maturity criteria as
> outlined in [3].
>
> We've resolved all branding issues which include Logo, GitHub repo,
> document, website, and others [4] [5].
>
> --
> Establish the Apache Celeborn Project
>
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to an intermediate data service for big data computing engines
> to boost performance, stability, and flexibility.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
>
> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> for the creation and maintenance of software related to an intermediate
> data service for big data computing engines to boost performance,
> stability, and flexibility; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Celeborn
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Celeborn
> Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Celeborn
> Project:
>
>  * Becket Qin
>  * Cheng Pan 
>  * Duo Zhang 
>  * Ethan Feng
>  * Fu Chen   
>  * Jiashu Xiong  
>  * Kerwin Zhang  
>  * Keyong Zhou   
>  * Lidong Dai
>  * Willem Ning Jiang 
>  * Wu Wei
>  * Yi Zhu
>  * Yu Li 
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Keyong Zhou be appointed to
> the office of Vice President, Apache Celeborn, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
>
> RESOLVED, that the Apache Celeborn Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Celeborn
> podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Celeborn podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
> --
>
> Best Regards,
> Yu
>
> [1] https://lists.apache.org/thread/z17rs0mw4nyv0s112dklmv7s3j053mby
> [2]
> https://cwiki.apache.org/confluence/display/CELEBORN/Apache+Maturity+Model+Assessment+for+Celeborn
> [3]
> https://community.apache.org/apache-way/apache-project-maturity-model.html
> [4] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> [5] https://whimsy.apache.org/pods/project/celeborn
>


Re: [DISCUSS] Graduate Celeborn as TLP

2024-02-28 Thread Mridul Muralidharan
+1
Looking forward to Celeborn as a TLP !

Best wishes to the community :-)

Regards,
Mridul


On Tue, Feb 27, 2024 at 5:23 AM Willem Jiang  wrote:

> Thanks for the clarification. Now we are good to go.
>
> Willem Jiang
>
>
>
> On Tue, Feb 27, 2024 at 7:15 PM Keyong Zhou  wrote:
> >
> > Thanks Willem for the information, as Cheng said, we didn't start the
> > registration process before :)
> >
> > Best,
> > Keyong Zhou
> >
> > On Tue, Feb 27, 2024 at 18:48 Willem Jiang wrote:
> >
> > > It's OK if we don't register any trademark of Celeborn.
> > > If we already registered the trademark of Celeborn, we need to have
> > > the approval of the trademark VP.
> > >
> > > Willem Jiang
> > >
> > >
> > > On Tue, Feb 27, 2024 at 6:19 PM Cheng Pan  wrote:
> > > >
> > > > Hi Willem,
> > > >
> > > > Regarding trademark concerns, the name "Apache Celeborn" has been
> > > > approved by the ASF[1]. Do we need any additional work?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > > >
> > > > Thanks,
> > > > Cheng Pan
> > > >
> > > >
> > > > > On Feb 27, 2024, at 17:45, Willem Jiang wrote:
> > > > >
> > > > > +1, it's good to see Celeborn is ready for graduation.
> > > > >
> > > > > I have a quick question about Celeborn's trademark. Did we start
> > > > > the registration process before?
> > > > >
> > > > > BTW  the podling name search is approved by trademark VP [1]
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > > > >
> > > > > Willem Jiang
> > > > >
> > > > > On Tue, Feb 27, 2024 at 9:40 AM Yu Li  wrote:
> > > > >>
> > > > >> Dear Celeborn Devs,
> > > > >>
> > > > > >> We, the Celeborn community, began our incubation journey on October
> > > > > >> 18, 2022. Since then, with the continuous efforts of you all, our
> > > > > >> community has steadily developed and gradually matured, approaching
> > > > > >> the graduation criteria [1]. Therefore, I'd like to call a discussion
> > > > > >> to graduate Celeborn as TLP. Below are some statistics I collected,
> > > > > >> please check it and let me know your thoughts.
> > > > > >>
> > > > > >> - Currently, our community consists of 19 committers (including
> > > > > >> mentors) from more than 10 companies, with 12 serving as PPMC members
> > > > > >> [2].
> > > > > >> - So far, we have boasted 81 contributors.
> > > > > >> - Throughout the incubation period, we've made 6 releases [3] in 16
> > > > > >> months, at a stable pace.
> > > > > >> - We've had 6 different release managers to date.
> > > > > >> - Our software is used in production by 10+ well known entities [4].
> > > > > >> - As yet, we have opened 1,286 issues with 1,176 successfully
> > > > > >> resolved [5].
> > > > > >> - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> > > > > >> merged or closed [6].
> > > > > >> - Through self-assessment [7], we have met all maturity criteria as
> > > > > >> outlined in [1].
> > > > >>
> > > > >> And below is the drafted graduation resolution, JFYI:
> > > > >> --
> > > > >> Establish the Apache Celeborn Project
> > > > >>
> > > > > >> WHEREAS, the Board of Directors deems it to be in the best interests of
> > > > > >> the Foundation and consistent with the Foundation's purpose to establish
> > > > > >> a Project Management Committee charged with the creation and maintenance
> > > > > >> of open-source software, for distribution at no charge to the public,
> > > > > >> related to an intermediate data service for big data computing engines
> > > > > >> to boost performance, stability, and flexibility.
> > > > > >>
> > > > > >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> > > > > >> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> > > > > >> established pursuant to Bylaws of the Foundation; and be it further
> > > > > >>
> > > > > >> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> > > > > >> for the creation and maintenance of software related to an intermediate
> > > > > >> data service for big data computing engines to boost performance,
> > > > > >> stability, and flexibility; and be it further
> > > > > >>
> > > > > >> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> > > > > >> hereby is created, the person holding such office to serve at the
> > > > > >> direction of the Board of Directors as the chair of the Apache Celeborn
> > > > > >> Project, and to have primary responsibility for management of the
> > > > > >> projects within the scope of responsibility of the Apache Celeborn
> > > > > >> Project; and be it further
> > > > > >>
> > > > > >> RESOLVED, that the persons listed immediately below be and hereby are
> > > > > >> appointed to serve as the initial members of the Apache Celeborn
> > > > > >> Project:
> > > > >>
> > > > >> * Becket Qin
> > > > >> * Cheng Pan 
> > > > >> * Duo Zhang 
> > > > >> * Ethan Feng
> > > > >> * Fu Chen   
> > > > >> * Jiashu Xiong  
> > > > 

Re: [ANNONCE] New PPMC member: Fu Chen

2024-02-19 Thread Mridul Muralidharan
Congratulations !

Regards,
Mridul

On Mon, Feb 19, 2024 at 7:46 PM Cheng Pan  wrote:

> Congrats!
>
> Thanks,
> Cheng Pan
>
>
> > On Feb 20, 2024, at 08:22, Nicholas  wrote:
> >
> > Congratulations to Fu Chen!
> >
> > Regards,
> > Nicholas Jiang
> >
> >
> >
> >
> > At 2024-02-20 00:23:06, "Shaoyun Chen"  wrote:
> >> Congratulations!
> >>
> >> On Mon, Feb 19, 2024 at 21:16 Keyong Zhou wrote:
> >>>
> >>> Hi Celeborn Community,
> >>>
> >>> The Podling Project Management Committee (PPMC) for Apache Celeborn
> >>> has invited Fu Chen to become our PPMC member and
> >>> we are pleased to announce that he has accepted.
> >>>
> >>> Fu Chen has been actively contributing to the Celeborn community for more
> >>> than one year [1], including SBT build,
> >>> performance improvement, code refactor, bug fixes, code reviews, design
> >>> discussion, docs, etc.
> >>>
> >>> Please join me in congratulating Fu Chen!
> >>>
> >>> Being a committer enables easier contribution to the
> >>> project since there is no need to go via the patch
> >>> submission process. This should enable better productivity.
> >>> A PPMC member helps manage and guide the direction of the project.
> >>>
> >>> [1]
> https://github.com/apache/incubator-celeborn/commits?author=cfmcgrady
> >>>
> >>> Thanks,
> >>> On behalf of the Apache Celeborn PPMC
>
>


Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
Hi,

  I am fine with either actually - though more used to jira personally :-)
(github issues has a nice integration with PRs, which has been useful
though).
The main reason I asked is what Nicholas clarified - I saw a
nontrivial number of github issue related mails, and was not sure if we
were moving to using that !

Thanks,
Mridul


On Wed, Feb 7, 2024 at 12:52 AM Keyong Zhou  wrote:

> Hi Mridul,
>
> Thanks for asking. In fact at the time when donating Celeborn to ASF
> incubator we had a discussion whether to use JIRA or
> Github for issue tracking and we decided to choose JIRA at last. Seems
> different projects have different preferences. Maybe
> newer projects tend to use Github.
>
> To me, I'm actually fine with both. JIRA works well so far, will using
> Github be more beneficial? Glad to hear about your opinion.
>
> Thanks,
> Keyong Zhou
>
> On Wed, Feb 7, 2024 at 14:03, Mridul Muralidharan wrote:
>
> >   Looks like I am wrong, github issues can be used [1].
> > Is Celeborn planning to use github issues going forward ?
> >
> > Regards,
> > Mridul
> >
> >
> > [1] https://www.apache.org/dev/#issues
> >
> >
> > On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
> > wrote:
> >
> > > Hi,
> > >
> > >   I received a fairly large number of emails to
> > > incubator-celeb...@noreply.github.com, which typically are for PR's.
> > > They appear to be github issues - are we trying to move to github
> issues
> > > instead of Apache jira ? IIRC there is a policy to use jira for
> tracking
> > > bugs/improvements, right ?
> > >
> > > Regards,
> > > Mridul
> > >
> >
>


Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
  Looks like I am wrong, github issues can be used [1].
Is Celeborn planning to use github issues going forward ?

Regards,
Mridul


[1] https://www.apache.org/dev/#issues


On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
wrote:

> Hi,
>
>   I received a fairly large number of emails to
> incubator-celeb...@noreply.github.com, which typically are for PR's.
> They appear to be github issues - are we trying to move to github issues
> instead of Apache jira ? IIRC there is a policy to use jira for tracking
> bugs/improvements, right ?
>
> Regards,
> Mridul
>


Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Mridul Muralidharan
Hi,

  I received a fairly large number of emails to
incubator-celeb...@noreply.github.com, which typically are for PR's.
They appear to be github issues - are we trying to move to github issues
instead of Apache jira ? IIRC there is a policy to use jira for tracking
bugs/improvements, right ?

Regards,
Mridul


Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.2-incubating-rc0

2023-12-19 Thread Mridul Muralidharan
+1

Signatures, digests, license, etc check out fine.
Checked out tag and build/tested with -Pspark3.1 and -Pflink-1.17

Regards,
Mridul


On Tue, Dec 19, 2023 at 8:06 PM rexxiong  wrote:

> +1 (binding)
> I checked
> - Download links are valid.
> - git commit hash is correct
> - Checksums and signatures are valid.
> - No binary files in the source release
> - Files have the word incubating in their name.
> - DISCLAIMER,LICENSE and NOTICE files exist.
> - Successfully built the binary from the source on MacOs with Command:
> ./build/make-distribution.sh --release
>
> Thanks,
> Jiashu Xiong
>
> On Wed, Dec 20, 2023 at 00:18, Fu Chen wrote:
>
> > +1
> >
> > I checked
> > - download links are valid.
> > - git commit hash is correct.
> > - no binary files in the source release.
> > - signatures are good.
> > ```
> > gpg --import KEYS
> > gpg --verify apache-celeborn-0.3.2-incubating-source.tgz.asc
> > gpg --verify apache-celeborn-0.3.2-incubating-bin.tgz.asc
> > ```
> > - checksums are good.
> > ```
> > sha512sum --check apache-celeborn-0.3.2-incubating-source.tgz.sha512
> > sha512sum --check apache-celeborn-0.3.2-incubating-bin.tgz.sha512
> > ```
> > - build success from source code (Pop!_OS 22.04 LTS).
> > ```
> > ./build/mvn clean package -DskipTests -Pspark-3.4
> > ```
> >
> > On Tue, Dec 19, 2023 at 22:45, Shaoyun Chen wrote:
> >
> > > +1 (non-binding)
> > >
> > > I checked the following things:
> > >
> > > - signatures are good.
> > > ```
> > > gpg --import KEYS
> > > gpg --verify apache-celeborn-0.3.2-incubating-source.tgz.asc
> > > gpg --verify apache-celeborn-0.3.2-incubating-bin.tgz.asc
> > > ```
> > > - checksums are good.
> > > ```
> > > sha512sum --check apache-celeborn-0.3.2-incubating-source.tgz.sha512
> > > sha512sum --check apache-celeborn-0.3.2-incubating-bin.tgz.sha512
> > > ```
> > > - build success from source code.
> > > ```
> > > ./build/make-distribution.sh -Pspark-3.2
> > > ./build/make-distribution.sh --release
> > > ```
> > >
> > > On Tue, Dec 19, 2023 at 20:56, Yihe Li wrote:
> > > >
> > > > +1 (non-binding)
> > > > I checked the following things:
> > > > - git commit hash is correct.
> > > > - download links are valid.
> > > > - release files are in correct location.
> > > > - release files have the word incubating in their name.
> > > > - signatures and checksums are good.
> > > > - DISCLAIMER, LICENSE and NOTICE files exist.
> > > > - build success from source code(ubuntu 16.04).
> > > > ```
> > > > ./build/make-distribution.sh --release
> > > > ./build/make-distribution.sh -Pspark-3.3
> > > > ```
> > > >
> > > > Thanks,
> > > > Yihe Li
> > > >
> > > > On 2023/12/19 06:37:40 Ethan Feng wrote:
> > > > > +1(binding)
> > > > >
> > > > > I checked:
> > > > > √ release files in the correct location
> > > > > √ release files have the word incubating
> > > > > √ digital signature and hashes correct
> > > > > √ DISCLAIMER file exist
> > > > > √ LICENSE and NOTICE files exist and are correct
> > > > > √ the contents of the release match the tag in VCS
> > > > > √ can build the release from the source
> > > > > √ maven artifacts looks correct
> > > > >
> > > > > Thanks,
> > > > > Ethan Feng
> > > > >
> > > > > On Tue, Dec 19, 2023 at 12:32, Nicholas Jiang wrote:
> > > > > >
> > > > > > Hi Celeborn community,
> > > > > >
> > > > > >
> > > > > > This is a call for a vote to release Apache Celeborn (Incubating)
> > > > > > 0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The git tag to be voted upon:
> > > > > >
> > >
> >
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The git commit hash:
> > > > > > d43411b22adf24679c27004a08e813ab278eaaa3 source and binary
> > artifacts
> > > can be
> > > > > > found at:
> > > > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.2-incubating-rc0
> > > > > >
> > > > > >
> > > > > > The staging repo:
> > > > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapacheceleborn-1041
> > > > > >
> > > > > >
> > > > > > Fingerprint of the PGP key release artifacts are signed with:
> > > > > > D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> > > > > >
> > > > > >
> > > > > > My public key to verify signatures can be found in:
> > > > > >
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> > > > > >
> > > > > >
> > > > > > The vote will be open for at least 72 hours or until the
> necessary
> > > > > > number of votes are reached.
> > > > > >
> > > > > >
> > > > > > Please vote accordingly:
> > > > > >
> > > > > >
> > > > > > [ ] +1 approve
> > > > > > [ ] +0 no opinion
> > > > > > [ ] -1 disapprove (and the reason)
> > > > > >
> > > > > >
> > > > > > Checklist for release:
> > > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> > > > > >
> > > > > >
> > > > > > Steps to validate the release:
> > > > > > https://www.apache.org/info/verification.html
> > > > > >
> > > > > >
> > > > > > * Download links, checksums and PGP signatures

Re: [DISCUSS] Time for 0.3.2

2023-12-06 Thread Mridul Muralidharan
+1 on 0.3.2, thanks Nicholas !

Regards,
Mridul


On Thu, Dec 7, 2023 at 12:51 AM Cheng Pan  wrote:

> +1, thanks for volunteering.
>
> Feel free to ping me if you encounter permission issues during the release
> phase.
>
> Thanks,
> Cheng Pan
>
>
> > On Dec 7, 2023, at 14:31, Nicholas  wrote:
> >
> > Hey, Celeborn community,
> >
> > It has been a while since the 0.3.1 release, and some critical fixes
> have landed in branch-0.3, for example, [CELEBORN-1037] Incorrect output for
> metrics of Prometheus. From my perspective, it’s time to prepare for
> releasing 0.3.2.
> >
> > WDYT? And I’m volunteering to be the release manager if no one has
> applied.
> >
> > Regards,
> > Nicholas Jiang
>
>
>


Re: [ANNOUNCE] Add Yihe Li as new committer

2023-11-22 Thread Mridul Muralidharan
Congratulations Yihe Li !

Regards,
Mridul

On Wed, Nov 22, 2023 at 2:08 AM Yu Li  wrote:

> Congratulations, Yihe!
>
> Best Regards,
> Yu
>
>
> On Fri, 17 Nov 2023 at 15:32, Shaoyun Chen  wrote:
>
> > Congrats!
> >
> > On Thu, Nov 16, 2023 at 20:25, Keyong Zhou wrote:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Podling Project Management Committee (PPMC) for Apache Celeborn
> > > has invited Yihe Li to become a committer and we are pleased
> > > to announce that he has accepted.
> > >
> > > Being a committer enables easier contribution to the
> > > project since there is no need to go via the patch
> > > submission process. This should enable better productivity.
> > > A (P)PMC member helps manage and guide the direction of the project.
> > >
> > > Please join me in congratulating Yihe Li!
> > >
> > > Thanks,
> > > Keyong Zhou
> >
>


Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-11-03 Thread Mridul Muralidharan
Yes, DAGScheduler is dealing with it at a stage level - and so individual
RDD’s DeterministicLevel  would be handled in order to determine the
stage’s level.
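
To make the propagation concrete, here is a minimal toy sketch. It does not
use Spark classes: Level, Node and levelOf are hypothetical names made up for
illustration, and the model ignores details the real implementation also
considers (for example shuffle key ordering and reliable checkpointing). It
only shows the "worst level among the parents wins" idea, with the stage's
level taken to be the level of its final node.

```scala
// Toy model only: these types are hypothetical and are not Spark classes.
object DeterminismSketch {
  // Ordered from most to least deterministic, mirroring the three level names.
  sealed abstract class Level(val rank: Int)
  case object Determinate extends Level(0)
  case object Unordered extends Level(1)
  case object Indeterminate extends Level(2)

  // A node in the lineage; `declared` plays the role of an RDD overriding its level.
  final case class Node(
      name: String,
      parents: List[Node] = Nil,
      declared: Option[Level] = None)

  // A node's level is what it declares, else the least deterministic level among its parents.
  def levelOf(node: Node): Level = node.declared.getOrElse {
    if (node.parents.isEmpty) Determinate
    else node.parents.map(levelOf).maxBy(_.rank)
  }
}
```

In this model, a node downstream of an Indeterminate node is itself reported
as Indeterminate unless it declares otherwise, which matches the observation
quoted below.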

Regards,
Mridul


On Fri, Nov 3, 2023 at 9:45 AM Keyong Zhou  wrote:

> I checked RDD#getOutputDeterministicLevel and found that if an RDD's
> upstream is INDETERMINATE,
> then it's also INDETERMINATE.
>
> Thanks,
> Keyong Zhou
>
> On Fri, Nov 3, 2023 at 19:57, Keyong Zhou wrote:
>
> > Hi Mridul,
> >
> > I still have a question. DAGScheduler#submitMissingTasks will
> > only unregisterAllMapAndMergeOutput
> > if the current ShuffleMapStage is Indeterminate. What if the current
> stage
> > is determinate, but its
> > upstream stage is Indeterminate, and its upstream stage is rerun?
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Fri, Oct 20, 2023 at 11:15, Mridul Muralidharan wrote:
> >
> >> To add my response - what I described (w.r.t failing job) applies only
> to
> >> ResultStage.
> >> It walks the lineage DAG to identify all indeterminate parents to
> >> rollback.
> >> If there are only ShuffleMapStages in the set of stages to rollback, it
> >> will simply discard their output, rollback all of them, and then retry
> >> these stages (same shuffle-id, a new stage attempt)
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
> >> wrote:
> >>
> >> >
> >> > Good question, and ResultStage is actually special cased in spark as
> its
> >> > output could have already been consumed (for example collect() to
> >> driver,
> >> > etc) - and so if it is one of the stages which needs to be rolled
> back,
> >> the
> >> > job is aborted.
> >> >
> >> > To illustrate, see the following:
> >> > -- snip --
> >> >
> >> > package org.apache.spark
> >> >
> >> >
> >> > import scala.reflect.ClassTag
> >> >
> >> > import org.apache.spark._
> >> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
> >> >
> >> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends
> >> RDD[E](delegate) {
> >> >
> >> >   override def compute(split: Partition, context: TaskContext):
> >> Iterator[E] = {
> >> > delegate.compute(split, context)
> >> >   }
> >> >
> >> >   override protected def getPartitions: Array[Partition] =
> >> > delegate.partitions
> >> > }
> >> >
> >> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends
> >> DelegatingRDD[E](delegate) {
> >> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
> >> DeterministicLevel.INDETERMINATE
> >> > }
> >> >
> >> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends
> >> DelegatingRDD[E](delegate) {
> >> >   override def compute(split: Partition, context: TaskContext):
> >> Iterator[E] = {
> >> > val tc = TaskContext.get
> >> > if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 &&
> >> tc.attemptNumber() == 0) {
> >> >   // Wait for all tasks to be done, then call exit
> >> >   Thread.sleep(5000)
> >> >   System.exit(-1)
> >> > }
> >> > delegate.compute(split, context)
> >> >   }
> >> > }
> >> >
> >> > // Make sure test_output directory is deleted before running this.
> >> > //
> >> > object Test {
> >> >
> >> >   def main(args: Array[String]): Unit = {
> >> > val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
> >> > val sc = new SparkContext(conf)
> >> >
> >> > val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1,
> >> 20).map(v => (v, v)))
> >> > val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> >> > resultRdd.saveAsTextFile("test_output")
> >> >   }
> >> > }
> >> >
> >> > -- snip --
> >> >
> >> >
> >> >
> >> > Here, the mapper stage has been forced to be INDETERMINATE.
> >> > In the reducer stage, the first attempt to compute partition 0 will
> >> wait for a bit and then exit - since the master is a local-cluster, thi

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
To add my response - what I described (w.r.t failing job) applies only to
ResultStage.
It walks the lineage DAG to identify all indeterminate parents to rollback.
If there are only ShuffleMapStages in the set of stages to rollback, it
will simply discard their output, rollback all of them, and then retry
these stages (same shuffle-id, a new stage attempt)
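
As a rough illustration of that rule, here is a toy sketch. It is not the
DAGScheduler code: Stage, MapStage, ResultStage, Decision and decide are
hypothetical names, and the input list stands for the set of stages already
identified for rollback.

```scala
// Toy model only: hypothetical types, not Spark's scheduler classes.
object RollbackSketch {
  sealed trait Stage { def id: Int }
  final case class MapStage(id: Int, shuffleId: Int, attempt: Int) extends Stage
  final case class ResultStage(id: Int) extends Stage

  sealed trait Decision
  final case class AbortJob(reason: String) extends Decision
  final case class Retry(stages: List[MapStage]) extends Decision

  // If any stage to roll back is a ResultStage, the job is aborted.
  // Otherwise every map stage is retried: same shuffleId, next attempt number.
  def decide(toRollback: List[Stage]): Decision =
    toRollback.collectFirst { case r: ResultStage => r } match {
      case Some(r) => AbortJob(s"cannot roll back ResultStage ${r.id}")
      case None =>
        val retried = toRollback.collect {
          case m: MapStage => m.copy(attempt = m.attempt + 1)
        }
        Retry(retried)
    }
}
```

For example, decide(List(MapStage(0, 7, 0), ResultStage(2))) yields an
AbortJob, while a list containing only MapStage entries yields a Retry with
each attempt incremented and the shuffleId unchanged.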


Regards,
Mridul



On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
wrote:

>
> Good question, and ResultStage is actually special cased in spark as its
> output could have already been consumed (for example collect() to driver,
> etc) - and so if it is one of the stages which needs to be rolled back, the
> job is aborted.
>
> To illustrate, see the following:
> -- snip --
>
> package org.apache.spark
>
>
> import scala.reflect.ClassTag
>
> import org.apache.spark._
> import org.apache.spark.rdd.{DeterministicLevel, RDD}
>
> class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
>
>   override def compute(split: Partition, context: TaskContext): Iterator[E] = 
> {
> delegate.compute(split, context)
>   }
>
>   override protected def getPartitions: Array[Partition] =
> delegate.partitions
> }
>
> class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends 
> DelegatingRDD[E](delegate) {
>   override def getOutputDeterministicLevel: DeterministicLevel.Value = 
> DeterministicLevel.INDETERMINATE
> }
>
> class FailingRDD[E: ClassTag](delegate: RDD[E]) extends 
> DelegatingRDD[E](delegate) {
>   override def compute(split: Partition, context: TaskContext): Iterator[E] = 
> {
> val tc = TaskContext.get
> if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && 
> tc.attemptNumber() == 0) {
>   // Wait for all tasks to be done, then call exit
>   Thread.sleep(5000)
>   System.exit(-1)
> }
> delegate.compute(split, context)
>   }
> }
>
> // Make sure test_output directory is deleted before running this.
> //
> object Test {
>
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
> val sc = new SparkContext(conf)
>
> val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 
> 20).map(v => (v, v)))
> val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> resultRdd.saveAsTextFile("test_output")
>   }
> }
>
> -- snip --
>
>
>
> Here, the mapper stage has been forced to be INDETERMINATE.
> In the reducer stage, the first attempt to compute partition 0 will wait for 
> a bit and then exit - since the master is a local-cluster, this results in 
> FetchFailure when the second attempt of partition 0 tries to fetch shuffle 
> data.
> When spark tries to regenerate parent shuffle output, it sees that the parent 
> is INDETERMINATE - and so fails the entire job with the message:
> "
> org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle 
> map stage with indeterminate output was failed and retried. However, Spark 
> cannot rollback the ResultStage 1 to re-process the input data, and has to 
> fail this job. Please eliminate the indeterminacy by checkpointing the RDD 
> before repartition and try again.
> "
>
> This is coming from here 
> <https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
>  - when rolling back stages, if spark determines that a ResultStage needs to 
> be rolled back due to loss of INDETERMINATE output, it will fail the job.
>
> Hope this clarifies.
> Regards,
> Mridul
>
>
> On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:
>
>> In fact, I'm wondering if Spark will rerun the whole reduce
>> ShuffleMapStage
>> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
>>
>> On Thu, Oct 19, 2023 at 23:00, Keyong Zhou wrote:
>>
>> > Thanks Erik for bringing up this question, I'm also curious about the
>> > answer, any feedback is appreciated.
>> >
>> > Thanks,
>> > Keyong Zhou
>> >
>> > On Thu, Oct 19, 2023 at 22:16, Erik fang wrote:
>> >
>> >> Mridul,
>> >>
>> >> sure, I totally agree SPARK-25299 is a much better solution, as long
>> as we
>> >> can get it from spark community
>> >> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
>> >> celeborn already has spark-integration code with  [spark] scope)
>> >>
>> >> I also have a question about INDETERMINATE stage recompute, and may
>> need
>> >> your help
>>

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Mridul Muralidharan
Good question, and ResultStage is actually special cased in spark as its
output could have already been consumed (for example collect() to driver,
etc) - and so if it is one of the stages which needs to be rolled back, the
job is aborted.

To illustrate, see the following:
-- snip --

package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark._
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Forwards compute and partitions to the wrapped RDD.
class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {

  override def compute(split: Partition, context: TaskContext): Iterator[E] = {
    delegate.compute(split, context)
  }

  override protected def getPartitions: Array[Partition] =
    delegate.partitions
}

// Forces the output of the wrapped RDD to be treated as INDETERMINATE.
class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
  override def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}

// Kills the executor on the first attempt of partition 0 in the first stage attempt.
class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
  override def compute(split: Partition, context: TaskContext): Iterator[E] = {
    val tc = TaskContext.get
    if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 &&
        tc.attemptNumber() == 0) {
      // Wait for all tasks to be done, then call exit
      Thread.sleep(5000)
      System.exit(-1)
    }
    delegate.compute(split, context)
  }
}

// Make sure test_output directory is deleted before running this.
//
object Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
    val sc = new SparkContext(conf)

    val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
    val resultRdd = new FailingRDD(mapperRdd.groupByKey())
    resultRdd.saveAsTextFile("test_output")
  }
}

-- snip --



Here, the mapper stage has been forced to be INDETERMINATE.
In the reducer stage, the first attempt to compute partition 0 will
wait for a bit and then exit - since the master is a local-cluster,
this results in FetchFailure when the second attempt of partition 0
tries to fetch shuffle data.
When spark tries to regenerate parent shuffle output, it sees that the
parent is INDETERMINATE - and so fails the entire job with the
message:
"
org.apache.spark.SparkException: Job aborted due to stage failure: A
shuffle map stage with indeterminate output was failed and retried.
However, Spark cannot rollback the ResultStage 1 to re-process the
input data, and has to fail this job. Please eliminate the
indeterminacy by checkpointing the RDD before repartition and try
again.
"

This is coming from here
<https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
- when rolling back stages, if spark determines that a ResultStage
needs to be rolled back due to loss of INDETERMINATE output, it will
fail the job.

Hope this clarifies.
Regards,
Mridul


On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:

> In fact, I'm wondering if Spark will rerun the whole reduce ShuffleMapStage
> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
>
> On Thu, Oct 19, 2023 at 23:00, Keyong Zhou wrote:
>
> > Thanks Erik for bringing up this question, I'm also curious about the
> > answer, any feedback is appreciated.
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Thu, Oct 19, 2023 at 22:16, Erik fang wrote:
> >
> >> Mridul,
> >>
> >> sure, I totally agree SPARK-25299 is a much better solution, as long as
> we
> >> can get it from spark community
> >> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
> >> celeborn already has spark-integration code with  [spark] scope)
> >>
> >> I also have a question about INDETERMINATE stage recompute, and may need
> >> your help
> >> The rule for INDETERMINATE ShuffleMapStage rerun is reasonable,
> however, I
> >> don't find related logic for INDETERMINATE ResultStage rerun in
> >> DAGScheduler
> >> If INDETERMINATE ShuffleMapStage got entirely recomputed, the
> >> corresponding ResultStage should be entirely recomputed as well, per my
> >> understanding
> >>
> >> I found https://issues.apache.org/jira/browse/SPARK-25342 to rollback a
> >> ResultStage but it was not merged
> >> Do you know any context or related ticket for INDETERMINATE ResultStage
> >> rerun?
> >>
> >> Thanks in advance!
> >>
> >> Regards,
> >> Erik
> >>
> >> On Tue, Oct 17, 2023 at 4:23 AM Mridul Muralidharan 
> >> wrote:
> >>
> >> >
> >> >
> >> > On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:
> >> >
> >> >> Hi Mridu

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-16 Thread Mridul Muralidharan
On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:

> Hi Mridul,
>
> For a),
> DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier()
> <https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1975>
> to decide whether the whole stage needs to be recomputed
> I think we can pass the same information to Celeborn in
> ShuffleManager.registerShuffle()
> <https://github.com/apache/spark/blob/721ea9bbb2ff77b6d2f575fdca0aeda84990cc3b/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala#L39>,
>  since
> RDD in ShuffleDependency contains the RDD object
> It seems Stage.isIndeterminate() is unreadable from ShuffleDependency, but
> luckily rdd is used internally
>
> def isIndeterminate: Boolean = {
>   rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
> }
>
> Relying on the internal implementation is not good, but doable.
> I don't expect spark RDD/Stage implementation changes frequently, and we
> can discuss with Spark community for a RDD isIndeterminate API if they
> change it in the future
>


Only RDD.getOutputDeterministicLevel is publicly exposed,
RDD.outputDeterministicLevel is not and it is private[spark].
While I don't expect changes to this, it is inherently unstable to depend on
it.

Btw, please see the discussion with Sungwoo Park, if Celeborn is
maintaining a reducer oriented view, you will need to recompute all the
mappers anyway - what you might save is the subset of reducer partitions
which can be skipped if it is DETERMINATE.
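
A small toy sketch of that trade-off, under the assumption that each reducer
partition holds data from every mapper. Plan and plan are hypothetical names
and nothing here is Celeborn or Spark API.

```scala
// Toy model only: assumes every reducer partition holds data from all mappers.
object ReducerViewSketch {
  final case class Plan(mappersToRerun: Set[Int], reducerPartitionsToRewrite: Set[Int])

  def plan(
      numMappers: Int,
      numReducers: Int,
      lostReducerPartitions: Set[Int],
      determinate: Boolean): Plan = {
    // With a reducer oriented layout, any lost partition needs input from all mappers.
    val allMappers = (0 until numMappers).toSet
    val allReducers = (0 until numReducers).toSet
    // DETERMINATE output: intact reducer partitions can be kept as they are.
    // INDETERMINATE output: everything has to be regenerated.
    val rewrite = if (determinate) lostReducerPartitions else allReducers
    Plan(allMappers, rewrite)
  }
}
```

With 3 mappers, 4 reducers and reducer partition 2 lost, the determinate case
reruns all 3 mappers but only rewrites partition 2, while the indeterminate
case rewrites all 4.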




>
> for c)
> I also considered a similar solution in celeborn
> Celeborn (LifecycleManager) can get the full picture of remaining shuffle
> data from previous stage attempt and reuse it in stage recompute
> , and the whole process will be transparent to Spark/DagScheduler
>

Celeborn does not have visibility into this - and this is potentially
subject to invasive changes in Apache Spark as it evolves.
For example, I recently merged a couple of changes which would make this
different in master compared to previous versions.
Until the remote shuffle service SPIP is implemented and these are
abstracted out & made pluggable, it will continue to be quite volatile.

Note that the behavior for 3.5 and older is known - since Spark versions
have been released - it is the behavior in master and future versions of
Spark which is subject to change.
So delivering on SPARK-25299 would future proof all remote shuffle
implementations.


Regards,
Mridul



>
> From my perspective, leveraging partial stage recompute and
> remaining shuffle data needs a lot of work in Celeborn.
> I prefer to implement a simple whole stage recompute first, with the interface
> defined with a recomputeAll = true flag, and explore partial stage recompute
> in a separate ticket as a future optimization.
> What do you think?
>
> Regards,
> Erik
>
>
> On Sat, Oct 14, 2023 at 4:50 PM Mridul Muralidharan 
> wrote:
>
>>
>>
>> On Sat, Oct 14, 2023 at 3:49 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> A reducer oriented view of shuffle, especially without replication,
>>> could indeed be susceptible to this issue you described (a single fetch
>>> failure would require all mappers to need to be recomputed) - note, not
>>> necessarily all reducers to be recomputed though.
>>>
>>> Note that I have not looked much into Celeborn specifically on this
>>> aspect yet, so my comments are *fairly* centric to Spark internals :-)
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Sat, Oct 14, 2023 at 3:36 AM Sungwoo Park  wrote:
>>>
>>>> Hello,
>>>>
>>>> (Sorry for sending the same message again.)
>>>>
>>>> From my understanding, the current implementation of Celeborn makes it
>>>> hard to find out which mapper should be re-executed when a partition cannot
>>>> be read, and we should re-execute all the mappers in the upstream stage. If
>>>> we can find out which mapper/partition should be re-executed, the current
>>>> logic of stage recomputation could be (partially or totally) reused.
>>>>
>>>> Regards,
>>>>
>>>> --- Sungwoo
>>>>
>>>> On Sat, Oct 14, 2023 at 5:24 PM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>   Spark will try to minimize the recomputation cost as much as
>>>>> possible.
>>>>> For example, if parent stage was DETERMINATE, it simply needs to
>>>>> recompute the m

Re: Question on Celeborn workers,

2023-10-16 Thread Mridul Muralidharan
With push based shuffle in Apache Spark (magnet), we have both the map
output and the reducer oriented merged output preserved - with the reducer
oriented view chosen by default for reads, and fallback to mapper output
when the reducer output is missing or the read fails. That mitigates this
specific issue for DETERMINATE stages, and only the subset which needs
recomputation is regenerated.
With magnet only smaller blocks are pushed for merged data, so effective
replication is much lower.
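
The read path described above can be sketched roughly as follows. This is a
simplified toy model, not the magnet implementation: ShuffleStore and read
are hypothetical names, and it assumes a merged block either fully covers a
reducer partition or is absent, which is a simplification.

```scala
// Toy model only: hypothetical in-memory store, not Spark's block manager.
object MergedReadSketch {
  final case class ShuffleStore(
      merged: Map[Int, Seq[String]],          // reducerId -> merged records
      byMapper: Map[(Int, Int), Seq[String]]) // (mapperId, reducerId) -> records

  // Prefer the merged (reducer oriented) block; fall back to per-mapper blocks.
  // Returns Left(mappers whose output is missing) when fallback is not possible.
  def read(store: ShuffleStore, reducerId: Int, numMappers: Int): Either[Set[Int], Seq[String]] =
    store.merged.get(reducerId) match {
      case Some(records) => Right(records)
      case None =>
        val missing = (0 until numMappers)
          .filterNot(m => store.byMapper.contains((m, reducerId)))
          .toSet
        if (missing.nonEmpty) Left(missing)
        else Right((0 until numMappers).flatMap(m => store.byMapper((m, reducerId))))
    }
}
```

The Left case is where only the listed mappers would need to be regenerated
for a DETERMINATE stage, rather than the whole upstream stage.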

In the Celeborn deployment we are still testing, we will enable replication
for functional and operational reasons - and will probably move replication
out of the write path to speed it up further.


Regards,
Mridul

On Mon, Oct 16, 2023 at 9:08 AM Keyong Zhou  wrote:

> Hi Sungwoo,
>
> What you are pointing out is very correct. Currently shuffle data
> is distributed across `celeborn.master.slot.assign.maxWorkers` workers,
> which defaults to 1, so I believe the cascading stage rerun will
> definitely happen.
>
> I think setting `celeborn.master.slot.assign.maxWorkers` to a smaller
> value can help, especially in relatively large clusters. Turning on
> replication
> can also help, but it conflicts with the purpose why we do stage rerun
> (i.e. we
> don't want to turn on replication for resource consumption reason).
>
> We didn't think about this before, thanks for pointing that out!
>
> Thanks,
> Keyong Zhou
>
> On Fri, Oct 13, 2023 at 02:22, Sungwoo Park wrote:
>
> > I have a question on how Celeborn distributes shuffle data among Celeborn
> > workers.
> >
> > From our observation, it seems that whenever a Celeborn worker fails or
> > gets killed (in a small cluster of less than 25 nodes), almost every edge
> > is affected. Does this mean that an edge with multiple partitions usually
> > distributes its shuffle data among all Celeborn workers?
> >
> > If this is the case, I think stage recomputation is unnecessary and just
> > re-executing the entire DAG is a better approach. Our current
> > implementation uses the following scheme for stage recomputation:
> >
> > 1. If a read failure occurs for shuffleId #1 for an edge, we pick up a
> new
> > shuffleId #2 for the same edge.
> > 2. The upstream stage re-executes all tasks, but writes the output to
> > shuffleId #2.
> > 3. Tasks in the downstream stage re-try by reading from shuffleId #2.
> >
> > From our experiment, whenever a Celeborn worker fails and a read failure
> > occurs for an edge, the re-execution of the upstream stage usually ends up
> > with another read failure because some part of its input has also been
> > lost. As a result, all upstream stages are eventually re-executed in a
> > cascading manner. In essence, the failure of a Celeborn worker
> invalidates
> > all existing shuffleIds.
> >
> > (This is what we observed with Hive-MR3-Celeborn, but I guess stage
> > recomputation in Spark will have to deal with the same problem.)
> >
> > Thanks,
> >
> > --- Sungwoo
> >
>


Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
On Sat, Oct 14, 2023 at 3:49 AM Mridul Muralidharan 
wrote:

>
> A reducer oriented view of shuffle, especially without replication, could
> indeed be susceptible to this issue you described (a single fetch failure
> would require all mappers to need to be recomputed) - note, not necessarily
> all reducers to be recomputed though.
>
> Note that I have not looked much into Celeborn specifically on this aspect
> yet, so my comments are *fairly* centric to Spark internals :-)
>
> Regards,
> Mridul
>
>
> On Sat, Oct 14, 2023 at 3:36 AM Sungwoo Park  wrote:
>
>> Hello,
>>
>> (Sorry for sending the same message again.)
>>
>> From my understanding, the current implementation of Celeborn makes it
>> hard to find out which mapper should be re-executed when a partition cannot
>> be read, and we should re-execute all the mappers in the upstream stage. If
>> we can find out which mapper/partition should be re-executed, the current
>> logic of stage recomputation could be (partially or totally) reused.
>>
>> Regards,
>>
>> --- Sungwoo
>>
>> On Sat, Oct 14, 2023 at 5:24 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>>   Spark will try to minimize the recomputation cost as much as possible.
>>> For example, if parent stage was DETERMINATE, it simply needs to
>>> recompute the missing (mapper) partitions (which resulted in fetch
>>> failure). Note, this by itself could require further recomputation in the
>>> DAG if the inputs required to compute the parent partitions are missing, and
>>> so on - so it is dynamic.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Oct 14, 2023 at 2:30 AM Sungwoo Park 
>>> wrote:
>>>
>>>> > a) If one or more tasks for a stage (and so its shuffle id) is going
>>>> to be
>>>> > recomputed, if it is an INDETERMINATE stage, all shuffle output will
>>>> be
>>>> > discarded and it will be entirely recomputed (see here
>>>> > <
>>>> https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477
>>>> >
>>>> > ).
>>>>
>>>> If a reducer (in a downstream stage) fails to read data, can we find
>>>> out
>>>> which tasks should recompute their output? From the previous
>>>> discussion, I
>>>> thought this was hard (in the current implementation), and we should
>>>> re-execute all tasks in the upstream stage.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo
>>>>
>>>


Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
A reducer oriented view of shuffle, especially without replication, could
indeed be susceptible to this issue you described (a single fetch failure
would require all mappers to need to be recomputed) - note, not necessarily
all reducers to be recomputed though.
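
To make the point above concrete, here is a minimal sketch. It assumes a
simplified model in which each reduce partition is stored as a single file
that aggregates blocks from every mapper. The names PartitionFile and
mappersToRecompute are hypothetical and are not Celeborn or Spark APIs.

```scala
// Hypothetical model, not Celeborn code: one file per reduce partition,
// each file aggregating the blocks pushed by every mapper.
final case class PartitionFile(reducePartition: Int, contributingMappers: Set[Int])

// Which mappers must re-run to rebuild the lost reduce partitions?
def mappersToRecompute(files: Seq[PartitionFile], lost: Set[Int]): Set[Int] =
  files.filter(f => lost.contains(f.reducePartition)).flatMap(_.contributingMappers).toSet

val numMappers = 4
val numReducers = 3
// Without replication there is a single copy of each reduce partition file,
// and (in this model) every mapper wrote into every one of them.
val files = (0 until numReducers).map(r => PartitionFile(r, (0 until numMappers).toSet))

// Losing only reduce partition 1 implicates all four mappers ...
val rerun = mappersToRecompute(files, lost = Set(1))
// ... while only the reducer that reads partition 1 has to fetch again.
val reducersToRetry = Set(1)
```

In this model the set of mappers to re-run does not shrink with the number of
lost partitions, which is why replication (or finer-grained tracking) matters.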

Note that I have not looked much into Celeborn specifically on this aspect
yet, so my comments are fairly centric to Spark internals :-)

Regards,
Mridul


On Sat, Oct 14, 2023 at 3:36 AM Sungwoo Park  wrote:

> Hello,
>
> (Sorry for sending the same message again.)
>
> From my understanding, the current implementation of Celeborn makes it
> hard to find out which mapper should be re-executed when a partition cannot
> be read, and we should re-execute all the mappers in the upstream stage. If
> we can find out which mapper/partition should be re-executed, the current
> logic of stage recomputation could be (partially or totally) reused.
>
> Regards,
>
> --- Sungwoo
>
> On Sat, Oct 14, 2023 at 5:24 PM Mridul Muralidharan 
> wrote:
>
>>
>> Hi,
>>
>>   Spark will try to minimize the recomputation cost as much as possible.
>> For example, if parent stage was DETERMINATE, it simply needs to
>> recompute the missing (mapper) partitions (which resulted in fetch
>> failure). Note, this by itself could require further recomputation in the
>> DAG if the inputs required to compute the parent partitions are missing, and
>> so on - so it is dynamic.
>>
>> Regards,
>> Mridul
>>
>> On Sat, Oct 14, 2023 at 2:30 AM Sungwoo Park 
>> wrote:
>>
>>> > a) If one or more tasks for a stage (and so its shuffle id) is going
>>> to be
>>> > recomputed, if it is an INDETERMINATE stage, all shuffle output will be
>>> > discarded and it will be entirely recomputed (see here
>>> > <
>>> https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477
>>> >
>>> > ).
>>>
>>> If a reducer (in a downstream stage) fails to read data, can we find out
>>> which tasks should recompute their output? From the previous discussion,
>>> I
>>> thought this was hard (in the current implementation), and we should
>>> re-execute all tasks in the upstream stage.
>>>
>>> Thanks,
>>>
>>> --- Sungwoo
>>>
>>


Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
Hi,

  Spark will try to minimize the recomputation cost as much as possible.
For example, if parent stage was DETERMINATE, it simply needs to recompute
the missing (mapper) partitions (which resulted in fetch failure). Note,
this by itself could require further recomputation in the DAG if the inputs
required to compute the parent partitions are missing, and so on - so it is
dynamic.
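
A rough sketch of that dynamic behaviour, assuming every stage is DETERMINATE
and using a toy model of the DAG. Stage and toRecompute are hypothetical
names for illustration only; the real logic lives in DAGScheduler and is
considerably more involved.

```scala
// Hypothetical model of a DETERMINATE-only DAG; none of these names are Spark APIs.
final case class Stage(id: Int, numPartitions: Int, parents: Seq[Int])

// Returns, per stage id, the map partitions that have to be recomputed so that
// `stageId` has all of its output again. `available` holds the map outputs
// that still exist for each stage.
def toRecompute(
    stageId: Int,
    stages: Map[Int, Stage],
    available: Map[Int, Set[Int]]): Map[Int, Set[Int]] = {
  val stage = stages(stageId)
  val all: Set[Int] = (0 until stage.numPartitions).toSet
  val missing = all -- available.getOrElse(stageId, Set.empty[Int])
  if (missing.isEmpty) Map.empty
  else {
    // Recomputing any partition of this stage reads its parents' shuffle output,
    // so the parents' own missing partitions have to be rebuilt first.
    val fromParents = stage.parents.flatMap(p => toRecompute(p, stages, available))
    (fromParents :+ (stageId -> missing)).toMap
  }
}

val stages = Map(0 -> Stage(0, 4, Nil), 1 -> Stage(1, 4, Seq(0)))
// Stage 1 lost map output 2; its parent, stage 0, lost map output 3.
val available = Map(0 -> Set(0, 1, 2), 1 -> Set(0, 1, 3))
val plan = toRecompute(1, stages, available)
```

Only the missing partitions are scheduled, but the walk up the DAG can add
work in ancestor stages whenever their outputs are gone as well.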

Regards,
Mridul

On Sat, Oct 14, 2023 at 2:30 AM Sungwoo Park  wrote:

> > a) If one or more tasks for a stage (and so its shuffle id) is going to
> be
> > recomputed, if it is an INDETERMINATE stage, all shuffle output will be
> > discarded and it will be entirely recomputed (see here
> > <
> https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477
> >
> > ).
>
> If a reducer (in a downstream stage) fails to read data, can we find out
> which tasks should recompute their output? From the previous discussion, I
> thought this was hard (in the current implementation), and we should
> re-execute all tasks in the upstream stage.
>
> Thanks,
>
> --- Sungwoo
>


Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-13 Thread Mridul Muralidharan
Hi,

  So there are a couple of things here based on whether the stages are
DETERMINATE or INDETERMINATE.
The exit I added to my example was to trigger some of these cases, and you
can come up with more involved scenarios where this would apply :-)

At a high level, we have the following:

a) If one or more tasks for a stage (and so its shuffle id) is going to be
recomputed, if it is an INDETERMINATE stage, all shuffle output will be
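discarded and it will be entirely recomputed (see here
<https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1477>
).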
b) Conversely, if it was a DETERMINATE stage, only the missing partitions
for that shuffle id will be recomputed.
c) This also impacts whether we can use a task's output if it was for an
earlier stage attempt or not - we can use it for an older attempt if
DETERMINATE, and can't for INDETERMINATE (see here
,
here

).
(We can skip the more involved cases, should not be relevant for Celeborn
IMO).

For a given partition P, for a spark-shuffle-id, MapOutputTracker state is
updated by DAGScheduler keeping these in mind.
Assuming I did not misunderstand the proposal, if each stage attempt
corresponds to a celeborn-shuffle-id, such that we have spark-shuffle-id ->
List[celeborn-shuffle-id], we will need to take these cases above into
account as well when a "reducer" is requesting for spark-shuffle-id,
partition P output (which celeborn-shuffle-id it maps to).
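
As a minimal sketch of what that lookup could look like, assuming the mapping
is kept as an ordered list of attempts per spark-shuffle-id. Attempt, resolve
and the Determinism values are hypothetical names, not the Celeborn
LifecycleManager API, and the rules are a simplification of cases (a) to (c).

```scala
// Hypothetical bookkeeping, not the Celeborn LifecycleManager API.
sealed trait Determinism
case object Determinate extends Determinism
case object Indeterminate extends Determinism

// One entry per stage attempt that wrote this Spark shuffle, oldest first.
final case class Attempt(celebornShuffleId: Int, partitionsWritten: Set[Int])

// Which celeborn-shuffle-id should serve partition P of this spark-shuffle-id?
def resolve(attempts: List[Attempt], level: Determinism, partition: Int): Option[Int] =
  level match {
    // (a) INDETERMINATE: earlier attempts are discarded, only the latest one counts.
    case Indeterminate =>
      attempts.lastOption.filter(_.partitionsWritten.contains(partition)).map(_.celebornShuffleId)
    // (b)/(c) DETERMINATE: output of an older attempt is still usable, so take
    // the most recent attempt that actually holds the partition.
    case Determinate =>
      attempts.reverse.find(_.partitionsWritten.contains(partition)).map(_.celebornShuffleId)
  }

// Attempt 0 wrote partitions 0..2, the retry (celeborn id 1) only rewrote partition 1.
val attempts = List(Attempt(0, Set(0, 1, 2)), Attempt(1, Set(1)))
val determinateP0 = resolve(attempts, Determinate, 0)
val determinateP1 = resolve(attempts, Determinate, 1)
val indeterminateP0 = resolve(attempts, Indeterminate, 0)
```

A None result here stands for "no valid output yet", i.e. the reader would
have to wait for, or trigger, recomputation.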


Regards,
Mridul





On Thu, Oct 12, 2023 at 3:14 AM Erik fang  wrote:

> Hi Mridul,
>
> sorry for the late reply
>
> Per my understanding, the key point about Spark shuffleId and
> StageId/StageAttemptId is,
> shuffleId is assigned at ShuffleDependency creation time and bound to
> the RDD/ShuffleDependency, while StageId/StageAttemptId is assigned and
> changes at Job execution time
> In the RDD example, there are two shuffle data with id 0 and 1,  and those
> shuffle data are expected to be accessed(read/write) with the correct
> shuffle id
> and we can do that in celeborn with shuffle id mapping
>
> I made some small modifications to the example to avoid the exit, and grabbed
> some logs from Spark DAGScheduler/Celeborn LifecycleManager to help explain
>
> import org.apache.spark.TaskContext
>
> val rdd1 = sc.parallelize(0 until 1, 20).map(v => (v, v)).groupByKey()
> val rdd2 = rdd1.mapPartitions { iter =>
>   val tc = TaskContext.get()
>   println("print stageAttemptNumber " + tc.stageAttemptNumber())
>   iter
> }
>
> rdd2.count()
> rdd2.map(v => (v._1, v._2)).groupByKey().count()
>
> *// DAGScheduler starts job 0, submit ShuffleMapStage 0 for spark
> shuffle-0*
> 23/10/12 12:58:14 INFO DAGScheduler: Registering RDD 1 (map at :24)
> as input to shuffle 0
> 23/10/12 12:58:14 INFO DAGScheduler: Got job 0 (count at :25)
> with 20 output partitions
> 23/10/12 12:58:14 INFO DAGScheduler: Final stage: ResultStage 1 (count at
> :25)
> 23/10/12 12:58:14 INFO DAGScheduler: Parents of final stage:
> List(ShuffleMapStage 0)
> 23/10/12 12:58:14 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
> 0)
> 23/10/12 12:58:14 DEBUG DAGScheduler: submitStage(ResultStage 1
> (name=count at :25;jobs=0))
> 23/10/12 12:58:14 DEBUG DAGScheduler: missing: List(ShuffleMapStage 0)
> 23/10/12 12:58:14 DEBUG DAGScheduler: submitStage(ShuffleMapStage 0
> (name=map at :24;jobs=0))
> 23/10/12 12:58:14 DEBUG DAGScheduler: missing: List()
> 23/10/12 12:58:14 INFO DAGScheduler: Submitting ShuffleMapStage 0
> (MapPartitionsRDD[1] at map at :24), which has no missing parents
> 23/10/12 12:58:14 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 0)
>
> *// LifecycleManager received GetShuffleId request from ShuffleMapStage 0
> with spark_shuffleId 0, stage attemptId 0, and generate celeborn_shuffleId
> 0 for write*
> 23/10/12 12:58:16 DEBUG LifecycleManager: Received GetShuffleId
> request,appShuffleId 0 maxAttemptNum 4 attemptId 0 isShuffleWriter true
> 23/10/12 12:58:16 DEBUG LifecycleManager: Received RegisterShuffle
> request, 0, 20, 20.
> 23/10/12 12:58:16 INFO LifecycleManager: New shuffle request, shuffleId 0,
> partitionType: REDUCE numMappers: 20, numReducers: 20.
>
> *// ShuffleMapStage finish, Submit ResultStage 1*
> 23/10/12 12:58:18 INFO LifecycleManager: Received StageEnd request,
> shuffleId 0.
> 23/10/12 12:58:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
> 23/10/12 12:58:18 INFO DAGScheduler: ShuffleMapStage 0 (map at :24)
> finished in 4.246 s
> 23/10/12 12:58:18 INFO DAGScheduler: looking for newly runnable stages
> 23/10/12 12:58:18 INFO DAGScheduler: running: Set()
> 23/10/12 12:58:18 I

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-09-23 Thread Mridul Muralidharan
Hi,

  I am not yet very familiar with Celeborn, so will restrict my notes on
the proposal in context to Apache Spark:

a) For Option 1, there is SPARK-25299 - which was started a few years back.
Unfortunately, the work there has stalled: but if there is interest in
pushing that forward, I can help shepherd the contributions !
Full disclosure, the earlier proposal might be fairly outdated, and will
involve a bit of investment to restart that work.

b) On the ability to reuse a previous mapper output/minimize cost - that
depends on a stage's DeterministicLevel.
DETERMINATE mapper stage output can be reused, and not others - and there
is a lot of nuance around how DAGScheduler handles it.
A lot of it has to do with data correctness (see SPARK-23243 and the PRs
linked there for more in-depth analysis of this) - and this has kept
evolving in the years since.
DAGScheduler directly updates MapOutputTracker for a few cases - which
includes for this.

c) As a follow up to (b) above, even though MapOutputTracker is part of
SparkEnv, and so 'accessible', I would be careful modifying its state
directly outside of DAGScheduler.

d) The computation for "celeborn shuffleId" would not work - since
spark.stage.maxConsecutiveAttempts is for consecutive failures for a single
stage in a job.
The same shuffle id can be computed by different stages across jobs (for
example: very common with Spark SQL AQE btw).
A simple example here [1]


Other than Option 1, the rest look like a tradeoff to varying degrees.
I am not familiar enough with Celeborn to give good suggestions yet though.


All the best in trying to solve this issue - looking forward to updates !

Regards,
Mridul

[1]
Run with './bin/spark-shell  --master 'local-cluster[4, 3, 1024]'' or
yarn/k8s

import org.apache.spark.TaskContext

val rdd1 = sc.parallelize(0 until 1, 20).map(v => (v, v)).groupByKey()
val rdd2 = rdd1.mapPartitions { iter =>
  val tc = TaskContext.get()
  if (0 == tc.partitionId() && tc.stageAttemptNumber() < 1) {
    System.exit(0)
  }
  iter
}

rdd2.count()
rdd2.map(v => (v._1, v._2)).groupByKey().count()

For both the jobs, the same shuffle id is used for the first shuffle.
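
To illustrate the collision, here is a small sketch. The formula in derivedId
is an assumption made purely for illustration (the proposal document may
define the id differently), and the job and stage ids in the example are
hypothetical values rather than ones taken from an actual run.

```scala
// Assumed formula, for illustration only; not taken from the proposal document.
def derivedId(sparkShuffleId: Int, stageAttemptNumber: Int, maxAttempts: Int): Int =
  sparkShuffleId * maxAttempts + stageAttemptNumber

final case class MapStageRun(jobId: Int, stageId: Int, stageAttemptNumber: Int, sparkShuffleId: Int)

val maxAttempts = 4
// Two different stages, in two different jobs, both computing Spark shuffle 0.
// Each starts again at stage attempt 0 (job and stage ids are hypothetical).
val runs = Seq(
  MapStageRun(jobId = 0, stageId = 0, stageAttemptNumber = 0, sparkShuffleId = 0),
  MapStageRun(jobId = 1, stageId = 2, stageAttemptNumber = 0, sparkShuffleId = 0))

val ids = runs.map(r => derivedId(r.sparkShuffleId, r.stageAttemptNumber, maxAttempts))
// Both runs end up with the same derived id, so they would write to the same place.
val collides = ids.distinct.size < runs.size
```

Any scheme keyed only on (shuffle id, stage attempt number) has this problem,
because the attempt number restarts for each new stage that computes the
same shuffle.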



On Fri, Sep 22, 2023 at 10:48 AM Erik fang  wrote:

> Hi folks,
>
> I have a proposal to implement Spark stage resubmission to handle shuffle
> fetch failure in Celeborn
>
>
> https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
>
> please have a look and let me know what you think
>
> Regards,
> Erik
>


Re: [DISCUSSION] Support memory file storage.

2023-09-21 Thread Mridul Muralidharan
Thanks for clarifying Ethan, sounds good to me !

Regards,
Mridul

On Wed, Sep 20, 2023 at 8:58 PM Ethan Feng  wrote:

> Hi Mridul,
>
> Thank you for your email and your positive feedback on the proposed
> enhancement to Celeborn. I'm glad you find it promising.
>
> To address your queries:
>
> a) The proposed enhancement is intended to act as a storage tier, not
> as a cache. However, it may have certain elements of both. Celeborn
> currently won't store a whole shuffle file in memory and requires a
> shuffle file to be written to disks or HDFS before the client can
> read. This proposal will allow the client to read a shuffle file from
> the worker's memory directly. I hope this clarifies things for you.
>
> b) While your suggestion for a tiered storage layer is interesting, it
> is a superset of this proposal. As you can see there is an
> issue(https://github.com/apache/incubator-celeborn/issues/146).
> Celeborn treats a shuffle partition as a shuffle file instead of
> segments so a shuffle partition will not be distributed to multiple
> storage tiers. There will be another proposal to discuss how Celeborn
> will move existing shuffle files to different storage tiers.
>
> c) As mentioned above, the enhancement is intended to act as a storage
> tier that's why I explained the details about how it is handled
> internally.
>
> Thanks again for your email. Please let me know if you have any
> further questions or concerns.
>
> Regards,
> Ethan
>
> On Thu, Sep 21, 2023 at 1:09 AM Mridul Muralidharan wrote:
> >
> > Hi,
> >
> >   This should be a nontrivial improvement to Celeborn imo, thanks Ethan !
> >
> > I had a few queries:
> >
> > a) Are we viewing this enhancement as a cache or as a tiered storage
> layer ?
> > When going over it, I felt the proposal might be doing both - though
> > leaning more as a cache, but wanted to get clarity.
> >
> > b) If we are modelling it as a tiered storage layer, it would be good to
> > also think about what the right abstractions should be and not special
> case
> > it just for memory.
> > For example:
> > Memory -> NVME/SSD -> Spinning Disk -> HDFS/S3
> > (With one or more being missing in a deployment)
> >
> > This would unify the way we handle evictions from one level to the next
> > with a tiered view of the storage layer.
> > Complexity of the implementation is definitely a consideration here
> though.
> >
> > Note, this might be out of scope for this proposal and work for the
> future
> > as well - wanted to get your thoughts if it was considered !
> >
> > c) If modelling as a cache, we should change the abstractions in the
> > proposal slightly and hide the details behind the cache implementation.
> > Read and write path would not need to worry about how it is handled
> > internally.
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >
> >
> > On Tue, Sep 19, 2023 at 10:27 PM Ethan Feng 
> wrote:
> >
> > > Hello Celeborn community,
> > >
> > > I have a proposal to support memory file storage in Celeborn:
> > >
> > >
> https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing
> > >
> > > Would really appreciate feedback from the community on this proposal.
> > >
> > >
> > > Thanks
> > > Ethan
> > >
>


Re: [DISCUSSION] Support memory file storage.

2023-09-20 Thread Mridul Muralidharan
Hi,

  This should be a nontrivial improvement to Celeborn imo, thanks Ethan !

I had a few queries:

a) Are we viewing this enhancement as a cache or as a tiered storage layer ?
When going over it, I felt the proposal might be doing both - though
leaning more as a cache, but wanted to get clarity.

b) If we are modelling it as a tiered storage layer, it would be good to
also think about what the right abstractions should be and not special case
it just for memory.
For example:
Memory -> NVME/SSD -> Spinning Disk -> HDFS/S3
(With one or more being missing in a deployment)

This would unify the way we handle evictions from one level to the next
with a tiered view of the storage layer.
Complexity of the implementation is definitely a consideration here though.

Note, this might be out of scope for this proposal and work for the future
as well - wanted to get your thoughts if it was considered !

c) If modelling as a cache, we should change the abstractions in the
proposal slightly and hide the details behind the cache implementation.
Read and write path would not need to worry about how it is handled
internally.
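
For (b), a minimal sketch of what a unified tiered view could look like,
assuming whole shuffle files are moved between tiers and the oldest file is
evicted first. Tier and TieredStore are hypothetical names and the eviction
policy is only a placeholder, not something taken from the proposal.

```scala
// Hypothetical abstraction, not part of Celeborn.
final case class Tier(name: String, capacityBytes: Long)

final class TieredStore(tiers: Vector[Tier]) {
  // For every tier, the shuffle files it holds (fileId -> size), in insertion order.
  private val contents =
    tiers.map(_ => scala.collection.mutable.LinkedHashMap.empty[String, Long])

  private def used(level: Int): Long = contents(level).values.sum

  // Place a file at `level`, pushing the oldest files one level down until it fits.
  def put(fileId: String, size: Long, level: Int = 0): Unit = {
    require(level < tiers.length, s"no tier left for $fileId")
    if (size > tiers(level).capacityBytes) put(fileId, size, level + 1)
    else {
      while (used(level) + size > tiers(level).capacityBytes) {
        val (oldId, oldSize) = contents(level).head
        contents(level).remove(oldId)
        put(oldId, oldSize, level + 1) // same eviction path for every tier
      }
      contents(level).put(fileId, size)
    }
  }

  def tierOf(fileId: String): Option[String] =
    tiers.indices.find(i => contents(i).contains(fileId)).map(tiers(_).name)
}

// Memory -> SSD -> HDFS, with any level free to be absent in a deployment.
val store = new TieredStore(
  Vector(Tier("memory", 100L), Tier("ssd", 1000L), Tier("hdfs", Long.MaxValue)))
store.put("shuffle-0-part-0", 60L)
store.put("shuffle-0-part-1", 60L) // does not fit next to part-0, which moves to ssd
```

The read and write paths only see put and tierOf; which tier backs a file, and
when it is demoted, stays behind the abstraction.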


Regards,
Mridul





On Tue, Sep 19, 2023 at 10:27 PM Ethan Feng  wrote:

> Hello Celeborn community,
>
> I have a proposal to support memory file storage in Celeborn:
>
> https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing
>
> Would really appreciate feedback from the community on this proposal.
>
>
> Thanks
> Ethan
>


Re: [DISCUSS] Support authentication in Celeborn

2023-09-18 Thread Mridul Muralidharan
  To add to what Chandni mentioned, using self-signed certificates and
trusting them is another (though less secure) practice some deployments
leverage.
This ensures encryption over the wire, but does not allow for clients to
validate identity of the Celeborn server components (so potentially liable
to DNS spoofing, MITM attacks, etc).
This might or might not be acceptable to deployments.

Note that the proposal describes securing with TLS as strongly recommended, but
not mandatory.

Regards,
Mridul


On Mon, Sep 18, 2023 at 11:37 AM Chandni Singh 
wrote:

> Hi Zhongqiang,
> Yes, you are right. TLS implementation relies on digital certificates which
> are usually obtained from a trusted CA.
> In my experience, many organizations establish their own internal CAs to
> issue certificates for their internal networks, thus acting as trusted
> issuers for various services within the organization.
>
> In scenarios where an internal CA infrastructure is not available and we
> want to avoid a public trusted CA because they are paid, services may
> resort to using self-signed certificates. To establish trust in these
> self-signed certificates, clients must be explicitly configured to
> recognize them — either by installing them into the client's native trust
> store or by using a custom trust store that includes these certificates.
> These certificates are securely distributed to all relevant client-hosting
> machines using an out-of-band method. Once the trust store is properly
> configured, the client-side TLS settings can be adjusted to reference this
> trust store, thereby ensuring secure communication.
>
> Chandni
>
> On Mon, Sep 18, 2023 at 5:18 AM Zhongqiang Chen wrote:
>
> >
> >
> >
> >
> >
> > Hi Chandni,
> >
> > I have a question about how to implement TLS handshake and how to obtain
> > the certificate?
> > Based on my understanding, TLS implementation generally relies on digital
> > certificates which are obtained from a trusted certificate authority
> (CA).
> > It requires some money to obtain a CA certificate.
> > Thanks,
> > Zhongqiang Chen
> >
> >
> >
> > At 2023-09-15 06:34:02, "Chandni Singh"  wrote:
> > >Hello Celeborn community,
> > >
> > >We have a proposal to add authentication to Celeborn:
> > >
> >
> https://docs.google.com/document/d/1D1U2COYhS3ob7l0t2WghRhBk_Fci9RGx-2FBXA3nvXk/edit#heading=h.m97qw1fpl5kv
> > >
> > >Would really appreciate feedback from the community on this proposal.
> > >
> > >Please let me know if there is a particular format that the Celeborn
> > >community follows for proposals and I will convert it into that format.
> > >
> > >Thank you
> > >Chandni
> >
>


Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc0

2023-08-31 Thread Mridul Muralidharan
+1

Signatures, digests, license, etc check out fine.
Checked out tag and built/tested with -Pspark-3.1

Regards,
Mridul


On Thu, Aug 31, 2023 at 11:35 AM Cheng Pan  wrote:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.3.1-incubating-rc0
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc0
>
> The git commit hash:
> b3992274e207959125d8784d9b61a6e8043612fc
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc0
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1037
>
> Fingerprint of the PGP key release artifacts are signed with:
> 8FC8075E1FDC303276C676EE8001952629BCC75D
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Cheng Pan
>


Re: [DISCUSS] Allow external contributors to run CI without approval

2023-06-16 Thread Mridul Muralidharan
Agree, +1

Regards,
Mridul

On Fri, Jun 16, 2023 at 9:16 AM Cheng Pan  wrote:

> +1 for "only requires approval first time"
>
> On Fri, Jun 16, 2023 at 5:48 PM Keyong Zhou wrote:
>
> > +1
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Fri, Jun 16, 2023 at 4:27 PM Ethan Feng wrote:
> >
> > > Recent moves by Apache Infra have changed the policy on GitHub Actions
> > from
> > > "Only requires approval first time" to "Requires approval every time".
> > >
> > > I think this is not friendly for getting folks involved in
> > > the project and this increased the cost for committers to process the
> > > pull requests.
> > >
> > > Please respond to this thread if you are in support of going back to
> > > "Only requires approval the first time" or if you don't believe this
> is a
> > > good idea please respond as well.
> > >
> > > Thanks,
> > > Ethan Feng
> > >
> >
>