from:"keyong zhou"

Re: [VOTE] Release Apache Celeborn 0.4.2-rc1

2024-07-25 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- signatures are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.4.2-bin.tgz.asc
gpg --verify apache-celeborn-0.4.2-source.tgz.asc
```
- checksums are good.
```
sha512sum --check apache-celeborn-0.4.2-bin.tgz.sha512
sha512sum --check apache-celeborn-0.4.2-source.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- build success from source code (macOS).
```
./build/make-distribution.sh --sbt-enabled --release
```
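
For anyone repeating the checksum step on a platform without `sha512sum`, here is a minimal Python sketch of the same check. It assumes the `.sha512` file uses the `<hex digest>  <file name>` layout that `sha512sum --check` reads, which may not hold for every artifact; `sha512_of` and `verify_sha512` are hypothetical helper names, not part of any Celeborn release tooling.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha512(checksum_file: Path) -> bool:
    """Check every '<hex digest>  <file name>' line in a .sha512 file.

    Assumed layout: the one sha512sum writes. File names are resolved
    relative to the directory that holds the checksum file.
    """
    ok = True
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # '*' marks binary mode in sha512sum output
        actual = sha512_of(checksum_file.parent / name)
        ok = ok and (actual == expected.lower())
    return ok
```

Calling `verify_sha512(Path("apache-celeborn-0.4.2-bin.tgz.sha512"))` from the download directory would then return `True` only if every listed file matches its digest.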

Regards,
Keyong Zhou


Fu Chen wrote on Mon, Jul 22, 2024 at 18:14:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn 0.4.2-rc1
>
>
> The git tag to be voted upon:
> https://github.com/apache/celeborn/releases/v0.4.2-rc1
>
>
> The git commit hash:
> d3639bb3c3d4cb2d224f1d6542e2d20d3047c76e
>
> Source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/celeborn/v0.4.2-rc1/
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1079
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes is reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Thanks,
> Fu Chen
>

Re: [ANNOUNCE] New Celeborn PMC Member: Nicholas Jiang

2024-07-23 Thread Keyong Zhou

Congrats!

Regards,
Keyong Zhou

angers zhu wrote on Tue, Jul 23, 2024 at 18:09:

> Congrats!
>
>
> Thanks
> Angerszh
>
> Cheng Pan wrote on Tue, Jul 23, 2024 at 18:01:
>
> > Congrats!
> >
> > Thanks,
> > Cheng Pan
> >
> > On Tue, Jul 23, 2024 at 5:34 PM rexxiong  wrote:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Project Management Committee (PMC) for Apache Celeborn
> > > has invited Nicholas Jiang to become a PMC member and we are pleased
> > > to announce that he has accepted.
> > >
> > > A PMC member helps manage and guide the direction of the project.
> > > We look forward to seeing more of his interactions with the community
> > > in the future.
> > >
> > > Please join me in congratulating Nicholas!
> > >
> > >
> > > Thanks,
> > > Jiashu Xiong
> >
>

Re: [ANNOUNCE] New Celeborn Committer: Fei Wang

2024-07-22 Thread Keyong Zhou

Congratulations!

Regards,
Keyong Zhou

angers zhu wrote on Tue, Jul 23, 2024 at 12:07:

> Congratulations!
>
> Shaoyun Chen wrote on Tue, Jul 23, 2024 at 11:15:
>
> > Congratulations!
> >
> > Cheng Pan wrote on Tue, Jul 23, 2024 at 11:05:
> > >
> > > Hi Celeborn Community,
> > >
> > > The Project Management Committee (PMC) for Apache Celeborn
> > > has invited Fei Wang to become a committer and we are pleased
> > > to announce that he has accepted.
> > >
> > > Being a committer enables easier contribution to the
> > > project since there is no need to go via the patch
> > > submission process. This should enable better productivity.
> > > A PMC member helps manage and guide the direction of the project.
> > >
> > > Please join me in congratulating Fei!
> > >
> > > Thanks,
> > > Cheng Pan
> >
>

Re: [VOTE] Release Apache Celeborn 0.4.2-rc0

2024-07-20 Thread Keyong Zhou

Hi Fu,

I wonder why all occurrences of ${project.version} were changed to 0.4.2 in
[1], instead of just changing
the definition of  to 0.4.2 as [2] does?

Regards,
Keyong Zhou

Fu Chen wrote on Wed, Jul 17, 2024 at 00:23:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn 0.4.2-rc0
>
>
> The git tag to be voted upon:
> https://github.com/apache/celeborn/releases/v0.4.2-rc0
>
>
> The git commit hash:
> 2181dd7d17711d529aa618e25a1f558652e480c7
>
> Source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/celeborn/v0.4.2-rc0/
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1078
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes is reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Thanks,
> Fu Chen
>

Re: [DISCUSS] CIP-10: Introduce Celeborn Chaos Testing Framework

2024-07-14 Thread Keyong Zhou

Thanks for the proposal!

The chaos framework will be very useful for Celeborn. There are two points I
think are important:
1. We need to add correctness checks to the framework; correctness is the
most important thing.
2. The framework should not intrude into the common code.

Regards,
Keyong Zhou

Nicholas Jiang wrote on Fri, Jul 12, 2024 at 14:29:

> Hey Mridul,
>
> Thanks for your feedback. The ability to reproduce problematic cases by
> capturing logs of events that have been triggered can maximize the value of
> the chaos testing framework. Celeborn chaos testing not only needs to verify
> the reliability of the service while simulating various abnormal events, but
> also needs to reproduce problem cases to troubleshoot the root cause of
> Celeborn problems. I would like to take this reproduction feature
> into consideration for this CIP.
>
> Best Regards,
> Nicholas Jiang
>
> On 2024/07/10 09:35:52 Mridul Muralidharan wrote:
> > Hi,
> >
> >   This is a great idea - and would go a long way in flushing out bugs and
> > issues - and improving the overall robustness of Celeborn !
> > It would also be good to have:
> > a) Capture a (replay) log of all events which were triggered.
> > b) Ability to 'replay' the log and deterministically reach the same
> > state.
> >
> > This will allow us to identify failure cases with the testing framework -
> > while allowing developers to deterministically reproduce the identified
> > state.
> >
> > (Hopefully I did not miss this in the proposal).
> >
> > Regards,
> > Mridul
> >
> >
> > On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang wrote:
> >
> > > Hello community,
> > >
> > > It's been a while since the discussion on the Celeborn chaos testing
> > > framework. The main process of Celeborn chaos testing includes:
> > >
> > > 1. Defining a test plan to describe the types of events, the order in
> > > which events are triggered, and their duration. Event types include node
> > > anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> > > 2. The client submits the plan to the scheduler.
> > > 3. The scheduler sends operations to each node's runner according to the
> > > plan description.
> > > 4. The runner is responsible for executing the operations and reporting
> > > the current status of the node.
> > > 5. Before triggering an operation, the scheduler deduces the result of
> > > this event. If it leads to the inability to meet the minimum runnable
> > > environment for RSS, the event is rejected.
> > >
> > > Do you have any thoughts or questions about this chaos testing framework?
> > > Feedback is welcome to further ensure the reliability of Celeborn through
> > > chaos testing.
> > >
> > > Regards,
> > > Nicholas Jiang
> > >
> > > At 2024-07-03 05:20:57, "Nicholas Jiang" wrote:
> > > >Hi all,
> > > >
> > > >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> > > >Testing Framework[1].
> > > >
> > > >A chaos testing framework is designed to simulate unpredictable and
> > > >adverse conditions in distributed systems to validate their robustness and
> > > >resilience. This proposal aims to simulate various anomalies and test the
> > > >stability of Celeborn in distributed environments via chaos testing.
> > > >
> > > >Looking forward to everyone's feedback and suggestions. Thank you!
> > > >
> > > >[1]
> > > >https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> > > >
> > > >Regards,
> > > >Nicholas Jiang
> > >
> >
>
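
The main process quoted in this thread, in particular step 5 and Mridul's replay-log suggestion, can be sketched roughly as follows. This is only an illustration under stated assumptions: `Event`, `ChaosScheduler`, `min_alive_workers`, and the `kill_worker`/`restore_worker` event types are hypothetical names, and the "minimum runnable environment" is reduced here to a minimum number of alive workers, which the CIP may define differently.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event type: "kill_worker" or "restore_worker"
    target: str  # hypothetical worker id


@dataclass
class ChaosScheduler:
    workers: Set[str]        # all known workers
    min_alive_workers: int   # assumed stand-in for the "minimum runnable environment"
    down: Set[str] = field(default_factory=set)
    replay_log: List[Event] = field(default_factory=list)  # accepted events, in order

    def alive_after(self, event: Event) -> int:
        """Deduce how many workers would be alive if the event were applied."""
        down = set(self.down)
        if event.kind == "kill_worker":
            down.add(event.target)
        elif event.kind == "restore_worker":
            down.discard(event.target)
        return len(self.workers - down)

    def submit(self, event: Event) -> bool:
        """Apply the event unless it would break the minimum environment."""
        if self.alive_after(event) < self.min_alive_workers:
            return False  # rejected, as described in step 5
        if event.kind == "kill_worker":
            self.down.add(event.target)
        elif event.kind == "restore_worker":
            self.down.discard(event.target)
        self.replay_log.append(event)
        return True


def replay(workers: Set[str], min_alive: int, log: List[Event]) -> ChaosScheduler:
    """Re-run a captured log to reach the same scheduler state deterministically."""
    scheduler = ChaosScheduler(set(workers), min_alive)
    for event in log:
        scheduler.submit(event)
    return scheduler
```

Keeping the accepted events in an ordered log is what makes the replay deterministic in this sketch; a real framework would also need to record timing and node-side results.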

Re: Jira version update

2024-07-04 Thread Keyong Zhou

Thanks, Mridul, for pointing this out. I just marked 0.5.0 as released :)

Regards,
Keyong Zhou

Mridul Muralidharan wrote on Fri, Jul 5, 2024 at 03:13:

> Hi,
>
>   While updating an issue manually, I noticed that 0.5.0 is still listed
> as an unreleased version in Jira.
> Given the 0.5 release, should we get it updated?
>
> Regards,
> Mridul
>

Re: [VOTE] CIP-9: Celeborn RESTful API Refine

2024-07-02 Thread Keyong Zhou

+1

Regards,
Keyong Zhou

Fei Wang wrote on Wed, Jul 3, 2024 at 02:07:

> Hi all,
>
> Thanks for all the feedback about the CIP-9 Celeborn RESTful API Refine
> [1].
> The discussion thread is here [2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or insufficient votes.
>
> [1] https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing
> [2] https://lists.apache.org/thread/mng0pxst0z4gc9gs7mc1frz4pzpk70jb
>
> Best Regards,
> Fei Wang
>

Re: Re:[DISCUSS] Celeborn RESTful API Refine Proposal

2024-07-01 Thread Keyong Zhou

Hi Fei,

Sorry for the late reply. I reviewed the design doc and it looks good to me
:)

In the doc you mentioned that CIP-7 Celeborn CLI[1] relies on the REST API.
Since your design will preserve the current API, I think there will be no
conflicts.

I suggest creating a CIP for this proposal and starting a vote on it,
like [2][3].

Regards,
Keyong Zhou

[1] https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI
[2] https://lists.apache.org/thread/xjh8z2kszq0kwj5bdz2bh3b1sotv593p
[3] https://lists.apache.org/thread/bx58h25poypq0znolkb8vlhop4bw1x81

Fei Wang wrote on Tue, Jul 2, 2024 at 04:53:

> Hi celeborn community,
> I hope this message finds you well.
>
> I would like to extend my gratitude to those who have taken the time to
> review and discuss this proposal. Your insights and feedback are
> invaluable to the progression of this project.
>
> As we have not received further discussions for some time, I believe it is
> appropriate to move forward with the next step in the process.
>
> Best Regards,
> Fei Wang
>
> On 2024/06/25 19:18:27 Fei Wang wrote:
> > Hi,
> >
> > Thanks Nicholas for the comments.
> >
> > > 1. Could you summarize all /api/v1 interfaces in the Public Interfaces
> > > section? Meanwhile, could you also add the definition of the parameter and
> > > return type classes, like the fields of the POJO?
> >
> > I have updated the docs and completed the parameters and response POJO.
> >
> > > 2. Could some interfaces be merged into one interface, like
> > > /${version}/workers/lost, /${version}/workers/excluded and
> > > /${version}/workers/shutdown? Should the refined REST API map to the
> > > original interfaces one by one?
> >
> > Thanks, we can merge these APIs into `/${version}/workers`. The response
> > POJO:
> > Name                  Type
> > workers               List[WorkerData]
> > lostWorkers           List[WorkerData]
> > excludedWorkers       List[WorkerData]
> > shutdownWorkers       List[WorkerData]
> > decommissionWorkers   List[WorkerData]
> >
> > And I will map the APIs one by one.
> >
> >
> > > 3. Could this migration plan be described in more detail? For example, the
> > > original interfaces return strings, but the refined REST API returns a
> > > POJO. How does the user migrate the REST API?
> >
> > The migration of a REST API primarily involves mapping the old API to
> > the new API. Previously, the old API returned a string, whereas the new API
> > returns a POJO (Plain Old Java Object) that includes all the fields
> > relevant to the content of the previous string response. Therefore, the
> > content of the new API's response is essentially consistent with the
> > previous response results.
> > In the migration documentation, I will detail the API mappings as well
> > as the fields in the new response.
> >
> >
> > > 4. Some interfaces like /${version}/exit do not mention the HTTP
> > > method. Could you check the HTTP methods of all the refined REST APIs?
> > > Meanwhile, is there any standard or pattern for naming paths? For example,
> > > /${version}/workers/events not only lists the event info for the GET
> > > method, but also supports sending events with the POST method without an
> > > operation name in the path.
> >
> > Thanks for the reminder, I have added the method and parameters for all the
> > APIs.
> >
> > Using different HTTP methods to correspond to different actions on the
> > same API endpoint is a common practice in RESTful design. This approach is
> > in line with the principles of REST, where the resource path remains
> > consistent while the HTTP method indicates the intended operation. And I
> > have observed a similar design pattern in the Apache Kyuubi project.
> >
> >
> >
> > > 5. /${version}/conf/dynamic uses three parameters without any POJO
> > > parameter type class, but /${version}/workers/events uses
> > > SendWorkerEventRequest as its parameter type. Could this parameter type be
> > > unified to a POJO?
> >
> > The method of the `/${version}/conf/dynamic` API is GET, so request
> > parameters should be fine.
> >
> >
> > Thanks,
> > Fei Wang
> >
> > On 2024/06/25 17:22:51 Nicholas wrote:
> > > Hi turboFei,
> > >
> > > Thanks for driving the proposal of RESTful API Refine. I have some
> > > questions about this proposal:
> > >
> > > 1. Could you summarize all /api/v1 interfaces in the Public Interfaces
> > > section? Meanwhile, could you also add the definition of the parameter and
> > > return type classes, like the fields of the POJO?
> > >
> > >
> > >
> > >
> > > 2. Could some interfaces merged into one interface like
>
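
To make the merged `/${version}/workers` response discussed in this thread concrete, here is a minimal sketch of its shape. The five top-level field names come from Fei's table; everything else is an assumption for illustration: the fields of `WorkerData` (`host`, `rpcPort`) are placeholders, `to_json` is a hypothetical helper, and the actual implementation would be a Java POJO rather than Python dataclasses.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class WorkerData:
    host: str     # placeholder field, not taken from the design doc
    rpcPort: int  # placeholder field, not taken from the design doc


@dataclass
class WorkersResponse:
    # Field names follow the table in the thread.
    workers: List[WorkerData] = field(default_factory=list)
    lostWorkers: List[WorkerData] = field(default_factory=list)
    excludedWorkers: List[WorkerData] = field(default_factory=list)
    shutdownWorkers: List[WorkerData] = field(default_factory=list)
    decommissionWorkers: List[WorkerData] = field(default_factory=list)


def to_json(response: WorkersResponse) -> str:
    """Serialize the response the way a JSON REST endpoint might return it."""
    return json.dumps(asdict(response), sort_keys=True)
```

With this shape, one GET on the merged endpoint would carry what previously took separate calls to the lost, excluded, and shutdown endpoints.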

Re: [VOTE] Release Apache Celeborn 0.5.0-rc3

2024-06-22 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- signatures are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.5.0-bin.tgz.asc
gpg --verify apache-celeborn-0.5.0-source.tgz.asc
```
- checksums are good.
```
sha512sum --check apache-celeborn-0.5.0-bin.tgz.sha512
sha512sum --check apache-celeborn-0.5.0-source.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- build success from source code (macOS).
```
./build/make-distribution.sh --sbt-enabled --release
```

BTW, thanks for the rich tests!

Regards,
Keyong Zhou

Ethan Feng wrote on Wed, Jun 19, 2024 at 12:47:

> Hello, Celeborn community,
>
> This is a call for a vote to release Apache Celeborn
> 0.5.0-rc3
>
> The git tag to be voted upon:
> https://github.com/apache/celeborn/releases/tag/v0.5.0-rc3
>
> Source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/celeborn/v0.5.0-rc3
>
> The git commit hash:
> 048ef207359113247bff05dcc203c70021ccfa10
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1076/
>
> The fingerprint of the PGP key release artifacts is signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes is reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> * Download links, checksums, and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in the source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> There are additional tests:
> * Performance test no regression
> 1 TB TPC-DS, 0.5.0 VS 0.4.1 : 2042(s) VS 2050(s)
> 1.1 TB pure shuffle, 0.5.0 VS 0.4.1 : 11.8min vs 11.8min
>
> * Result correctness test passed
> 1TB TPC-DS runs concurrently, the results are identical.
>
> * Usability test passed
> Rolling upgrade from version 0.4.1 to 0.5.0 succeeded.
> The metrics system works as expected.
>
> * Stability test passed
> Random worker failures, Celeborn works as expected.
> Random master failures, Celeborn works as expected.
> Master meta corrupted, Celeborn works as expected.
>
> * Compatibility test passed
> The Celeborn server version of 0.5.0 works fine with the Celeborn client
> 0.4.1.
>
> * Grafana dashboard layout checked
>
>
> Regards,
> Ethan Feng
>

Re: Re:Re:[VOTE] CIP-6: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-15 Thread Keyong Zhou

+1 (binding)

Regards,
Keyong Zhou

Ethan Feng wrote on Fri, Jun 14, 2024 at 16:10:

> +1(binding)
>
> I think this CIP would bring performance benefits to Flink users.
>
> Thanks,
> Ethan
>
> Nicholas wrote on Fri, Jun 14, 2024 at 14:50:
> >
> > +1(non-binding). Sorry for the mistake; my vote is non-binding.
> >
> >
> >
> >
> > Regards,
> >
> > Nicholas Jiang
> >
> >
> >
> >
> > At 2024-06-14 14:38:42, "Nicholas"  wrote:
> >
> > +1(binding).
> > Regards,
> > Nicholas Jiang
> >
> >
> > At 2024-06-14 14:05:56, "Nicholas Jiang" wrote:
> > >+1(non-binding)
> > >
> > >
> > >
> > >
> > >Regards,
> > >
> > >Nicholas Jiang
> > >
> > >
> > >
> > >
> > >At 2024-06-14 11:36:17, "Yuxin Tan"  wrote:
> > >>Hi all,
> > >>
> > >>Thanks for all the feedback about the CIP-6: Support Flink
> > >>hybrid shuffle integration with Apache Celeborn[1].
> > >>The discussion thread is here [2].
> > >>
> > >>I'd like to start a vote for it. The vote will be open for at least
> > >>72 hours unless there is an objection or insufficient votes.
> > >>
> > >>[1]
> > >>https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > >>[2] https://lists.apache.org/thread/55mwmfsxwprzf5l80so9t2cpny82l4nx
> > >>
> > >>Best,
> > >>Yuxin
>

Re: Re: [VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Keyong Zhou

+1

Thanks for the proposal!

Regards,
Keyong Zhou

Nicholas Jiang wrote on Wed, Jun 12, 2024 at 13:02:

> +1. Looking forward to Celeborn CLI.
>
>
>
>
> Regards,
>
> Nicholas Jiang
>
>
> At 2024-06-12 12:26:34, "Aravind Patnam"  wrote:
> >Hi all,
> >
> >Sorry, this is the correct link to the Celeborn CLI CIP
> ><https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>.
> >
> >Thanks,
> >Aravind
> >
> >On Tue, Jun 11, 2024 at 9:24 PM Aravind Patnam wrote:
> >
> >> Hi all,
> >>
> >> This is a call to vote to contribute the Celeborn CLI CIP
> >> <https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals>
> >> to Apache Celeborn.
> >>
> >> Please do vote accordingly:
> >> [ ] +1 approve
> >> [ ] +0 no opinion
> >> [ ] -1 disapprove (and the reason)
> >>
> >> Thanks once again!!
> >>
> >> Aravind
> >>
> >
> >
> >--
> >Aravind K. Patnam
>

Re: [DISCUSSION] CIP-6: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-07 Thread Keyong Zhou

Hi Yuxin and Xintong,

I'm really excited to see the Flink and Celeborn communities collaborate
more on the shuffle component! I believe this will inspire more on both sides
:)

+1 for this proposal, looking forward to seeing this feature make progress.

Also, I'm very interested in integrating Flink Hybrid Shuffle with Celeborn's
Reduce Partition in the future, as mentioned in the doc, which I believe will
bring more benefit for very large shuffle operators :)

Regards,
Keyong Zhou

Nicholas Jiang wrote on Thu, Jun 6, 2024 at 13:25:

> Hi Yuxin,
>
> Thanks for driving this CIP about integration with Hybrid Shuffle. I have
> some comments on this CIP:
>
> 1. Could you describe in detail what functions the relevant components
> mentioned in Proposed Changes, including CelebornProducerAgent,
> CelebornConsumerAgent, CelebornMasterAgent, etc., support? In the design
> document, these components are only mentioned, with no details of the changes.
>
> 2. Can you briefly introduce how to guarantee compatibility with
> Celeborn’s existing features such as partition splitting? IMO, the
> compatibility introduction should be mentioned in Proposed Changes to help
> community developers understand.
>
> 3. There are no changes on public interfaces. Is there any public
> configuration of integration with Hybrid Shuffle and Flink client?
>
> 4. The server side must store Segment information for each subpartition.
> How does the server side guarantee the accuracy and recoverability of
> Segment information?
>
> 5. Should Celeborn wait until FLIP-459 is released before releasing this
> integration? Which Flink version will release FLIP-459?
>
> Regards,
> Nicholas Jiang
>
> On 2024/05/28 12:51:32 Yuxin Tan wrote:
> > Hi all,
> >
> > I would like to start a discussion on CIP-6 Support Flink hybrid shuffle
> > integration with Apache Celeborn[1]. Celeborn provides a stable,
> > performant, scalable remote shuffle service. Concurrently, Flink hybrid
> > shuffle supports transitions between memory, disk, and remote storage to
> > improve performance and job stability. This integration proposal is to
> > harness the benefits from both Celeborn and hybrid shuffle simultaneously.
> >
> > Note that this proposal has two parts.
> > 1. The Celeborn-side changes are in CIP-6[1].
> > 2. The Flink-side modifications are in FLIP-459[2].
> >
> > Looking forward to everyone's feedback and suggestions. Thank you!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> >
> > Best,
> > Yuxin
> >
>

Re: [DISCUSS] Celeborn CLI Proposal

2024-06-07 Thread Keyong Zhou

Hi Aravind,

Thanks for the proposal! The proposal LGTM, I think it's very valuable.

Regards,
Keyong Zhou

Aravind Patnam wrote on Fri, Jun 7, 2024 at 12:47:

> Hi,
>
> Thanks Nicholas for the comments!
>
> I now got access to put the proposal in Confluence in the form of CIP, here
> <https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>
> it is.
>
> Regarding your questions:
>
> > 1. From a user's perspective, the CLI is more used for some maintenance
> > operations such as online and offline of server, rescaling of cluster etc,
> > not only based on the REST API. What CLI interfaces are there that the REST
> > API doesn’t have for maintenance?
> This is highly dependent on what the user is leveraging to manage their
> cluster. For example, in k8s, you would be using k8s APIs to achieve this.
> We can probably add a generic interface API for it that provides basic
> operations that users can implement themselves for their cluster management
> logic based on what cluster managers they are using. Although, I think this
> will likely be a later evolution of the CLI, once basic REST API operations
> are implemented in the CLI. WDYT?
>
> > 2. There are same sub-commands between MASTER and WORKER. Why not these
> > sub-commands belong to BOTH?
> Agreed - this was a formatting mistake. I fixed it now, thanks for pointing
> that out.
>
> > 3. Does the implementation of CLI invoke the REST API? IMO, the CLI works
> > well no matter the server is alive.
> Yes, I agree. I think for this we would have to talk to the cluster
> manager, similar to my response to #1. We would have to query the specific
> cluster manager to get details if the Celeborn servers are dead, since the
> Celeborn REST API would not work then. We can add a generic API that users
> can implement based on their own environment.
>
> Thanks,
> Aravind
>
>
>
> On Wed, Jun 5, 2024 at 10:43 PM Nicholas Jiang wrote:
>
> > Hi Aravind,
> >
> > Thanks for driving this CIP about Celeborn CLI. I have some comments on
> > this CIP:
> >
> > 1. From a user's perspective, the CLI is more used for some maintenance
> > operations such as online and offline of server, rescaling of cluster etc,
> > not only based on the REST API. What CLI interfaces are there that the REST
> > API doesn’t have for maintenance?
> >
> > 2. There are same sub-commands between MASTER and WORKER. Why not these
> > sub-commands belong to BOTH?
> >
> > 3. Does the implementation of CLI invoke the REST API? IMO, the CLI works
> > well no matter the server is alive.
> >
> > BTW, could this design doc of proposal follow the template of CIP[1]?
> >
> > [1] https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> >
> > Regards,
> > Nicholas Jiang
> >
> > On 2024/06/05 23:33:02 Aravind Patnam wrote:
> > > Hi all,
> > >
> > > I have written up a proposal about introducing a CLI for Celeborn. You can
> > > find the proposal
> > > <https://docs.google.com/document/d/1j9wKFSR_ychYDF0NU5YN67WCCtNAgYTbN5CN8V3SOnk/edit?usp=sharing>
> > > here.
> > > Please let me know if you have any comments or questions.
> > >
> > > TLDR by introducing a CLI, it would complement the existing dashboard and
> > > would benefit us internally. We rely on CLI tools internally a lot for
> > > automation and other operations.
> > >
> > > FYI, I was not able to access the cwiki page to put this proposal there,
> > > there seems to be some permissions issue. Hope it is okay to just share as
> > > a google doc here for now.
> > >
> > > --
> > > Aravind K. Patnam
> > >
> > >  Apache Celeborn CLI Proposal
> > > <https://docs.google.com/document/d/1j9wKFSR_ychYDF0NU5YN67WCCtNAgYTbN5CN8V3SOnk/edit?usp=drive_web>
> > >
> >
>
>
> --
> Aravind K. Patnam
>

Re: [VOTE] Release Apache Celeborn 0.5.0-rc0

2024-06-07 Thread Keyong Zhou

Hi Ethan,

Thanks for the effort! Although everything else LGTM, I found some issues in
the LICENSE file:

1. ratis-metrics-default, ap-loader-all, jersey-server,
jersey-container-servlet-core, jersey-hk2, jersey-media-json-jackson, and
jersey-media-multipart are missing from the LICENSE file.
2. javax.servlet is deleted, so it should also be removed from the
LICENSE-binary file.
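
A check like this can be partly automated. The sketch below is an assumption on my side, not the project's tooling: given the names of bundled dependencies (for example, derived from the jars in the binary distribution) and the text of the LICENSE file, the hypothetical helpers `missing_from_license` and `stale_in_license` report entries missing from the file and entries that are listed but no longer bundled. Matching on whole tokens is a simplification.

```python
import re
from typing import Iterable, List, Set


def license_tokens(license_text: str) -> Set[str]:
    """Split LICENSE text into name-like tokens (letters, digits, '.', '_', '-')."""
    return set(re.findall(r"[A-Za-z0-9._-]+", license_text))


def missing_from_license(bundled: Iterable[str], license_text: str) -> List[str]:
    """Bundled dependency names that never appear in the LICENSE text."""
    tokens = license_tokens(license_text)
    return sorted(name for name in bundled if name not in tokens)


def stale_in_license(listed: Iterable[str], bundled: Iterable[str]) -> List[str]:
    """Names listed in LICENSE that are no longer bundled."""
    return sorted(set(listed) - set(bundled))
```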

Regards,
Keyong Zhou

Ethan Feng wrote on Fri, Jun 7, 2024 at 15:56:

> Hello, Celeborn community,
>
> This is a call for a vote to release Apache Celeborn
> 0.5.0-rc0
>
> The git tag to be voted upon:
> https://github.com/apache/celeborn/releases/tag/v0.5.0-rc0
>
> Source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/celeborn/v0.5.0-rc0
>
> The git commit hash:
> 36567733aace2c83e533cbefcc5cd374ca935c76
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1066/
>
> The fingerprint of the PGP key release artifacts is signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes is reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> * Download links, checksums, and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in the source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> There are additional tests:
> * Performance test no regression
> 1 TB TPC-DS, 0.5.0 VS 0.4.1 : 2042(s) VS 2050(s)
> 1.1 TB pure shuffle, 0.5.0 VS 0.4.1 : 11.8min vs 11.8min
>
> * Result correctness test passed
> 1TB TPC-DS runs concurrently, the results are identical.
>
> * Usability test passed
> Rolling upgrade from version 0.4.1 to 0.5.0 succeeded.
> The metrics system works as expected.
>
> * Stability test passed
> Random worker failures, Celeborn works as expected.
> Random master failures, Celeborn works as expected.
> Master meta corrupted, Celeborn works as expected.
>
> * Compatibility test passed
> The Celeborn server version of 0.5.0 works fine with the Celeborn client
> 0.4.1.
>
>
> Regards,
> Ethan Feng
>

Re: [DRAFT] Celeborn Board Report

2024-06-07 Thread keyong zhou

Thanks Yu for reviewing, I'm going to submit the report today.

Regards,
Keyong Zhou

Yu Li wrote on Wed, Jun 5, 2024 at 12:53:

> +1. Thanks for compiling this, Keyong!
>
> Best Regards,
> Yu
>
> On Tue, 4 Jun 2024 at 09:42, Keyong Zhou  wrote:
> >
> > Hi community,
> >
> > The board report is due on June 12th. Following is the draft I made; any
> > comments will be appreciated, thanks!
> >
> > ## Description:
> > The mission of Apache Celeborn is the creation and maintenance of software
> > related to an intermediate data service for big data computing engines to
> > boost performance, stability, and flexibility.
> >
> > ## Project Status:
> > Current project status: New
> > Issues for the board: None
> >
> > ## Membership Data:
> > Apache Celeborn was founded 2024-03-20 (3 months ago).
> > There are currently 21 committers and 13 PMC members in this project.
> > The Committer-to-PMC ratio is roughly 3:2.
> >
> > Community changes, past quarter:
> >
> > - No new PMC members (project graduated recently).
> > - Chandni Singh was added as committer on 2024-03-21.
> > - Mridul Muralidharan was added as committer on 2024-04-29.
> >
> > ## Project Activity:
> > Software development activity:
> >
> >  - We released 0.4.1 on May 28th.
> >  - We are in the process of releasing 0.5.0.
> >  - Memory file storage is merged to main.
> >  - An optimized AQE support is under review.
> >
> > Meetups and Conferences:
> >
> >  - A talk was given in Apache Local Conference Hangzhou on May 28th.
> >
> > Recent releases:
> >
> > - 0.4.1 was released on May 22nd, 2024.
> > - 0.4.0-incubating was released on February 6th, 2024.
> >
> > ## Community Health:
> > Overall community health is good. In the past quarter, the dev/users
> > mailing lists had an 11%/450% increase in traffic respectively.
> > The issues mailing list had a 17% decrease in traffic; the PMC members
> > treat it as normal fluctuation. We have been performing extensive
> > outreach for our users, and encouraging them to contribute back to the
> > project. Also, we are active in making our voice heard at various
> > conferences to attract more users.
> >
> > Regards,
> > Keyong Zhou
>

[DRAFT] Celeborn Board Report

2024-06-03 Thread Keyong Zhou

Hi community,

The board report is due on June 12th. Following is the draft I made; any
comments will be appreciated, thanks!

## Description:
The mission of Apache Celeborn is the creation and maintenance of software
related to an intermediate data service for big data computing engines to
boost performance, stability, and flexibility.

## Project Status:
Current project status: New
Issues for the board: None

## Membership Data:
Apache Celeborn was founded 2024-03-20 (3 months ago).
There are currently 21 committers and 13 PMC members in this project.
The Committer-to-PMC ratio is roughly 3:2.

Community changes, past quarter:

- No new PMC members (project graduated recently).
- Chandni Singh was added as committer on 2024-03-21.
- Mridul Muralidharan was added as committer on 2024-04-29.

## Project Activity:
Software development activity:

 - We released 0.4.1 on May 28th.
 - We are in the process of releasing 0.5.0.
 - Memory file storage is merged to main.
 - An optimized AQE support is under review.

Meetups and Conferences:

 - A talk was given in Apache Local Conference Hangzhou on May 28th.

Recent releases:

- 0.4.1 was released on May 22nd, 2024.
- 0.4.0-incubating was released on February 6th, 2024.

## Community Health:
Overall community health is good. In the past quarter, the dev/users mailing
lists had an 11%/450% increase in traffic respectively.
The issues mailing list had a 17% decrease in traffic; the PMC members treat
it as normal fluctuation. We have been performing extensive
outreach for our users, and encouraging them to contribute back to the
project. Also, we are active in making our voice heard at various
conferences to attract more users.

Regards,
Keyong Zhou

Re: [Discussion] Proposal Management in Celeborn Community

2024-05-29 Thread Keyong Zhou

+1 for me.

About the comments by Cheng, IMHO discussing on the mailing list is also
acceptable (and even better).

Regards,
Keyong Zhou

Cheng Pan wrote on Wed, May 29, 2024 at 14:32:

> +1 for archiving proposals on confluence.
>
> Does Confluence support inline comments like Google Docs does? I think
> it’s a convincing functionality for the discussion period.
>
> Thanks,
> Cheng Pan
>
>
> > On May 29, 2024, at 11:19, rexxiong  wrote:
> >
> > Hello, Celeborn community,
> >
> > In the past, when Celeborn introduced new major features or significant
> changes, we typically used Google Docs to launch proposals. However, a
> major issue with Google Docs is the difficulty in centrally managing these
> proposals. Therefore, after referring to other communities and based on
> discussions with several PMCs offline, it appears that Apache Confluence
> could be a viable alternative for our needs. With that in mind, I would
> like to invite all of you to share your thoughts, experiences, and
> preferences regarding the use of Apache Confluence versus Google Docs for
> our proposal management. Your feedback will be invaluable in helping us
> make an informed decision that best meets the needs of our community.
> >
> > Meanwhile, I have archived previous proposals and written the Celeborn
> Improvement Proposal (CIP) process on Confluence.
> >
> > What do you think? Looking forward to your thoughts on this proposal.
> >
> >
> > Thanks,
> > Jiashu Xiong
>
>

Re: [DISCUSS] Time for 0.5.0

2024-05-24 Thread Keyong Zhou

+1 for releasing 0.5.0. But I think memory file storage is still
experimental.

Regards,
Keyong Zhou

Ethan Feng wrote on Fri, May 24, 2024 at 18:15:

> Hello, Celeborn community,
>
> It has been 4 months since we released the last major version. Some
> new features, such as SSL support and memory file storage, are now
> ready. Several optimizations have been merged into the main branch.
> Many components are updated to the latest version.
>
> What do you think? I'm volunteering to be the release manager if no
> one else has applied.
>
> Thanks,
> Ethan Feng
>

Re: [VOTE] Release Apache Celeborn 0.4.1-rc1

2024-05-21 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- signatures are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.4.1-bin.tgz.asc
gpg --verify apache-celeborn-0.4.1-source.tgz.asc
```
- checksums are good.
```
sha512sum --check apache-celeborn-0.4.1-bin.tgz.sha512
sha512sum --check apache-celeborn-0.4.1-source.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- build success from source code (macOS).
```
./build/make-distribution.sh --sbt-enabled --release
```
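
The signature and checksum checks above can also be wrapped in a small script so that both artifacts are verified the same way. This is only a sketch: the function names are hypothetical, and it assumes the artifacts, their `.asc` and `.sha512` files, and an already-imported KEYS file are all in the current directory.

```shell
#!/usr/bin/env bash
# Sketch only: helper names below are illustrative, not part of Celeborn.

verify_checksum() {
  # sha512sum --check re-hashes the file named in "<artifact>.sha512"
  # and compares it with the recorded digest.
  sha512sum --check "$1.sha512"
}

verify_signature() {
  # Check the detached signature "<artifact>.asc" against the artifact itself.
  gpg --verify "$1.asc" "$1"
}

verify_artifact() {
  # Succeeds only if both the signature and the checksum are good.
  verify_signature "$1" && verify_checksum "$1"
}

# Example (after `gpg --import KEYS`):
#   for kind in bin source; do
#     verify_artifact "apache-celeborn-0.4.1-$kind.tgz" || echo "FAILED: $kind"
#   done
```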

I also tested performance, no regression:
1T TPCDS, 0.4.0 vs. 0.4.1: 2136s vs. 2127s
734G pure shuffle, 0.4.0 vs. 0.4.1: 10.3min vs. 10.3min
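
For reference, the relative change behind "no regression" can be computed from the numbers above. A minimal sketch, assuming `awk` is available; `pct_change` is a hypothetical helper, not part of Celeborn:

```shell
# pct_change OLD NEW: print the relative change from OLD to NEW as a percentage.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new - old) / old * 100 }'
}

pct_change 2136 2127   # 1T TPCDS, seconds
pct_change 10.3 10.3   # 734G pure shuffle, minutes
```

A negative value means 0.4.1 finished faster than 0.4.0; for the TPCDS run this prints -0.42, i.e. well under one percent.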

Regards,
Keyong Zhou

Cheng Pan wrote on Tue, May 21, 2024 at 14:16:

> +1
>
> I have rolled out this version to a small cluster for several days,
> everything goes well so far.
>
> I checked the
> org.apache.celeborn:celeborn-client-spark-3-shaded_2.12:0.4.1, it does not
> pull transitive deps now. While there might be some issues with the
> relocation of Guava classes, specifically, package
> `com.google.thirdparty.publicsuffix` still be there.
>
> Thanks,
> Cheng Pan
>
>
> > On May 14, 2024, at 12:13, Nicholas Jiang 
> wrote:
> >
> > Hi Celeborn community,
> >
> > This is a call for a vote to release Apache Celeborn
> >
> > 0.4.1-rc1
> >
> >
> > The git tag to be voted upon:
> >
> > https://github.com/apache/celeborn/releases/tag/v0.4.1-rc1
> >
> > The git commit hash:
> > 641180142c5ef36430a6afcd702c9487a6007458
> >
> > Source and binary artifacts can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/celeborn/v0.4.1-rc1
> >
> > The staging repo:
> >
> >
> https://repository.apache.org/content/repositories/orgapacheceleborn-1055
> >
> >
> > Fingerprint of the PGP key release artifacts are signed with:
> > D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> >
> > My public key to verify signatures can be found in:
> >
> > https://dist.apache.org/repos/dist/release/celeborn/KEYS
> >
> > The vote will be open for at least 72 hours or until the necessary
> > number of votes are reached.
> >
> > Please vote accordingly:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove (and the reason)
> >
> > Steps to validate the release:
> >
> > https://www.apache.org/info/verification.html
> >
> > * Download links, checksums and PGP signatures are valid.
> > * Source code distributions have correct names matching the current
> release.
> > * LICENSE and NOTICE files are correct.
> > * All files have license headers if necessary.
> > * No unlicensed compiled archives bundled in source archive.
> > * The source tarball matches the git tag.
> > * Build from source is successful.
> >
> > Regards,
> > Nicholas Jiang
>
>
>

Re: Voice Of Apache interview request

2024-05-15 Thread Keyong Zhou

I'm in China (GMT+8), 12 hours ahead of your zone. 9:00-10:00 a.m.
or 9:00-10:00 p.m. is good for me.

BTW, could you send the questions ahead of time so that I can prepare for
it?

Regards,
Keyong Zhou

Bowen, Rich wrote on Wed, May 15, 2024 at 23:34:

> I do interviews over Google Meet, so that I can record both sides of the
> conversation. So we just need to find a time. What time zone are you in?
> I’m in Eastern US (New York) time (UTC -4).
>
> —Rich
>
> > On May 3, 2024, at 2:47 AM, Keyong Zhou  wrote:
> >
> > Hi Rich,
> >
> > Thanks for reaching out! I'd like to be the volunteer. So, what do I
> need to do?
> >
> > Regards,
> > Keyong Zhou
> >
> > Rich Bowen <rbo...@apache.org> wrote on Tue, Apr 30, 2024 at 22:24:
> >> Congratulations on graduating and becoming a top level project at the
> Apache Software Foundation.
> >>
> >> As you may know, I produce a podcast about Apache projects, at
> https://feathercast.apache.org/  I'd like to do one about your project,
> and am looking for a volunteer who would be willing to speak for 10-15
> minutes about the project, what it does, who uses it, where the name came
> from, and plans for the future. Preferably a PMC member.
> >>
> >> Please let me know if you're interested in speaking with me on these
> topics. (Please copy me directly on any responses - rbo...@apache.org
> <mailto:rbo...@apache.org> - as I am not currently subscribed to this
> list.)
>
>

Re: Re: [VOTE] Release Apache Celeborn 0.4.1-rc0

2024-05-08 Thread Keyong Zhou

Thanks Nicholas for volunteering for the test!

Regards,
Keyong Zhou

Nicholas Jiang wrote on Wed, May 8, 2024 at 17:34:

> Hi Keyong,
>
> If no one takes the third test, I could take the random killing test using
> the chaos testing framework of Celeborn in our internal testing environment.
>
> Regards,
> Nicholas Jiang
>
> At 2024-05-07 17:09:24, "Keyong Zhou"  wrote:
> >Hi Nicholas，
> >
> >Thanks for the work! I think we need to test the following scenarios
> before
> >publishing:
> >1. compatibility test
> >2. perf test: i.e. TPCDS, pure shuffle workload
> >3. random killing test
> >
> >I'll take the perf test, anyone take the other two?
> >
> >Regards,
> >Keyong Zhou
> >
> >
> >Nicholas Jiang wrote on Tue, May 7, 2024 at 17:04:
> >
> >> Hi Celeborn community,
> >>
> >> This is a call for a vote to release Apache Celeborn
> >>
> >> 0.4.1-rc0
> >>
> >>
> >> The git tag to be voted upon:
> >>
> >> https://github.com/apache/celeborn/releases/tag/v0.4.1-rc0
> >>
> >> The git commit hash:
> >> 6118a549062cd6cda12947679485c98b2e8943a8 source and binary artifacts
> can be
> >> found at:
> >>
> >> https://dist.apache.org/repos/dist/dev/celeborn/v0.4.1-rc0
> >>
> >> The staging repo:
> >>
> >>
> https://repository.apache.org/content/repositories/orgapacheceleborn-1054
> >>
> >>
> >> Fingerprint of the PGP key release artifacts are signed with:
> >> D73CADC1DAB63BD3C770BB6D9476842D24B7C885
> >>
> >> My public key to verify signatures can be found in:
> >>
> >> https://dist.apache.org/repos/dist/release/celeborn/KEYS
> >>
> >> The vote will be open for at least 72 hours or until the necessary
> >> number of votes are reached.
> >>
> >> Please vote accordingly:
> >>
> >> [ ] +1 approve
> >> [ ] +0 no opinion
> >> [ ] -1 disapprove (and the reason)
> >>
> >> Steps to validate the release:
> >>
> >> https://www.apache.org/info/verification.html
> >>
> >> * Download links, checksums and PGP signatures are valid.
> >> * Source code distributions have correct names matching the current
> >> release.
> >> * LICENSE and NOTICE files are correct.
> >> * All files have license headers if necessary.
> >> * No unlicensed compiled archives bundled in source archive.
> >> * The source tarball matches the git tag.
> >> * Build from source is successful.
> >>
> >> Regards,
> >> Nicholas Jiang
>

Re: [DRAFT] Celeborn Board Report

2024-05-07 Thread Keyong Zhou

Thanks for all the feedback! I will submit the report today.

Regards,
Keyong Zhou

Yu Li wrote on Wed, May 8, 2024 at 08:17:

> LGTM. Thanks for the efforts, Keyong.
>
> Best Regards,
> Yu
>
>
> On Tue, May 7, 2024 at 9:30 AM Keyong Zhou  wrote:
>
> > Thanks Yu for your comments, based on your suggestion,
> > I will restate the section as follows:
> >
> > In addition to encourage more discussions on mailing lists, we will
> > also synchronize important information/conclusions to the mailing
> > lists if they happen elsewhere.
> >
> > What do you think?
> >
> > Regards,
> > Keyong Zhou
> >
> >
> >
> > Yu Li wrote on Mon, May 6, 2024 at 18:38:
> >
> > > Thanks for drafting the report, Keyong.
> > >
> > > While other parts LGTM, the below section draws my attention:
> > > >> We expect issues traffic to be steady, but there may be fluctuation
> > for
> > > >> dev/users traffic because many of the discussions happen
> > > >> in slack/wechat/dingtalk. We are encouraging more discussion to
> happen
> > > in
> > > >> maillists.
> > >
> > > As "if it didn’t happen on a mailing list, it didn’t happen" is one of
> > > the ASF Mottos [1], in addition to encouraging more discussions on
> > > mailing list, we also need to make sure important
> > > information/conclusions drawn through other channels are synchronized
> > > to the mailing list, especially the decision-making ones. In other
> > > words, we should ensure the mailing list is the most active and the
> > > official channels for making decisions.
> > >
> > > Best Regards,
> > > Yu
> > >
> > > [1]
> > >
> >
> https://community.apache.org/newbiefaq.html#NewbieFAQ-IsthereaCodeofConductforApacheprojects
> > >
> > > On Sat, 4 May 2024 at 20:31, Mridul Muralidharan 
> > wrote:
> > > >
> > > > Ah ! Then it makes sense to not include it :-)
> > > > Thanks for clarifying !
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > >
> > > > On Sat, May 4, 2024 at 4:15 AM Keyong Zhou 
> wrote:
> > > >
> > > > > Actually it's the second one. For the first one I didn't send the
> > draft
> > > > > to dev maillist for discussion because of lack of experience...
> > > > >
> > > > > Regards,
> > > > > Keyong Zhou
> > > > >
> > > > > > Mridul Muralidharan wrote on Fri, May 3, 2024 at 23:38:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > >   I meant call it out as part of the board report, so that it is
> > > captured
> > > > > > in our updates to board.
> > > > > >
> > > > > > This is the first update post TLP, right ?
> > > > > >
> > > > > > Regards,
> > > > > > Mridul
> > > > > >
> > > > > > On Fri, May 3, 2024 at 1:41 AM Keyong Zhou 
> > > wrote:
> > > > > >
> > > > > > > Hi Mridul,
> > > > > > >
> > > > > > > The news is posted in the following links:
> > > > > > >
> > > > > > > Apache.org:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
> > > > > > >
> > > > > > > Newswire:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
> > > > > > >
> > > > > > > X (Twitter):
> > https://twitter.com/TheASF/status/1782756834450801037
> > > > > > >
> > > > > > > LinkedIn:
> > > > > > >
> > > > >
> > >
> https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
> > > > > > >
> > > > > > > Besides, we also posted a blog here (in Chinese :D) :
> > > > > > > https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
> > > > > > > <

Re: [VOTE] Release Apache Celeborn 0.4.1-rc0

2024-05-07 Thread Keyong Zhou

Hi Nicholas,

Thanks for the work! I think we need to test the following scenarios before
publishing:
1. compatibility test
2. perf test: e.g. TPCDS, pure shuffle workload
3. random killing test

I'll take the perf test. Can anyone take the other two?

Regards,
Keyong Zhou


Nicholas Jiang wrote on Tue, May 7, 2024 at 17:04:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn
>
> 0.4.1-rc0
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/celeborn/releases/tag/v0.4.1-rc0
>
> The git commit hash:
> 6118a549062cd6cda12947679485c98b2e8943a8
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/celeborn/v0.4.1-rc0
>
> The staging repo:
>
> https://repository.apache.org/content/repositories/orgapacheceleborn-1054
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> D73CADC1DAB63BD3C770BB6D9476842D24B7C885
>
> My public key to verify signatures can be found in:
>
> https://dist.apache.org/repos/dist/release/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Steps to validate the release:
>
> https://www.apache.org/info/verification.html
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Regards,
> Nicholas Jiang

Re: [DRAFT] Celeborn Board Report

2024-05-06 Thread Keyong Zhou

Thanks Yu for your comments. Based on your suggestion,
I will restate the section as follows:

In addition to encouraging more discussions on mailing lists, we will
also synchronize important information/conclusions to the mailing
lists if they happen elsewhere.

What do you think?

Regards,
Keyong Zhou



Yu Li wrote on Mon, May 6, 2024 at 18:38:

> Thanks for drafting the report, Keyong.
>
> While other parts LGTM, the below section draws my attention:
> >> We expect issues traffic to be steady, but there may be fluctuation for
> >> dev/users traffic because many of the discussions happen
> >> in slack/wechat/dingtalk. We are encouraging more discussion to happen
> in
> >> maillists.
>
> As "if it didn’t happen on a mailing list, it didn’t happen" is one of
> the ASF Mottos [1], in addition to encouraging more discussions on
> mailing list, we also need to make sure important
> information/conclusions drawn through other channels are synchronized
> to the mailing list, especially the decision-making ones. In other
> words, we should ensure the mailing list is the most active and the
> official channels for making decisions.
>
> Best Regards,
> Yu
>
> [1]
> https://community.apache.org/newbiefaq.html#NewbieFAQ-IsthereaCodeofConductforApacheprojects
>
> On Sat, 4 May 2024 at 20:31, Mridul Muralidharan  wrote:
> >
> > Ah ! Then it makes sense to not include it :-)
> > Thanks for clarifying !
> >
> > Regards,
> > Mridul
> >
> >
> > On Sat, May 4, 2024 at 4:15 AM Keyong Zhou  wrote:
> >
> > > Actually it's the second one. For the first one I didn't send the draft
> > > to dev maillist for discussion because of lack of experience...
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > Mridul Muralidharan wrote on Fri, May 3, 2024 at 23:38:
> > >
> > > > Hi,
> > > >
> > > >   I meant call it out as part of the board report, so that it is
> captured
> > > > in our updates to board.
> > > >
> > > > This is the first update post TLP, right ?
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > > On Fri, May 3, 2024 at 1:41 AM Keyong Zhou 
> wrote:
> > > >
> > > > > Hi Mridul,
> > > > >
> > > > > The news is posted in the following links:
> > > > >
> > > > > Apache.org:
> > > > >
> > > > >
> > > >
> > >
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
> > > > >
> > > > > Newswire:
> > > > >
> > > > >
> > > >
> > >
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
> > > > >
> > > > > X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
> > > > >
> > > > > LinkedIn:
> > > > >
> > >
> https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
> > > > >
> > > > > Besides, we also posted a blog here (in Chinese :D) :
> > > > > https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
> > > > > <https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw>
> > > > >
> > > > > It'll be great if we can call out louder, do you have any idea? : )
> > > > >
> > > > > Regards,
> > > > > Keyong Zhou
> > > > >
> > > > > Mridul Muralidharan wrote on Fri, May 3, 2024 at 07:40:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > >   Do we want to call out graduation to TLP ?
> > > > > >
> > > > > > Regards,
> > > > > > Mridul
> > > > > >
> > > > > > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou 
> > > wrote:
> > > > > >
> > > > > > > Hi community,
> > > > > > >
> > > > > > > The board report is due on May 8th, following is the draft I
> made,
> > > > any
> > > > > > > comments
> > > > > > > will be appreciated, thanks!
> > > > > > >
> > > > > > > ## Description:
> > > > > > > The mission of Apache Celeborn is the creation and maintenance
> of
> > > > > > software

Re: [DRAFT] Celeborn Board Report

2024-05-04 Thread Keyong Zhou

Actually it's the second one. For the first one I didn't send the draft
to the dev mailing list for discussion because of my lack of experience...

Regards,
Keyong Zhou

Mridul Muralidharan wrote on Fri, May 3, 2024 at 23:38:

> Hi,
>
>   I meant call it out as part of the board report, so that it is captured
> in our updates to board.
>
> This is the first update post TLP, right ?
>
> Regards,
> Mridul
>
> On Fri, May 3, 2024 at 1:41 AM Keyong Zhou  wrote:
>
> > Hi Mridul,
> >
> > The news is posted in the following links:
> >
> > Apache.org:
> >
> >
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
> >
> > Newswire:
> >
> >
> https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html
> >
> > X (Twitter): https://twitter.com/TheASF/status/1782756834450801037
> >
> > LinkedIn:
> > https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321
> >
> > Besides, we also posted a blog here (in Chinese :D) :
> > https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw
> > <https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw>
> >
> > It'll be great if we can call out louder, do you have any idea? : )
> >
> > Regards,
> > Keyong Zhou
> >
> > Mridul Muralidharan wrote on Fri, May 3, 2024 at 07:40:
> >
> > > Hi,
> > >
> > >   Do we want to call out graduation to TLP ?
> > >
> > > Regards,
> > > Mridul
> > >
> > > On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:
> > >
> > > > Hi community,
> > > >
> > > > The board report is due on May 8th, following is the draft I made,
> any
> > > > comments
> > > > will be appreciated, thanks!
> > > >
> > > > ## Description:
> > > > The mission of Apache Celeborn is the creation and maintenance of
> > > software
> > > > related to an intermediate data service for big data computing
> engines
> > to
> > > > boost
> > > > performance, stability, and flexibility
> > > >
> > > > ## Project Status:
> > > > Current project status: New
> > > > Issues for the board: None
> > > >
> > > > ## Membership Data:
> > > > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > > > There are currently 21 committers and 13 PMC members in this project.
> > > > The Committer-to-PMC ratio is roughly 3:2.
> > > >
> > > > Community changes, past quarter:
> > > >
> > > > - No new PMC members (project graduated recently).
> > > > - Chandni Singh was added as committer on 2024-03-21.
> > > > - Mridul Muralidharan was added as committer on 2024-04-29.
> > > >
> > > > ## Project Activity:
> > > > Software development activity:
> > > >
> > > >  - We are preparing to release 0.4.1 in May.
> > > >  - We are preparing to release 0.5.0 in May.
> > > >  - Security support (authentication and SSL) has been merged.
> > > >  - Memory storage is close to being merged.
> > > >
> > > > Meetups and Conferences:
> > > >
> > > >  - An online meetup was held on April 16th with some developers.
> > > >  - An online meetup was held on April 25th with some users.
> > > >
> > > > Recent releases:
> > > >
> > > > - 0.4.0-incubating was released on 2024-02-06.
> > > > - 0.3.2-incubating was released on 2024-01-08.
> > > >
> > > > ## Community Health:
> > > > Overall community health is good. In the past quarter,
> dev/issues/users
> > > > mail list had 6%/11%/1100% increase in traffic respectively.
> > > > We expect issues traffic to be steady, but there may be fluctuation
> for
> > > > dev/users traffic because many of the discussions happen
> > > > in slack/wechat/dingtalk. We are encouraging more discussion to
> happen
> > in
> > > > maillists.
> > > >
> > > > We have been performing extensive outreach for our users, and
> > encouraging
> > > > them to contribute back to the project. Also, we are
> > > > active in making a voice in various conferences to attract more
> users.
> > > >
> > > > Regards,
> > > > Keyong Zhou
> > > >
> > >
> >
>

Re: Voice Of Apache interview request

2024-05-03 Thread Keyong Zhou

Hi Rich,

Thanks for reaching out! I'd like to be the volunteer. So, what do I need
to do?

Regards,
Keyong Zhou

Rich Bowen wrote on Tue, Apr 30, 2024 at 22:24:

> Congratulations on graduating and becoming a top level project at the
> Apache Software Foundation.
>
> As you may know, I produce a podcast about Apache projects, at
> https://feathercast.apache.org/  I'd like to do one about your project,
> and am looking for a volunteer who would be willing to speak for 10-15
> minutes about the project, what it does, who uses it, where the name came
> from, and plans for the future. Preferably a PMC member.
>
> Please let me know if you're interested in speaking with me on these
> topics. (Please copy me directly on any responses - rbo...@apache.org -
> as I am not currently subscribed to this list.)
>

Re: [DRAFT] Celeborn Board Report

2024-05-03 Thread Keyong Zhou

Hi Mridul,

The news is posted in the following links:

Apache.org:
https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn

Newswire:
https://www.globenewswire.com/news-release/2024/04/23/2867699/0/en/Apache-Software-Foundation-Announces-New-Top-Level-Project-Apache-Celeborn.html

X (Twitter): https://twitter.com/TheASF/status/1782756834450801037

LinkedIn:
https://www.linkedin.com/feed/update/urn:li:activity:7188522508231352321

Besides, we also posted a blog here (in Chinese :D) :
https://mp.weixin.qq.com/s/DdoJW-f3BZAvxciDbI3mTw

It would be great if we could publicize it more widely. Do you have any ideas? :)

Regards,
Keyong Zhou

Mridul Muralidharan wrote on Fri, May 3, 2024 at 07:40:

> Hi,
>
>   Do we want to call out graduation to TLP ?
>
> Regards,
> Mridul
>
> On Thu, May 2, 2024 at 3:34 AM Keyong Zhou  wrote:
>
> > Hi community,
> >
> > The board report is due on May 8th, following is the draft I made, any
> > comments
> > will be appreciated, thanks!
> >
> > ## Description:
> > The mission of Apache Celeborn is the creation and maintenance of
> software
> > related to an intermediate data service for big data computing engines to
> > boost
> > performance, stability, and flexibility
> >
> > ## Project Status:
> > Current project status: New
> > Issues for the board: None
> >
> > ## Membership Data:
> > Apache Celeborn was founded 2024-03-20 (2 months ago).
> > There are currently 21 committers and 13 PMC members in this project.
> > The Committer-to-PMC ratio is roughly 3:2.
> >
> > Community changes, past quarter:
> >
> > - No new PMC members (project graduated recently).
> > - Chandni Singh was added as committer on 2024-03-21.
> > - Mridul Muralidharan was added as committer on 2024-04-29.
> >
> > ## Project Activity:
> > Software development activity:
> >
> >  - We are preparing to release 0.4.1 in May.
> >  - We are preparing to release 0.5.0 in May.
> >  - Security support (authentication and SSL) has been merged.
> >  - Memory storage is close to being merged.
> >
> > Meetups and Conferences:
> >
> >  - An online meetup was held on April 16th with some developers.
> >  - An online meetup was held on April 25th with some users.
> >
> > Recent releases:
> >
> > - 0.4.0-incubating was released on 2024-02-06.
> > - 0.3.2-incubating was released on 2024-01-08.
> >
> > ## Community Health:
> > Overall community health is good. In the past quarter, dev/issues/users
> > mail list had 6%/11%/1100% increase in traffic respectively.
> > We expect issues traffic to be steady, but there may be fluctuation for
> > dev/users traffic because many of the discussions happen
> > in slack/wechat/dingtalk. We are encouraging more discussion to happen in
> > maillists.
> >
> > We have been performing extensive outreach for our users, and encouraging
> > them to contribute back to the project. Also, we are
> > active in making a voice in various conferences to attract more users.
> >
> > Regards,
> > Keyong Zhou
> >
>

[DRAFT] Celeborn Board Report

2024-05-02 Thread Keyong Zhou

Hi community,

The board report is due on May 8th. Below is the draft I prepared; any
comments will be appreciated, thanks!

## Description:
The mission of Apache Celeborn is the creation and maintenance of software
related to an intermediate data service for big data computing engines to
boost performance, stability, and flexibility.

## Project Status:
Current project status: New
Issues for the board: None

## Membership Data:
Apache Celeborn was founded 2024-03-20 (2 months ago).
There are currently 21 committers and 13 PMC members in this project.
The Committer-to-PMC ratio is roughly 3:2.

Community changes, past quarter:

- No new PMC members (project graduated recently).
- Chandni Singh was added as committer on 2024-03-21.
- Mridul Muralidharan was added as committer on 2024-04-29.

## Project Activity:
Software development activity:

 - We are preparing to release 0.4.1 in May.
 - We are preparing to release 0.5.0 in May.
 - Security support (authentication and SSL) has been merged.
 - Memory storage is close to being merged.

Meetups and Conferences:

 - An online meetup was held on April 16th with some developers.
 - An online meetup was held on April 25th with some users.

Recent releases:

- 0.4.0-incubating was released on 2024-02-06.
- 0.3.2-incubating was released on 2024-01-08.

## Community Health:
Overall community health is good. In the past quarter, traffic on the
dev/issues/users mailing lists increased by 6%/11%/1100%, respectively.
We expect issues traffic to be steady, but there may be fluctuation in
dev/users traffic because many of the discussions happen
in Slack/WeChat/DingTalk. We are encouraging more discussion to happen on
the mailing lists.

We have been performing extensive outreach to our users and encouraging
them to contribute back to the project. We are also
active in making our voice heard at various conferences to attract more users.

Regards,
Keyong Zhou

[ANNOUNCE] Add Mridul Muralidharan as new committer

2024-04-28 Thread Keyong Zhou

Hi Celeborn Community,

The Project Management Committee (PMC) for Apache Celeborn
has invited Mridul Muralidharan to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A PMC member helps manage and guide the direction of the project.

Please join me in congratulating Mridul Muralidharan!

Regards,
Keyong Zhou

Re: [DISCUSS] Time for 0.4.1

2024-04-12 Thread Keyong Zhou

+1, thanks Nicholas for volunteering!

Regards,
Keyong Zhou

Shaoyun Chen wrote on Fri, Apr 12, 2024 at 22:03:

> +1
>
> Cheng Pan wrote on Fri, Apr 12, 2024 at 20:04:
> >
> > +1, we do need a patch release for 0.4
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Apr 12, 2024, at 19:59, Nicholas Jiang 
> wrote:
> > >
> > > Hey, Celeborn community,
> > >
> > >
> > > It has been a while since the 0.4.0 release, and there are some
> critical fixes land branch-0.4, for example, [CELEBORN-1252][FOLLOWUP] Fix
> Worker#computeResourceConsumption NullPointerException for
> userResourceConsumption that does not contain given userIdentifier. From my
> perspective, it’s time to prepare for releasing 0.4.1.
> > >
> > >
> > > WDYT? And I’m volunteering to be the release manager if no one has
> applied.
> > >
> > > Regards,
> > > Nicholas Jiang
> >
> >
>

[ANNOUNCE] Add Chandni Singh as new committer

2024-03-21 Thread Keyong Zhou

Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Chandni Singh to become a committer and we are pleased
to announce that she has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Chandni Singh!

Thanks,
Keyong Zhou

Re: [VOTE] Graduate Apache Celeborn (incubating) as a TLP - Community

2024-03-01 Thread keyong zhou

+1

Regards,
Keyong Zhou

Mridul Muralidharan wrote on Fri, Mar 1, 2024 at 19:58:

> +1
>
> Regards,
> Mridul
>
>
> On Fri, Mar 1, 2024 at 4:35 AM Nicholas  wrote:
>
> >
> > +1.
> >
> >
> > Regards,
> > Nicholas Jiang
> >
> >
> >
> >
> > --
> > Sent from my NetEase Mail mobile client
> > 
> >
> >
> > - Original Message -
> > From: "Yu Li" 
> > To: dev@celeborn.apache.org
> > Sent: Fri, 1 Mar 2024 16:52:10 +0800
> > Subject: [VOTE] Graduate Apache Celeborn (incubating) as a TLP -
> Community
> >
> > Hi All,
> >
> > After a thorough discussion [1], I'd like to call a formal vote to
> > graduate Apache Celeborn (incubating) as a TLP. Below are some facts
> > and project highlights carried from [1] as well as the draft
> > resolution:
> >
> > - Currently, our community consists of 19 committers (including
> > mentors) from more than 10 companies, with 12 serving as PPMC members.
> > - So far, we have boasted 81 contributors.
> > - Throughout the incubation period, we've made 6 releases in 16
> > months, at a stable pace.
> > - We've had 6 different release managers to date.
> > - Our software is used in production by 10+ well known entities.
> > - As yet, we have opened 1,286 issues with 1,176 successfully resolved.
> > - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> > merged or closed.
> > - Through self-assessment [2], we have met all maturity criteria as
> > outlined in [3].
> >
> > We've resolved all branding issues which include Logo, GitHub repo,
> > document, website, and others [4] [5].
> >
> > --
> > Establish the Apache Celeborn Project
> >
> > WHEREAS, the Board of Directors deems it to be in the best interests of
> > the Foundation and consistent with the Foundation's purpose to establish
> > a Project Management Committee charged with the creation and maintenance
> > of open-source software, for distribution at no charge to the public,
> > related to an intermediate data service for big data computing engines
> > to boost performance, stability, and flexibility.
> >
> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> > (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> > established pursuant to Bylaws of the Foundation; and be it further
> >
> > RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> > for the creation and maintenance of software related to an intermediate
> > data service for big data computing engines to boost performance,
> > stability, and flexibility; and be it further
> >
> > RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> > hereby is created, the person holding such office to serve at the
> > direction of the Board of Directors as the chair of the Apache Celeborn
> > Project, and to have primary responsibility for management of the
> > projects within the scope of responsibility of the Apache Celeborn
> > Project; and be it further
> >
> > RESOLVED, that the persons listed immediately below be and hereby are
> > appointed to serve as the initial members of the Apache Celeborn
> > Project:
> >
> >  * Becket Qin
> >  * Cheng Pan 
> >  * Duo Zhang 
> >  * Ethan Feng
> >  * Fu Chen   
> >  * Jiashu Xiong  
> >  * Kerwin Zhang  
> >  * Keyong Zhou   
> >  * Lidong Dai
> >  * Willem Ning Jiang 
> >  * Wu Wei
> >  * Yi Zhu
> >  * Yu Li 
> >
> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Keyong Zhou be appointed to
> > the office of Vice President, Apache Celeborn, to serve in accordance
> > with and subject to the direction of the Board of Directors and the
> > Bylaws of the Foundation until death, resignation, retirement, removal
> > or disqualification, or until a successor is appointed; and be it
> > further
> >
> > RESOLVED, that the Apache Celeborn Project be and hereby is tasked with
> > the migration and rationalization of the Apache Incubator Celeborn
> > podling; and be it further
> >
> > RESOLVED, that all responsibilities pertaining to the Apache Incubator
> > Celeborn podling encumbered upon the Apache Incubator PMC are hereafter
> > discharged.
> > --
> >
> > Best Regards,
> > Yu
> >
> > [1] https://lists.apache.org/thread/z17rs0mw4nyv0s112dklmv7s3j053mby
> > [2] https://cwiki.apache.org/confluence/display/CELEBORN/Apache+Maturity+Model+Assessment+for+Celeborn
> > [3] https://community.apache.org/apache-way/apache-project-maturity-model.html
> > [4] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > [5] https://whimsy.apache.org/pods/project/celeborn
> >
>
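
The issue and PR counts quoted in the vote above can be turned into resolution rates with a little arithmetic. The following is only an illustrative sketch; the helper name `pct` is hypothetical and not part of any Celeborn tooling.

```shell
#!/usr/bin/env bash
# Sketch: percentage of items resolved, rounded to one decimal place.
# `pct` is a hypothetical helper name used only for this illustration.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", part / total * 100 }'
}

# Figures taken from the vote email: 1,176 of 1,286 issues resolved,
# 1,805 of 1,816 PRs merged or closed.
echo "issues resolved: $(pct 1176 1286)%"
echo "PRs merged or closed: $(pct 1805 1816)%"
```

With the quoted figures this works out to roughly 91.4% of issues resolved and 99.4% of PRs merged or closed.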

Re: [DISCUSS] Graduate Celeborn as TLP

2024-02-27 Thread Keyong Zhou

Thanks Willem for the information. As Cheng said, we didn't start the
registration process before :)

Best,
Keyong Zhou

Willem Jiang wrote on Tue, Feb 27, 2024 at 18:48:

> It's OK if we don't register any trademark of Celeborn.
> If we already registered the trademark of Celeborn, we need to have
> the approval of the trademark VP.
>
> Willem Jiang
>
>
> On Tue, Feb 27, 2024 at 6:19 PM Cheng Pan  wrote:
> >
> > Hi Willem,
> >
> > For trademark concerns, "Apache Celeborn" has been approved by the ASF [1];
> > do we need any additional work?
> >
> > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Feb 27, 2024, at 17:45, Willem Jiang wrote:
> > >
> > > +1, it's good to see Celeborn is ready for graduation.
> > >
> > > I have a quick question about Celeborn's trademark. Did we start the
> > > registration process before?
> > >
> > > BTW  the podling name search is approved by trademark VP [1]
> > >
> > > [1] https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-206
> > >
> > > Willem Jiang
> > >
> > > On Tue, Feb 27, 2024 at 9:40 AM Yu Li  wrote:
> > >>
> > >> Dear Celeborn Devs,
> > >>
> > >> We, the Celeborn community, began our incubation journey on October
> > >> 18, 2022. Since then, with the continuous efforts of you all, our
> > >> community has steadily developed and gradually matured, approaching
> > >> the graduation criteria [1]. Therefore, I'd like to call a discussion
> > >> to graduate Celeborn as TLP. Below are some statistics I collected,
> > >> please check it and let me know your thoughts.
> > >>
> > >> - Currently, our community consists of 19 committers (including
> > >> mentors) from more than 10 companies, with 12 serving as PPMC members
> > >> [2].
> > >> - So far, we have boasted 81 contributors.
> > >> - Throughout the incubation period, we've made 6 releases [3] in 16
> > >> months, at a stable pace.
> > >> - We've had 6 different release managers to date.
> > >> - Our software is used in production by 10+ well known entities [4].
> > >> - As yet, we have opened 1,286 issues with 1,176 successfully
> > >> resolved [5].
> > >> - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> > >> merged or closed [6].
> > >> - Through self-assessment [7], we have met all maturity criteria as
> > >> outlined in [1].
> > >>
> > >> And below is the drafted graduation resolution, JFYI:
> > >> --
> > >> Establish the Apache Celeborn Project
> > >>
> > >> WHEREAS, the Board of Directors deems it to be in the best interests of
> > >> the Foundation and consistent with the Foundation's purpose to establish
> > >> a Project Management Committee charged with the creation and maintenance
> > >> of open-source software, for distribution at no charge to the public,
> > >> related to an intermediate data service for big data computing engines
> > >> to boost performance, stability, and flexibility.
> > >>
> > >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> > >> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> > >> established pursuant to Bylaws of the Foundation; and be it further
> > >>
> > >> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> > >> for the creation and maintenance of software related to an intermediate
> > >> data service for big data computing engines to boost performance,
> > >> stability, and flexibility; and be it further
> > >>
> > >> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> > >> hereby is created, the person holding such office to serve at the
> > >> direction of the Board of Directors as the chair of the Apache Celeborn
> > >> Project, and to have primary responsibility for management of the
> > >> projects within the scope of responsibility of the Apache Celeborn
> > >> Project; and be it further
> > >>
> > >> RESOLVED, that the persons listed immediately below be and hereby are
> > >> appointed to serve as the initial members of the Apache Celeborn
> > >> Project:
> > >>
>

Re: [DISCUSS] Graduate Celeborn as TLP

2024-02-27 Thread Keyong Zhou

Thanks @Gabriel, hope Celeborn can be useful in your environment someday :)

Best,
Keyong Zhou

Gabriel Lee wrote on Tue, Feb 27, 2024 at 11:43:

> Hi Yu,
>
> Very glad to witness Celeborn's growth. Now Celeborn has already become a
> leading and mature shuffle service project after a year and a half of
> incubation. This is my +1.
>
> Best,
> Gabriel
>
> On Tue, 27 Feb 2024 at 11:14, Cheng Pan  wrote:
>
> > Thanks, Yu, for driving this, overall I agree we can graduate and move
> > forward.
> >
> > I just found there are minor issues on the Podling Website Checks[1], the
> > PPMC is actively working on this.
> >
> > [1] https://whimsy.apache.org/pods/project/celeborn
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Feb 27, 2024, at 10:58, Nicholas Jiang wrote:
> > >
> > > Hi Yu,
> > >
> > >
> > >
> > >
> > > > +1. Celeborn has an active community, with many contributions from
> > > > developers and production use at many companies, including my company,
> > > > bilibili. It's time to start the graduation procedure. Looking forward to
> > > > the graduation of Apache Celeborn.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Nicholas Jiang
> > >
> > >
> > > At 2024-02-27 09:40:04, "Yu Li"  wrote:
> > >> Dear Celeborn Devs,
> > >>
> > >> We, the Celeborn community, began our incubation journey on October
> > >> 18, 2022. Since then, with the continuous efforts of you all, our
> > >> community has steadily developed and gradually matured, approaching
> > >> the graduation criteria [1]. Therefore, I'd like to call a discussion
> > >> to graduate Celeborn as TLP. Below are some statistics I collected,
> > >> please check it and let me know your thoughts.
> > >>
> > >> - Currently, our community consists of 19 committers (including
> > >> mentors) from more than 10 companies, with 12 serving as PPMC members
> > >> [2].
> > >> - So far, we have boasted 81 contributors.
> > >> - Throughout the incubation period, we've made 6 releases [3] in 16
> > >> months, at a stable pace.
> > >> - We've had 6 different release managers to date.
> > >> - Our software is used in production by 10+ well known entities [4].
> > > >> - As yet, we have opened 1,286 issues with 1,176 successfully
> > > >> resolved [5].
> > >> - We have submitted a total of 1,816 PRs, out of which 1,805 have been
> > >> merged or closed [6].
> > >> - Through self-assessment [7], we have met all maturity criteria as
> > >> outlined in [1].
> > >>
> > >> And below is the drafted graduation resolution, JFYI:
> > >> --
> > >> Establish the Apache Celeborn Project
> > >>
> > > >> WHEREAS, the Board of Directors deems it to be in the best interests of
> > > >> the Foundation and consistent with the Foundation's purpose to establish
> > > >> a Project Management Committee charged with the creation and maintenance
> > > >> of open-source software, for distribution at no charge to the public,
> > > >> related to an intermediate data service for big data computing engines
> > > >> to boost performance, stability, and flexibility.
> > > >>
> > > >> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> > > >> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> > > >> established pursuant to Bylaws of the Foundation; and be it further
> > > >>
> > > >> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> > > >> for the creation and maintenance of software related to an intermediate
> > > >> data service for big data computing engines to boost performance,
> > > >> stability, and flexibility; and be it further
> > > >>
> > > >> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> > > >> hereby is created, the person holding such office to serve at the
> > > >> direction of the Board of Directors as the chair of the Apache Celeborn
> > > >> Project, and to have primary responsibility for management of the
> > > >> projects within the scope of responsibility of the Apache Celeborn
> > > >> Project; and be it further
> > >>
> > >> RESOLVED, that the persons listed immediately below be a

[ANNOUNCE] New PPMC member: Fu Chen

2024-02-19 Thread Keyong Zhou

Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Fu Chen to become our PPMC member and
we are pleased to announce that he has accepted.

Fu Chen has been actively contributing to the Celeborn community for more
than one year [1], including the SBT build, performance improvements, code
refactoring, bug fixes, code reviews, design discussions, docs, etc.

Please join me in congratulating Fu Chen!

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A PPMC member helps manage and guide the direction of the project.

[1] https://github.com/apache/incubator-celeborn/commits?author=cfmcgrady

Thanks,
On behalf of the Apache Celeborn PPMC

Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Keyong Zhou

Got it, then let's continue using JIRA :)

Mridul Muralidharan wrote on Wed, Feb 7, 2024 at 15:10:

> Hi,
>
>   I am fine with either actually - though more used to jira personally :-)
> (github issues has a nice integrations with pr's which has been useful
> though).
> The main reason why I asked is what Nicholas clarified about - saw a
> nontrivial number of github issue related mails, and was not sure if we
> were moving to using that !
>
> Thanks,
> Mridul
>
>
> On Wed, Feb 7, 2024 at 12:52 AM Keyong Zhou  wrote:
>
> > Hi Mridul,
> >
> > Thanks for asking. In fact, when donating Celeborn to the ASF
> > incubator, we discussed whether to use JIRA or GitHub for issue
> > tracking, and in the end we chose JIRA. It seems different projects
> > have different preferences. Maybe newer projects tend to use GitHub.
> >
> > To me, I'm actually fine with both. JIRA works well so far; would using
> > GitHub be more beneficial? Glad to hear your opinion.
> >
> > Thanks,
> > Keyong Zhou
> >
> > Mridul Muralidharan wrote on Wed, Feb 7, 2024 at 14:03:
> >
> > >   Looks like I am wrong, github issues can be used [1].
> > > Is Celeborn planning to use github issues going forward ?
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > [1] https://www.apache.org/dev/#issues
> > >
> > >
> > > On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > >   I received a fairly large number of emails to
> > > > incubator-celeb...@noreply.github.com, which typically are for PR's.
> > > > They appear to be github issues - are we trying to move to github issues
> > > > instead of Apache jira ? IIRC there is a policy to use jira for tracking
> > > > bugs/improvements, right ?
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > >
> >
>

Re: Large number of incubator-celeb...@noreply.github.com emails

2024-02-06 Thread Keyong Zhou

Hi Mridul,

Thanks for asking. In fact, when donating Celeborn to the ASF
incubator, we discussed whether to use JIRA or GitHub for issue
tracking, and in the end we chose JIRA. It seems different projects
have different preferences. Maybe newer projects tend to use GitHub.

To me, I'm actually fine with both. JIRA works well so far; would using
GitHub be more beneficial? Glad to hear your opinion.

Thanks,
Keyong Zhou

Mridul Muralidharan wrote on Wed, Feb 7, 2024 at 14:03:

>   Looks like I am wrong, github issues can be used [1].
> Is Celeborn planning to use github issues going forward ?
>
> Regards,
> Mridul
>
>
> [1] https://www.apache.org/dev/#issues
>
>
> On Wed, Feb 7, 2024 at 12:00 AM Mridul Muralidharan 
> wrote:
>
> > Hi,
> >
> >   I received a fairly large number of emails to
> > incubator-celeb...@noreply.github.com, which typically are for PR's.
> > They appear to be github issues - are we trying to move to github issues
> > instead of Apache jira ? IIRC there is a policy to use jira for tracking
> > bugs/improvements, right ?
> >
> > Regards,
> > Mridul
> >
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.4.0-incubating-rc6

2024-01-31 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- signatures are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.4.0-incubating-source.tgz.asc
gpg --verify apache-celeborn-0.4.0-incubating-bin.tgz.asc
```
- checksums are good.
```
sha512sum --check apache-celeborn-0.4.0-incubating-source.tgz.sha512
sha512sum --check apache-celeborn-0.4.0-incubating-bin.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --sbt-enabled --release
```
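
The checksum step above can be looped over every artifact in a download directory. This is a minimal sketch, assuming each `.sha512` file sits next to its artifact and uses the `HASH  FILENAME` format that `sha512sum --check` accepts; the function name `verify_sha512_dir` is hypothetical, and the `shasum` fallback is an assumption for systems without coreutils.

```shell
#!/usr/bin/env bash
# Sketch: verify every *.sha512 file in a directory of downloaded artifacts.
# `verify_sha512_dir` is a hypothetical helper, not part of the Celeborn build.
verify_sha512_dir() {
  local dir="${1:-.}" checker
  if command -v sha512sum >/dev/null 2>&1; then
    checker="sha512sum"
  else
    checker="shasum -a 512"   # assumed fallback where sha512sum is missing
  fi
  (
    cd "$dir" || exit 1
    status=0
    for f in *.sha512; do
      # An unmatched glob stays literal, so bail out if nothing exists.
      [ -e "$f" ] || { echo "no .sha512 files in $dir" >&2; exit 1; }
      $checker --check "$f" || status=1
    done
    exit "$status"
  )
}

# Example (the directory name is illustrative):
#   verify_sha512_dir ./v0.4.0-incubating-rc6
```

The function returns non-zero if any artifact fails its check, so it can gate later steps such as the source build.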

Thanks,
Keyong Zhou

Fu Chen wrote on Mon, Jan 29, 2024 at 21:46:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.4.0-incubating-rc6
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.4.0-incubating-rc6
>
>
> The git commit hash:
> 20a8576fc696f0208c24ab52e6ae883f5f0567d5
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.4.0-incubating-rc6
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1053
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Please be aware that there has been a transition in the Celeborn project's
> build tool, shifting from Maven to SBT. The SBT build documentation is
> available
> at https://celeborn.apache.org/docs/latest/developers/sbt/.
>
> For illustrative purposes:
>
> Packaging the project
> ```
> ./build/sbt clean package
> ```
>
> Creating the distribution
> ```
> ./build/make-distribution.sh --sbt-enabled --release
> ```
>
> Thanks,
> Fu Chen
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.4.0-incubating-rc4

2024-01-18 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- signatures are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.4.0-incubating-source.tgz.asc
gpg --verify apache-celeborn-0.4.0-incubating-bin.tgz.asc
```
- checksums are good.
```
sha512sum --check apache-celeborn-0.4.0-incubating-source.tgz.sha512
sha512sum --check apache-celeborn-0.4.0-incubating-bin.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --sbt-enabled --release
```
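
The "git commit hash is correct" check can be scripted against a local clone. The sketch below assumes the repository has already been cloned and the tags fetched; the function name `check_tag_commit` and the clone path in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a release-candidate tag in a local clone points at the
# commit hash quoted in the vote email. `check_tag_commit` is a hypothetical
# helper; repo path, tag, and expected hash are supplied by the caller.
check_tag_commit() {
  local repo="$1" tag="$2" expected="$3" actual
  # ^{commit} peels an annotated tag down to the commit it references.
  actual=$(git -C "$repo" rev-parse --verify --quiet "${tag}^{commit}") || {
    echo "tag $tag not found in $repo" >&2
    return 2
  }
  if [ "$actual" = "$expected" ]; then
    echo "OK: $tag -> $actual"
  else
    echo "MISMATCH: $tag -> $actual (expected $expected)" >&2
    return 1
  fi
}

# Example (the clone path is illustrative):
#   check_tag_commit ./incubator-celeborn v0.4.0-incubating-rc4 \
#     8bc07466dd85a90216820617015e329fb806c7dd
```

Comparing the full 40-character hash rather than an abbreviation avoids accepting a different commit that merely shares a prefix.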

Thanks,
Keyong Zhou

Fu Chen wrote on Thu, Jan 18, 2024 at 21:40:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.4.0-incubating-rc4
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.4.0-incubating-rc4
>
>
> The git commit hash:
> 8bc07466dd85a90216820617015e329fb806c7dd
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.4.0-incubating-rc4
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1051
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Please be aware that there has been a transition in the Celeborn project's
> build tool, shifting from Maven to SBT. The SBT build documentation is
> available
> at https://celeborn.apache.org/docs/latest/developers/sbt/.
>
> For illustrative purposes:
>
> Packaging the project
> ```
> ./build/sbt clean package
> ```
>
> Creating the distribution
> ```
> ./build/make-distribution.sh --sbt-enabled --release
> ```
>
> Thanks,
> Fu Chen
>

[ANNOUNCE] Add Xiaofeng Jiang as new committer

2024-01-11 Thread Keyong Zhou

Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Xiaofeng Jiang to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Xiaofeng Jiang!

Thanks,
Keyong Zhou

Re: [VOTE] Release Apache Celeborn(Incubating) 0.4.0-incubating-rc3

2024-01-01 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
```
gpg --import KEYS
gpg --verify apache-celeborn-0.4.0-incubating-source.tgz.asc
gpg --verify apache-celeborn-0.4.0-incubating-bin.tgz.asc
```
- hashes are correct.
```
sha512sum --check apache-celeborn-0.4.0-incubating-source.tgz.sha512
sha512sum --check apache-celeborn-0.4.0-incubating-bin.tgz.sha512
```
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code.
```
./build/make-distribution.sh --sbt-enabled --release
```

Thanks,
Keyong Zhou

Fu Chen wrote on Mon, Jan 1, 2024 at 19:42:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.4.0-incubating-rc3
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.4.0-incubating-rc3
>
>
> The git commit hash:
> 5d94bf9dea735650b90fc4959390da2bfc67fc37
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.4.0-incubating-rc3
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1050
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Please be aware that there has been a transition in the Celeborn project's
> build tool, shifting from Maven to SBT. The SBT build documentation is
> available
> at https://celeborn.apache.org/docs/latest/developers/sbt/.
>
> For illustrative purposes:
>
> Packaging the project
> ```
> ./build/sbt clean package
> ```
>
> Creating the distribution
> ```
> ./build/make-distribution.sh --sbt-enabled --release
> ```
>
> Thanks,
> Fu Chen
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.2-incubating-rc2

2023-12-31 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Keyong Zhou wrote on Mon, Jan 1, 2024 at 09:44:

> I checked
> - git commit hash is correct.
> - links are valid.
> - "incubating" is in the name.
> - PGP keys are good.
> - hashes are correct.
> - LICENSE looks good.
> - NOTICE looks good.
> - DISCLAIMER exists.
> - build success from source code (macOS).
> ```
> ./build/make-distribution.sh --release
> ```
>
> Thanks,
> Keyong Zhou
>
> Nicholas Jiang wrote on Fri, Dec 29, 2023 at 19:50:
>
>> Hi Celeborn community,
>>
>> This is a call for a vote to release Apache Celeborn (Incubating)
>>
>> 0.3.2-incubating-rc2
>>
>>
>> The git tag to be voted upon:
>>
>>
>> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating-rc2
>>
>>
>> The git commit hash:
>> 0dccad38e28554c36a5eef98de2540d996f946f7
>>
>> Source and binary artifacts can be found at:
>>
>>
>> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.2-incubating-rc2
>>
>>
>> The staging repo:
>> https://repository.apache.org/content/repositories/orgapacheceleborn-1048
>>
>>
>> Fingerprint of the PGP key release artifacts are signed with:
>> D73CADC1DAB63BD3C770BB6D9476842D24B7C885
>>
>> My public key to verify signatures can be found in:
>>
>> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>>
>>
>> The vote will be open for at least 72 hours or until the necessary
>> number of votes are reached.
>>
>>
>> Please vote accordingly:
>>
>>
>> [ ] +1 approve
>> [ ] +0 no opinion
>>
>> [ ] -1 disapprove (and the reason)
>>
>>
>> Checklist for release:
>>
>>
>> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
>> Steps to validate the release:
>>
>> https://www.apache.org/info/verification.html
>>
>> * Download links, checksums and PGP signatures are valid.
>> * Source code distributions have correct names matching the current
>> release.
>> * Release files have the word incubating in their name.
>> * DISCLAIMER, LICENSE and NOTICE files are correct.
>> * All files have license headers if necessary.
>> * No unlicensed compiled archives bundled in source archive.
>> * The source tarball matches the git tag.
>> * Build from source is successful.
>>
>> Regards,
>> Nicholas Jiang
>
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.2-incubating-rc2

2023-12-31 Thread Keyong Zhou

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Nicholas Jiang wrote on Fri, Dec 29, 2023 at 19:50:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
>
> 0.3.2-incubating-rc2
>
>
> The git tag to be voted upon:
>
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating-rc2
>
>
> The git commit hash:
> 0dccad38e28554c36a5eef98de2540d996f946f7
>
> Source and binary artifacts can be found at:
>
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.2-incubating-rc2
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1048
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> D73CADC1DAB63BD3C770BB6D9476842D24B7C885
>
> My public key to verify signatures can be found in:
>
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
>
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
>
> https://www.apache.org/info/verification.html
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Regards,
> Nicholas Jiang

Re: [VOTE] Release Apache Celeborn(Incubating) 0.4.0-incubating-rc0

2023-12-21 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Fu Chen wrote on Thu, Dec 21, 2023 at 21:41:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.4.0-incubating-rc0
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.4.0-incubating-rc0
>
>
> The git commit hash:
> de6d8d69af3381ee899ba8d92c5d63b332cbdfbf
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.4.0-incubating-rc0
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1046
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> 92AF4750DAFCB5E25B5B83EA76F54B977EB5C09B
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Please be aware that there has been a transition in the Celeborn project's
> build tool, shifting from Maven to SBT. The SBT build documentation is
> available
> at https://celeborn.apache.org/docs/latest/developers/sbt/.
>
> For illustrative purposes:
>
> Packaging the project
> ```
> ./build/sbt clean package
> ```
>
> Creating the distribution
> ```
> ./build/make-distribution.sh --sbt-enabled --release
> ```
>
> Thanks,
> Fu Chen
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.2-incubating-rc1

2023-12-21 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Nicholas Jiang wrote on Thu, Dec 21, 2023 at 14:06:

> Hi Celeborn community,
>
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.3.2-incubating-rc1
>
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.2-incubating-rc1
>
>
> The git commit hash:
> bce190d8a0a53434ef57ef33e53720f5bf4d14d6
>
> Source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.2-incubating-rc1
>
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1045
>
>
> Fingerprint of the PGP key release artifacts are signed with:
> D73CADC1DAB63BD3C770BB6D9476842D24B7C885
>
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
>
> Please vote accordingly:
>
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
>
> Regards,
> Nicholas Jiang

Re: [DISCUSS] Time for 0.4.0

2023-12-13 Thread keyong zhou

+1, thanks Fu!

On Thu, Dec 14, 2023 at 10:26 Fu Chen wrote:

> Hi, Celeborn community,
>
> It has been a while since the 0.3.0 release, and I think it's time to
> prepare for the next feature release, 0.4.0. I'm volunteering to be the
> release manager if no one else has applied.
>
> If there are no objections, I plan to cut branch-0.4 next week.
>
> Thanks,
> Fu Chen
>

Re: [DISCUSS] Time for 0.3.2

2023-12-07 Thread Keyong Zhou

+1 on 0.3.2, thanks Nicholas for volunteering!

On Thu, Dec 7, 2023 at 17:00 angers zhu wrote:

> +1 on 0.3.2
>
> On Thu, Dec 7, 2023 at 16:41 Yihe Li wrote:
>
> > +1, thanks Nicholas!
> >
> > On 2023/12/07 07:43:09 Shaoyun Chen wrote:
> > > +1  thanks Nicholas.
> > >
> > > On Thu, Dec 7, 2023 at 15:03 Mridul Muralidharan wrote:
> > > >
> > > > +1 on 0.3.2, thanks Nicholas !
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > >
> > > > On Thu, Dec 7, 2023 at 12:51 AM Cheng Pan  wrote:
> > > >
> > > > > +1, thanks for volunteering.
> > > > >
> > > > > Feel free to ping me if you encounter permission issues during the
> > > > > release phase.
> > > > >
> > > > > Thanks,
> > > > > Cheng Pan
> > > > >
> > > > >
> > > > > > On Dec 7, 2023, at 14:31, Nicholas  wrote:
> > > > > >
> > > > > > Hey, Celeborn community,
> > > > > >
> > > > > > It has been a while since the 0.3.1 release, and some critical
> > > > > > fixes have landed on branch-0.3, for example, [CELEBORN-1037]
> > > > > > Incorrect output for metrics of Prometheus. From my perspective,
> > > > > > it's time to prepare for releasing 0.3.2.
> > > > > >
> > > > > > WDYT? And I'm volunteering to be the release manager if no one
> > > > > > has applied.
> > > > > >
> > > > > > Regards,
> > > > > > Nicholas Jiang
> > > > >
> > > > >
> > > > >
> > >
> >
>

[ANNOUNCE] Add Yihe Li as new committer

2023-11-16 Thread Keyong Zhou

Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Yihe Li to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Yihe Li!

Thanks,
Keyong Zhou

Re: [ANNOUNCE] New Committer: Shaoyun Chen

2023-11-07 Thread Keyong Zhou

Congrats to Shaoyun Chen!

On Tue, Nov 7, 2023 at 19:12 Cheng Pan wrote:

> Hi Celeborn Community,
>
> The Podling Project Management Committee (PPMC) for Apache Celeborn
> has invited Shaoyun Chen to become a committer and we are pleased
> to announce that he has accepted.
>
> Being a committer enables easier contribution to the
> project since there is no need to go via the patch
> submission process. This should enable better productivity.
> A (P)PMC member helps manage and guide the direction of the project.
>
> Please join me in congratulating Shaoyun Chen!
>
> Thanks,
> Cheng Pan
>
>
>

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-11-03 Thread Keyong Zhou

I checked RDD#getOutputDeterministicLevel and found that if an RDD's
upstream is INDETERMINATE, then it is also INDETERMINATE.
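
To make the propagation concrete, here is a minimal sketch. It is a simplified
model and not Spark's actual code: `Node`, `Level` and `level` are hypothetical
names, and the special handling Spark applies to shuffle dependencies is left
out. A node takes the "worst" level found among its parents unless a level is
forced on it.

```scala
// Simplified model of how a deterministic level follows from the parents.
// NOT Spark's implementation: all dependencies are treated alike here.
object DeterminismModel {
  // Ranked from most to least deterministic, mirroring Spark's three level names.
  sealed abstract class Level(val rank: Int)
  case object Determinate extends Level(0)
  case object Unordered extends Level(1)
  case object Indeterminate extends Level(2)

  // Hypothetical stand-in for an RDD: parents plus an optional forced level.
  final case class Node(name: String, parents: Seq[Node] = Nil, forced: Option[Level] = None)

  // A root is assumed determinate; otherwise the worst parent level wins.
  def level(node: Node): Level = node.forced.getOrElse {
    if (node.parents.isEmpty) Determinate
    else node.parents.map(level).maxBy(_.rank)
  }

  def main(args: Array[String]): Unit = {
    val source = Node("source")
    val mapper = Node("mapper", Seq(source), forced = Some(Indeterminate))
    val reducer = Node("reducer", Seq(mapper)) // nothing forced on the reducer itself
    println(s"${reducer.name}: ${level(reducer)}")
  }
}
```

So a stage that looks determinate on its own is still treated as indeterminate
once anything upstream of it is.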

Thanks,
Keyong Zhou

On Fri, Nov 3, 2023 at 19:57 Keyong Zhou wrote:

> Hi Mridul,
>
> I still have a question. DAGScheduler#submitMissingTasks will only
> unregisterAllMapAndMergeOutput if the current ShuffleMapStage is
> Indeterminate. What if the current stage is determinate, but its upstream
> stage is Indeterminate, and its upstream stage is rerun?
>
> Thanks,
> Keyong Zhou
>
> On Fri, Oct 20, 2023 at 11:15 Mridul Muralidharan wrote:
>
>> To add my response - what I described (w.r.t failing job) applies only to
>> ResultStage.
>> It walks the lineage DAG to identify all indeterminate parents to
>> rollback.
>> If there are only ShuffleMapStages in the set of stages to rollback, it
>> will simply discard their output, rollback all of them, and then retry
>> these stages (same shuffle-id, a new stage attempt)
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
>> wrote:
>>
>> >
>> > Good question, and ResultStage is actually special cased in spark as its
>> > output could have already been consumed (for example collect() to
>> > driver, etc) - and so if it is one of the stages which needs to be
>> > rolled back, the job is aborted.
>> >
>> > To illustrate, see the following:
>> > -- snip --
>> >
>> > package org.apache.spark
>> >
>> > import scala.reflect.ClassTag
>> >
>> > import org.apache.spark._
>> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
>> >
>> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
>> >
>> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
>> >     delegate.compute(split, context)
>> >   }
>> >
>> >   override protected def getPartitions: Array[Partition] =
>> >     delegate.partitions
>> > }
>> >
>> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
>> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
>> >     DeterministicLevel.INDETERMINATE
>> > }
>> >
>> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
>> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
>> >     val tc = TaskContext.get
>> >     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
>> >       // Wait for all tasks to be done, then call exit
>> >       Thread.sleep(5000)
>> >       System.exit(-1)
>> >     }
>> >     delegate.compute(split, context)
>> >   }
>> > }
>> >
>> > // Make sure test_output directory is deleted before running this.
>> > //
>> > object Test {
>> >
>> >   def main(args: Array[String]): Unit = {
>> >     val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
>> >     val sc = new SparkContext(conf)
>> >
>> >     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
>> >     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
>> >     resultRdd.saveAsTextFile("test_output")
>> >   }
>> > }
>> >
>> > -- snip --
>> >
>> >
>> >
>> > Here, the mapper stage has been forced to be INDETERMINATE.
>> > In the reducer stage, the first attempt to compute partition 0 will
>> > wait for a bit and then exit - since the master is a local-cluster, this
>> > results in FetchFailure when the second attempt of partition 0 tries to
>> > fetch shuffle data.
>> > When spark tries to regenerate parent shuffle output, it sees that the
>> > parent is INDETERMINATE - and so fails the entire job with the message:
>> > "
>> > org.apache.spark.SparkException: Job aborted due to stage failure: A
>> > shuffle map stage with indeterminate output was failed and retried.
>> > However, Spark cannot rollback the ResultStage 1 to re-process the input
>> > data, and has to fail this job. Please eliminate the indeterminacy by
>> > checkpointing the RDD before repartition and try again.
>> > "
>> >
>> > This is coming from here <
>> https://github.com/apache/spark/blob/28292d51e7

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-11-03 Thread Keyong Zhou

Hi Mridul,

I still have a question. DAGScheduler#submitMissingTasks will only
unregisterAllMapAndMergeOutput if the current ShuffleMapStage is
Indeterminate. What if the current stage is determinate, but its upstream
stage is Indeterminate, and its upstream stage is rerun?

Thanks,
Keyong Zhou

On Fri, Oct 20, 2023 at 11:15 Mridul Muralidharan wrote:

> To add my response - what I described (w.r.t failing job) applies only to
> ResultStage.
> It walks the lineage DAG to identify all indeterminate parents to rollback.
> If there are only ShuffleMapStages in the set of stages to rollback, it
> will simply discard their output, rollback all of them, and then retry
> these stages (same shuffle-id, a new stage attempt)
>
>
> Regards,
> Mridul
>
>
>
> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
> wrote:
>
> >
> > Good question, and ResultStage is actually special cased in spark as its
> > output could have already been consumed (for example collect() to driver,
> > etc) - and so if it is one of the stages which needs to be rolled back,
> > the job is aborted.
> >
> > To illustrate, see the following:
> > -- snip --
> >
> > package org.apache.spark
> >
> > import scala.reflect.ClassTag
> >
> > import org.apache.spark._
> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
> >
> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
> >
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     delegate.compute(split, context)
> >   }
> >
> >   override protected def getPartitions: Array[Partition] =
> >     delegate.partitions
> > }
> >
> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
> >     DeterministicLevel.INDETERMINATE
> > }
> >
> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     val tc = TaskContext.get
> >     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
> >       // Wait for all tasks to be done, then call exit
> >       Thread.sleep(5000)
> >       System.exit(-1)
> >     }
> >     delegate.compute(split, context)
> >   }
> > }
> >
> > // Make sure test_output directory is deleted before running this.
> > //
> > object Test {
> >
> >   def main(args: Array[String]): Unit = {
> >     val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
> >     val sc = new SparkContext(conf)
> >
> >     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
> >     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> >     resultRdd.saveAsTextFile("test_output")
> >   }
> > }
> >
> > -- snip --
> >
> >
> >
> > Here, the mapper stage has been forced to be INDETERMINATE.
> > In the reducer stage, the first attempt to compute partition 0 will wait
> > for a bit and then exit - since the master is a local-cluster, this results
> > in FetchFailure when the second attempt of partition 0 tries to fetch
> > shuffle data.
> > When spark tries to regenerate parent shuffle output, it sees that the
> > parent is INDETERMINATE - and so fails the entire job with the message:
> > "
> > org.apache.spark.SparkException: Job aborted due to stage failure: A
> > shuffle map stage with indeterminate output was failed and retried.
> > However, Spark cannot rollback the ResultStage 1 to re-process the input
> > data, and has to fail this job. Please eliminate the indeterminacy by
> > checkpointing the RDD before repartition and try again.
> > "
> >
> > This is coming from here <
> > https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
> > - when rolling back stages, if spark determines that a ResultStage needs to
> > be rolled back due to loss of INDETERMINATE output, it will fail the job.
> >
> > Hope this clarifies.
> > Regards,
> > Mridul
> >
> >
> > On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:
> >
> >> In fact, I'm wondering if Spark will rerun the whole reduce
> >> ShuffleMapStage
> >> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
> >>
> >> Keyong Zhou  于2023年1

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou

Hi Mridul, thanks for the explanation, it's clear to me now, Thanks!
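
For reference, the rule in the quoted reply below can be sketched as a small
decision function. This is only a model of the behaviour as described in this
thread, not DAGScheduler code: the case classes are hypothetical stand-ins for
Spark's stage types, and `decide` is an illustrative name.

```scala
// Model of the rollback rule described in this thread (not Spark's code).
object RollbackModel {
  // Stand-ins for the two kinds of stage the scheduler distinguishes.
  sealed trait Stage { def id: Int }
  final case class ShuffleMapStage(id: Int) extends Stage
  final case class ResultStage(id: Int) extends Stage

  sealed trait Decision
  final case class AbortJob(resultStageId: Int) extends Decision
  final case class RetryStages(stageIds: Seq[Int]) extends Decision

  // stagesToRollback: stages that consumed output of a lost INDETERMINATE shuffle.
  // A ResultStage may already have exposed output, so it cannot be rolled back.
  def decide(stagesToRollback: Seq[Stage]): Decision =
    stagesToRollback
      .collectFirst { case ResultStage(id) => AbortJob(id) }
      .getOrElse(RetryStages(stagesToRollback.map(_.id)))
}
```

With only ShuffleMapStages in the set, the outcome is a retry of all of them;
as soon as a ResultStage is in the set, the job is aborted.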

On Fri, Oct 20, 2023 at 11:15 Mridul Muralidharan wrote:

> To add my response - what I described (w.r.t failing job) applies only to
> ResultStage.
> It walks the lineage DAG to identify all indeterminate parents to rollback.
> If there are only ShuffleMapStages in the set of stages to rollback, it
> will simply discard their output, rollback all of them, and then retry
> these stages (same shuffle-id, a new stage attempt)
>
>
> Regards,
> Mridul
>
>
>
> On Thu, Oct 19, 2023 at 10:08 PM Mridul Muralidharan 
> wrote:
>
> >
> > Good question, and ResultStage is actually special cased in spark as its
> > output could have already been consumed (for example collect() to driver,
> > etc) - and so if it is one of the stages which needs to be rolled back,
> > the job is aborted.
> >
> > To illustrate, see the following:
> > -- snip --
> >
> > package org.apache.spark
> >
> > import scala.reflect.ClassTag
> >
> > import org.apache.spark._
> > import org.apache.spark.rdd.{DeterministicLevel, RDD}
> >
> > class DelegatingRDD[E: ClassTag](delegate: RDD[E]) extends RDD[E](delegate) {
> >
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     delegate.compute(split, context)
> >   }
> >
> >   override protected def getPartitions: Array[Partition] =
> >     delegate.partitions
> > }
> >
> > class IndeterminateRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def getOutputDeterministicLevel: DeterministicLevel.Value =
> >     DeterministicLevel.INDETERMINATE
> > }
> >
> > class FailingRDD[E: ClassTag](delegate: RDD[E]) extends DelegatingRDD[E](delegate) {
> >   override def compute(split: Partition, context: TaskContext): Iterator[E] = {
> >     val tc = TaskContext.get
> >     if (tc.stageAttemptNumber() == 0 && tc.partitionId() == 0 && tc.attemptNumber() == 0) {
> >       // Wait for all tasks to be done, then call exit
> >       Thread.sleep(5000)
> >       System.exit(-1)
> >     }
> >     delegate.compute(split, context)
> >   }
> > }
> >
> > // Make sure test_output directory is deleted before running this.
> > //
> > object Test {
> >
> >   def main(args: Array[String]): Unit = {
> >     val conf = new SparkConf().setMaster("local-cluster[4,1,1024]")
> >     val sc = new SparkContext(conf)
> >
> >     val mapperRdd = new IndeterminateRDD(sc.parallelize(0 until 1, 20).map(v => (v, v)))
> >     val resultRdd = new FailingRDD(mapperRdd.groupByKey())
> >     resultRdd.saveAsTextFile("test_output")
> >   }
> > }
> >
> > -- snip --
> >
> >
> >
> > Here, the mapper stage has been forced to be INDETERMINATE.
> > In the reducer stage, the first attempt to compute partition 0 will wait
> > for a bit and then exit - since the master is a local-cluster, this results
> > in FetchFailure when the second attempt of partition 0 tries to fetch
> > shuffle data.
> > When spark tries to regenerate parent shuffle output, it sees that the
> > parent is INDETERMINATE - and so fails the entire job with the message:
> > "
> > org.apache.spark.SparkException: Job aborted due to stage failure: A
> > shuffle map stage with indeterminate output was failed and retried.
> > However, Spark cannot rollback the ResultStage 1 to re-process the input
> > data, and has to fail this job. Please eliminate the indeterminacy by
> > checkpointing the RDD before repartition and try again.
> > "
> >
> > This is coming from here <
> > https://github.com/apache/spark/blob/28292d51e7dbe2f3488e82435abb48d3d31f6044/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2090>
> > - when rolling back stages, if spark determines that a ResultStage needs to
> > be rolled back due to loss of INDETERMINATE output, it will fail the job.
> >
> > Hope this clarifies.
> > Regards,
> > Mridul
> >
> >
> > On Thu, Oct 19, 2023 at 10:04 AM Keyong Zhou  wrote:
> >
> >> In fact, I'm wondering if Spark will rerun the whole reduce
> >> ShuffleMapStage
> >> if its upstream ShuffleMapStage is INDETERMINATE and rerun.
> >>
> >> On Thu, Oct 19, 2023 at 23:00 Keyong Zhou wrote:
> >>
> >> > Thanks Erik for bringing up this question, I'm also curious about the
> >> > answer, any feedback is appreciated.
> >> >
> >> > Thanks,

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou

In fact, I'm wondering if Spark will rerun the whole reduce ShuffleMapStage
if its upstream ShuffleMapStage is INDETERMINATE and rerun.

On Thu, Oct 19, 2023 at 23:00 Keyong Zhou wrote:

> Thanks Erik for bringing up this question, I'm also curious about the
> answer, any feedback is appreciated.
>
> Thanks,
> Keyong Zhou
>
> On Thu, Oct 19, 2023 at 22:16 Erik fang wrote:
>
>> Mridul,
>>
>> sure, I totally agree SPARK-25299 is a much better solution, as long as we
>> can get it from spark community
>> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
>> celeborn already has spark-integration code with  [spark] scope)
>>
>> I also have a question about INDETERMINATE stage recompute, and may need
>> your help
>> The rule for INDETERMINATE ShuffleMapStage rerun is reasonable, however, I
>> don't find related logic for INDETERMINATE ResultStage rerun in
>> DAGScheduler
>> If INDETERMINATE ShuffleMapStage got entirely recomputed, the
>> corresponding ResultStage should be entirely recomputed as well, per my
>> understanding
>>
>> I found https://issues.apache.org/jira/browse/SPARK-25342 to rollback a
>> ResultStage but it was not merged
>> Do you know any context or related ticket for INDETERMINATE ResultStage
>> rerun?
>>
>> Thanks in advance!
>>
>> Regards,
>> Erik
>>
>> On Tue, Oct 17, 2023 at 4:23 AM Mridul Muralidharan 
>> wrote:
>>
>> >
>> >
>> > On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:
>> >
>> >> Hi Mridul,
>> >>
>> >> For a),
>> >> DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier()
>> >> <
>> https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1975
>> >
>> >> to decide whether the whole stage needs to be recomputed
>> >> I think we can pass the same information to Celeborn in
>> >> ShuffleManager.registerShuffle()
>> >> <
>> https://github.com/apache/spark/blob/721ea9bbb2ff77b6d2f575fdca0aeda84990cc3b/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala#L39>,
>> since
>> >> RDD in ShuffleDependency contains the RDD object
>> >> It seems Stage.isIndeterminate() is unreadable from ShuffleDependency,
>> >> but luckily rdd is used internally
>> >>
>> >> def isIndeterminate: Boolean = {
>> >>   rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
>> >> }
>> >>
>> >> Relying on an internal implementation is not good, but doable.
>> >> I don't expect the Spark RDD/Stage implementation to change frequently,
>> >> and we can discuss an RDD isIndeterminate API with the Spark community
>> >> if they change it in the future.
>> >>
>> >
>> >
>> > Only RDD.getOutputDeterministicLevel is publicly exposed,
>> > RDD.outputDeterministicLevel is not and it is private[spark].
>> > While I don't expect changes to this, it is inherently unstable to depend
>> > on it.
>> >
>> > Btw, please see the discussion with Sungwoo Park, if Celeborn is
>> > maintaining a reducer oriented view, you will need to recompute all the
>> > mappers anyway - what you might save is the subset of reducer partitions
>> > which can be skipped if it is DETERMINATE.
>> >
>> >
>> >
>> >
>> >>
>> >> for c)
>> >> I also considered a similar solution in celeborn
>> >> Celeborn (LifecycleManager) can get the full picture of remaining
>> shuffle
>> >> data from previous stage attempt and reuse it in stage recompute
>> >> , and the whole process will be transparent to Spark/DagScheduler
>> >>
>> >
>> > Celeborn does not have visibility into this - and this is potentially
>> > subject to invasive changes in Apache Spark as it evolves.
>> > For example, I recently merged a couple of changes which would make this
>> > different in master compared to previous versions.
>> > Until the remote shuffle service SPIP is implemented and these are
>> > abstracted out & made pluggable, it will continue to be quite volatile.
>> >
>> > Note that the behavior for 3.5 and older is known - since Spark versions
>> > have been released - it is the behavior in master and future versions of
>> > Spark which is subject to change.
>> > So delivering on SPARK-25299 would future proof all remot

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-19 Thread Keyong Zhou

Thanks Erik for bringing up this question, I'm also curious about the
answer, any feedback is appreciated.

Thanks,
Keyong Zhou

On Thu, Oct 19, 2023 at 22:16 Erik fang wrote:

> Mridul,
>
> sure, I totally agree SPARK-25299 is a much better solution, as long as we
> can get it from spark community
> (btw, private[spark] of RDD.outputDeterministicLevel is no big deal,
> celeborn already has spark-integration code with  [spark] scope)
>
> I also have a question about INDETERMINATE stage recompute, and may need
> your help
> The rule for INDETERMINATE ShuffleMapStage rerun is reasonable, however, I
> don't find related logic for INDETERMINATE ResultStage rerun in
> DAGScheduler
> If INDETERMINATE ShuffleMapStage got entirely recomputed, the
> corresponding ResultStage should be entirely recomputed as well, per my
> understanding
>
> I found https://issues.apache.org/jira/browse/SPARK-25342 to rollback a
> ResultStage but it was not merged
> Do you know any context or related ticket for INDETERMINATE ResultStage
> rerun?
>
> Thanks in advance!
>
> Regards,
> Erik
>
> On Tue, Oct 17, 2023 at 4:23 AM Mridul Muralidharan 
> wrote:
>
> >
> >
> > On Mon, Oct 16, 2023 at 11:31 AM Erik fang  wrote:
> >
> >> Hi Mridul,
> >>
> >> For a),
> >> DagScheduler uses Stage.isIndeterminate() and RDD.isBarrier()
> >> <
> https://github.com/apache/spark/blob/3e2470de7ea8b97dcdd8875ef25f044998fb7588/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1975
> >
> >> to decide whether the whole stage needs to be recomputed
> >> I think we can pass the same information to Celeborn in
> >> ShuffleManager.registerShuffle()
> >> <
> https://github.com/apache/spark/blob/721ea9bbb2ff77b6d2f575fdca0aeda84990cc3b/core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala#L39>,
> since
> >> RDD in ShuffleDependency contains the RDD object
> >> It seems Stage.isIndeterminate() is unreadable from ShuffleDependency,
> >> but luckily rdd is used internally
> >>
> >> def isIndeterminate: Boolean = {
> >>   rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE
> >> }
> >>
> >> Relying on an internal implementation is not good, but doable.
> >> I don't expect the Spark RDD/Stage implementation to change frequently,
> >> and we can discuss an RDD isIndeterminate API with the Spark community
> >> if they change it in the future.
> >>
> >
> >
> > Only RDD.getOutputDeterministicLevel is publicly exposed,
> > RDD.outputDeterministicLevel is not and it is private[spark].
> > While I don't expect changes to this, it is inherently unstable to depend
> > on it.
> >
> > Btw, please see the discussion with Sungwoo Park, if Celeborn is
> > maintaining a reducer oriented view, you will need to recompute all the
> > mappers anyway - what you might save is the subset of reducer partitions
> > which can be skipped if it is DETERMINATE.
> >
> >
> >
> >
> >>
> >> for c)
> >> I also considered a similar solution in celeborn
> >> Celeborn (LifecycleManager) can get the full picture of remaining
> shuffle
> >> data from previous stage attempt and reuse it in stage recompute
> >> , and the whole process will be transparent to Spark/DagScheduler
> >>
> >
> > Celeborn does not have visibility into this - and this is potentially
> > subject to invasive changes in Apache Spark as it evolves.
> > For example, I recently merged a couple of changes which would make this
> > different in master compared to previous versions.
> > Until the remote shuffle service SPIP is implemented and these are
> > abstracted out & made pluggable, it will continue to be quite volatile.
> >
> > Note that the behavior for 3.5 and older is known - since Spark versions
> > have been released - it is the behavior in master and future versions of
> > Spark which is subject to change.
> > So delivering on SPARK-25299 would future proof all remote shuffle
> > implementations.
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >>
> >> From my perspective, leveraging partial stage recompute and the
> >> remaining shuffle data needs a lot of work in Celeborn.
> >> I prefer to implement a simple whole stage recompute first, with the
> >> interface defined with a recomputeAll = true flag, and explore partial
> >> stage recompute in a separate ticket as a future optimization.
> >> What do you think about it?
> >>
>

Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou

Yeah, retaining the map output can reduce the needed tasks to be
recomputed for DETERMINATE stages when an output file is lost.
This is one important design tradeoff.

Currently Celeborn also supports MapPartition for Flink Batch, in
which case partition data is not aggregated, instead one mapper's
output is stored in one file (perhaps multiple files if split happens),
very similar to how ESS stores shuffle data. Combining MapPartition
with ReducePartition (aggregate partition data) in Celeborn the same
way magnet does may be an interesting idea.
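
To illustrate the tradeoff, here is a toy model of which mappers have to be
rerun when stored files are lost, for a DETERMINATE stage. It assumes every
mapper contributes to every reducer partition, and all names in it
(`mapOriented`, `reduceOriented`, `mappersToRerun`, the file names) are made up
for this sketch and are not Celeborn APIs.

```scala
// Toy model: which mappers must rerun after losing files, under the two layouts.
object RecomputeCost {
  // Map-oriented layout: one file per mapper, holding only that mapper's output.
  def mapOriented(numMappers: Int): Map[String, Set[Int]] =
    (0 until numMappers).map(m => s"map-$m" -> Set(m)).toMap

  // Reduce-oriented layout: one file per reducer partition, fed by all mappers
  // (assumes every mapper writes to every reducer partition).
  def reduceOriented(numMappers: Int, numReducers: Int): Map[String, Set[Int]] =
    (0 until numReducers).map(r => s"reduce-$r" -> (0 until numMappers).toSet).toMap

  // Mappers whose output is needed to regenerate the lost files.
  def mappersToRerun(layout: Map[String, Set[Int]], lost: Set[String]): Set[Int] =
    lost.flatMap(f => layout.getOrElse(f, Set.empty[Int]))
}
```

Losing `map-2` in the map-oriented layout means rerunning one mapper, while
losing `reduce-1` in the reduce-oriented layout means rerunning all of them.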

Thanks,
Keyong Zhou

On Tue, Oct 17, 2023 at 00:01 Mridul Muralidharan wrote:

> With push based shuffle in Apache Spark (magnet), we have both the map
> output and reducer orientated merged output preserved - with reducer
> oriented view chosen by default for reads and fallback to mapper output
> when reducer output is missing/failures. That mitigates this specific issue
> for DETERMINATE stages and only subset which need recomputation are
> regenerated.
> With magnet only smaller blocks are pushed for merged data, so effective
> replication is much lower.
>
> In our Celeborn deployment we are still testing, we will enable replication
> for functional and operational reasons - probably move replication out of
> the write path to speed it up further.
>
>
> Regards,
> Mridul
>
> On Mon, Oct 16, 2023 at 9:08 AM Keyong Zhou  wrote:
>
> > Hi Sungwoo,
> >
> > What you are pointing out is very correct. Currently shuffle data
> > is distributed across `celeborn.master.slot.assign.maxWorkers` workers,
> > which defaults to 10000, so I believe the cascading stage rerun will
> > definitely happen.
> >
> > I think setting `celeborn.master.slot.assign.maxWorkers` to a smaller
> > value can help, especially in relatively large clusters. Turning on
> > replication can also help, but it conflicts with the reason we do stage
> > rerun (i.e. we don't want to turn on replication because of its resource
> > consumption).
> >
> > We didn't think about this before, thanks for pointing that out!
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Fri, Oct 13, 2023 at 02:22 Sungwoo Park wrote:
> >
> > > I have a question on how Celeborn distributes shuffle data among
> > > Celeborn workers.
> > >
> > > From our observation, it seems that whenever a Celeborn worker fails or
> > > gets killed (in a small cluster of less than 25 nodes), almost every
> > > edge is affected. Does this mean that an edge with multiple partitions
> > > usually distributes its shuffle data among all Celeborn workers?
> > >
> > > If this is the case, I think stage recomputation is unnecessary and
> > > just re-executing the entire DAG is a better approach. Our current
> > > implementation uses the following scheme for stage recomputation:
> > >
> > > 1. If a read failure occurs for shuffleId #1 for an edge, we pick up a
> > > new shuffleId #2 for the same edge.
> > > 2. The upstream stage re-executes all tasks, but writes the output to
> > > shuffleId #2.
> > > 3. Tasks in the downstream stage re-try by reading from shuffleId #2.
> > >
> > > From our experiment, whenever a Celeborn worker fails and a read
> > > failure occurs for an edge, the re-execution of the upstream stage
> > > usually ends up with another read failure because some part of its
> > > input has also been lost. As a result, all upstream stages are
> > > eventually re-executed in a cascading manner. In essence, the failure
> > > of a Celeborn worker invalidates all existing shuffleIds.
> > >
> > > (This is what we observed with Hive-MR3-Celeborn, but I guess stage
> > > recomputation in Spark will have to deal with the same problem.)
> > >
> > > Thanks,
> > >
> > > --- Sungwoo
> > >
> >
>

Re: Question on Celeborn workers,

2023-10-16 Thread Keyong Zhou

Hi Sungwoo,

What you are pointing out is very correct. Currently shuffle data
is distributed across `celeborn.master.slot.assign.maxWorkers` workers,
which defaults to 10000, so I believe the cascading stage rerun will
definitely happen.

I think setting `celeborn.master.slot.assign.maxWorkers` to a smaller
value can help, especially in relatively large clusters. Turning on
replication can also help, but it conflicts with the reason we do stage
rerun (i.e. we don't want to turn on replication because of its resource
consumption).

We didn't think about this before, thanks for pointing that out!
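
As a back-of-the-envelope illustration of why a smaller cap helps, the sketch
below estimates the fraction of shuffles that lose data when a single worker
fails. It assumes each shuffle picks its workers uniformly at random, which is
an assumption of this sketch and not a description of Celeborn's actual slot
assignment; `affectedFraction` and its parameters are hypothetical names.

```scala
// Rough estimate of how many shuffles a single worker failure touches.
object BlastRadius {
  // Assumes each shuffle spreads its partitions over
  // min(maxWorkers, partitions, totalWorkers) workers chosen uniformly at random.
  def affectedFraction(totalWorkers: Int, maxWorkers: Int, partitions: Int): Double = {
    require(totalWorkers > 0 && maxWorkers > 0 && partitions > 0)
    val used = math.min(math.min(maxWorkers, partitions), totalWorkers)
    used.toDouble / totalWorkers
  }

  def main(args: Array[String]): Unit = {
    // 25-node cluster, 200-partition shuffle: a cap far above the cluster size
    // versus a cap of 5 workers per shuffle.
    println(affectedFraction(totalWorkers = 25, maxWorkers = 10000, partitions = 200))
    println(affectedFraction(totalWorkers = 25, maxWorkers = 5, partitions = 200))
  }
}
```

Under these assumptions a cap above the cluster size means every shuffle is
touched by any single worker failure, while a cap of 5 on 25 workers brings
that down to roughly one shuffle in five.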

Thanks,
Keyong Zhou

On Fri, Oct 13, 2023 at 02:22 Sungwoo Park wrote:

> I have a question on how Celeborn distributes shuffle data among Celeborn
> workers.
>
> From our observation, it seems that whenever a Celeborn worker fails or
> gets killed (in a small cluster of less than 25 nodes), almost every edge
> is affected. Does this mean that an edge with multiple partitions usually
> distributes its shuffle data among all Celeborn workers?
>
> If this is the case, I think stage recomputation is unnecessary and just
> re-executing the entire DAG is a better approach. Our current
> implementation uses the following scheme for stage recomputation:
>
> 1. If a read failure occurs for shuffleId #1 for an edge, we pick up a new
> shuffleId #2 for the same edge.
> 2. The upstream stage re-executes all tasks, but writes the output to
> shuffleId #2.
> 3. Tasks in the downstream stage re-try by reading from shuffleId #2.
>
> From our experiment, whenever a Celeborn worker fails and a read failure
> occurs for an edge, the re-execution of the upstream stage usually ends up
> with another read failure because some part of its input has also been
> lost. As a result, all upstream stages are eventually re-executed in a
> cascading manner. In essence, the failure of a Celeborn worker invalidates
> all existing shuffleIds.
>
> (This is what we observed with Hive-MR3-Celeborn, but I guess stage
> recomputation in Spark will have to deal with the same problem.)
>
> Thanks,
>
> --- Sungwoo
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc3

2023-10-06 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

On Sat, Oct 7, 2023 at 10:27 Fu Chen wrote:

> +1
>
> I checked
> - download links are valid.
> - git commit hash is correct.
> - no binary files in the source release.
> - signatures are good.
> ```
> gpg --import KEYS
> gpg --verify apache-celeborn-0.3.1-incubating-source.tgz.asc
> gpg --verify apache-celeborn-0.3.1-incubating-bin.tgz.asc
> ```
> - checksums are good.
> ```
> sha512sum --check apache-celeborn-0.3.1-incubating-source.tgz.sha512
> sha512sum --check apache-celeborn-0.3.1-incubating-bin.tgz.sha512
> ```
> - build success from source code (ubuntu 18.04).
> ```
> ./build/mvn clean package -DskipTests -Pspark-3.3
> ```
>
> On Sun, Oct 1, 2023 at 16:34, Shaoyun Chen wrote:
>
> > +1 (non-binding)
> >
> > I checked the following things:
> >
> > - signatures are good.
> > ```
> > gpg --import KEYS
> > gpg --verify apache-celeborn-0.3.1-incubating-source.tgz.asc
> > gpg --verify apache-celeborn-0.3.1-incubating-bin.tgz.asc
> > ```
> > - checksums are good.
> > ```
> > sha512sum --check apache-celeborn-0.3.1-incubating-source.tgz.sha512
> > sha512sum --check apache-celeborn-0.3.1-incubating-bin.tgz.sha512
> > ```
> > - build success from source code.
> > ```
> > ./build/make-distribution.sh -Pspark-3.2
> > ```
> >
> > On Thu, Sep 28, 2023 at 21:55, Cheng Pan wrote:
> >
> >> +CC PJ Fanning and Duo Zhang, since you found a license issue in the
> >> previous RC vote [1]
> >>
> >> [1] https://lists.apache.org/thread/8v4wy5o132rpsjync6465zztgjlf6h5p
> >>
> >>
> >> On Sep 28, 2023, at 21:52, Cheng Pan  wrote:
> >>
> >> Hi Celeborn community,
> >>
> >> This is a call for a vote to release Apache Celeborn (Incubating)
> >> 0.3.1-incubating-rc3
> >>
> >> The git tag to be voted upon:
> >> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc3
> >>
> >> The git commit hash:
> >> 861788810642ef19769b097b82bee70b92e0ace0
> >>
> >> The source and binary artifacts can be found at:
> >> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc3
> >>
> >> The staging repo:
> >> https://repository.apache.org/content/repositories/orgapacheceleborn-1040
> >>
> >> Fingerprint of the PGP key release artifacts are signed with:
> >> 8FC8075E1FDC303276C676EE8001952629BCC75D
> >>
> >> My public key to verify signatures can be found in:
> >> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> >>
> >> The vote will be open for at least 72 hours or until the necessary
> >> number of votes are reached.
> >>
> >> Please vote accordingly:
> >>
> >> [ ] +1 approve
> >> [ ] +0 no opinion
> >> [ ] -1 disapprove (and the reason)
> >>
> >> Checklist for release:
> >> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> >>
> >> Steps to validate the release:
> >> https://www.apache.org/info/verification.html
> >>
> >> Instructions for making binary artifacts from source:
> >> build/make-distribution.sh --release
> >>
> >> Thanks,
> >> Cheng Pan
> >>
> >>
> >>
>

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-06 Thread Keyong Zhou

Hi Sungwoo,

Sorry for the late reply. Reusing a committed shuffleId does not work in
the current architecture, even after calling unregisterShuffle in
LifecycleManager, because the cleanup of metadata is delayed and not
guaranteed. It becomes even more complicated when we consider graceful
restart of workers.

If we want to reuse the shuffleId, we need to redesign the whole picture.
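
Given that constraint, a compute engine can avoid reuse by always allocating
a fresh shuffleId when an edge is re-executed. A minimal sketch on the
engine side; the class `ShuffleIdAllocator` and its methods are hypothetical
and are not a Celeborn API:

```python
import itertools


class ShuffleIdAllocator:
    """Hands out a fresh shuffleId for every (re-)execution of an edge."""

    def __init__(self):
        self._next = itertools.count()
        self._current = {}      # edge -> shuffleId currently in use
        self._retired = set()   # shuffleIds that must never be reused

    def current(self, edge):
        """Return the shuffleId in use for this edge, allocating one if needed."""
        if edge not in self._current:
            self._current[edge] = next(self._next)
        return self._current[edge]

    def renew(self, edge):
        """Retire the edge's shuffleId after a read failure and allocate a new one."""
        old = self._current.get(edge)
        if old is not None:
            self._retired.add(old)
        self._current[edge] = next(self._next)
        return self._current[edge]

    def is_retired(self, shuffle_id):
        return shuffle_id in self._retired
```

Because the counter only moves forward, a retired shuffleId is never handed
out again, so delayed metadata cleanup on the Celeborn side cannot collide
with a new attempt.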

Thanks,
Keyong Zhou

On Mon, Oct 2, 2023 at 13:23, Sungwoo Park wrote:

> Hi Keyong,
>
> Instead of picking up a new shuffleId, can we reuse an existing shuffleId
> after unregistering it? If the following plan worked, it would further
> simplify the implementation:
>
> 1. Downstream tasks fail because of read failures.
> 2. All active downstream tasks are killed, so the shuffleId is not used.
> 3. An upstream vertex unregisters the shuffleId.
> 4. The upstream vertex is re-executed normally. This re-execution
> automatically registers the same shuffleId.
>
> In summary, we would like to go back in time before the upstream vertex
> started by cleaning up the shuffleId. Could you please give a comment on
> this plan?
>
> Thank you,
>
> --- Sungwoo
>
> On Sat, 30 Sep 2023, Keyong Zhou wrote:
>
> > Hi Sungwoo,
> >
> > I think your approach works with the current architecture of Celeborn,
> > and interpreting IOException when reading as read failure makes
> > sense. Currently, LifecycleManager announces data loss only when
> > CommitFiles fails.
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Fri, Sep 29, 2023 at 22:05, Sungwoo Park wrote:
> >
> >>> Since the partition split has a good chance to contain data from almost
> >>> all upstream mapper tasks, the cost of re-computing all upstream tasks
> >>> may have little difference to re-computing the actual mapper tasks in
> >>> most cases. Of course it's not always true.
> >>>
> >>> To change from 'complete' to 'incomplete' also needs to refactor
> >>> Worker's logic, which currently assumes that the succeeded attempts
> >>> will not be changed after final committing files.
> >>>
> >>>> a subset of succeeded attempts. In Erik's proposal, the whole upstream
> >>>> stage will be rerun when data lost.
> >>
> >> Thank you for your response --- things are now much clearer.
> >>
> >> From your comments shown above, let me assume that:
> >>
> >> 1. The whole upstream stage is rerun in the case of read failure.
> >>
> >> 2. Currently it is not easy to change the state of a shuffleId from
> >> 'complete' to 'incomplete'.
> >>
> >> Then, for Celeborn-MR3, I would like to experiment with the following
> >> approach:
> >>
> >> 1. If read failures occur for shuffleId #1, we pick up a new shuffleId
> >> #2.
> >>
> >> 2. The upstream stage (or Vertex in the case of MR3) re-executes all
> >> tasks again, but writes the output to shuffleId #2.
> >>
> >> 3. Tasks in the downstream stage re-try by reading from shuffleId #2.
> >>
> >> Do you think this approach makes sense under the current architecture of
> >> Celeborn? If this approach is feasible, MR3 only needs to be notified of
> >> read failures due to lost data by Celeborn ShuffleClient. Or, we could
> >> just interpret IOException from Celeborn ShuffleClient as read failures,
> >> in which case we can implement stage recompute without requiring any
> >> extension of Celeborn. (However, it would be great if Celeborn
> >> ShuffleClient could announce lost data explicitly.)
> >>
> >> An industrial user of Hive-MR3-Celeborn is trying hard to save disk
> >> usage on Celeborn workers (which use SSDs of limited capacity), so
> >> stage recompute would be a great new feature to them.
> >>
> >> Thank you,
> >>
> >> --- Sungwoo
> >>
> >>
> >>
> >

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-09-30 Thread Keyong Zhou

Hi Sungwoo,

I think your approach works with the current architecture of Celeborn,
and interpreting IOException when reading as read failure makes
sense. Currently, LifecycleManager announces data loss only when
CommitFiles fails.
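
The approach under discussion (treat an I/O error on read as lost data, then
rerun the upstream stage into a new shuffleId) could be sketched like this.
It is only an illustration: `run_with_recompute` and its callbacks are
hypothetical, and Python's `OSError` stands in for the IOException raised by
Celeborn's ShuffleClient.

```python
import itertools


def run_with_recompute(write_upstream, read_downstream, max_attempts=3):
    """Run one edge, switching to a fresh shuffleId after each read failure."""
    shuffle_ids = itertools.count(1)
    for _ in range(max_attempts):
        shuffle_id = next(shuffle_ids)
        write_upstream(shuffle_id)      # the upstream stage re-executes all tasks
        try:
            return read_downstream(shuffle_id)
        except OSError:
            # Interpreted as a read failure caused by lost shuffle data:
            # abandon this shuffleId and try again with a new one.
            continue
    raise RuntimeError("read failed after %d attempts" % max_attempts)
```

No extension of Celeborn is assumed here; the engine only needs to catch the
exception and choose the next shuffleId itself.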

Thanks,
Keyong Zhou

On Fri, Sep 29, 2023 at 22:05, Sungwoo Park wrote:

> > Since the partition split has a good chance to contain data from almost
> > all upstream mapper tasks, the cost of re-computing all upstream tasks
> > may have little difference to re-computing the actual mapper tasks in
> > most cases. Of course it's not always true.
> >
> > To change from 'complete' to 'incomplete' also needs to refactor
> > Worker's logic, which currently assumes that the succeeded attempts
> > will not be changed after final committing files.
> >
> >> a subset of succeeded attempts. In Erik's proposal, the whole upstream
> >> stage will be rerun when data lost.
>
> Thank you for your response --- things are now much clearer.
>
> From your comments shown above, let me assume that:
>
> 1. The whole upstream stage is rerun in the case of read failure.
>
> 2. Currently it is not easy to change the state of a shuffleId from
> 'complete' to 'incomplete'.
>
> Then, for Celeborn-MR3, I would like to experiment with the following
> approach:
>
> 1. If read failures occur for shuffleId #1, we pick up a new shuffleId
> #2.
>
> 2. The upstream stage (or Vertex in the case of MR3) re-executes all tasks
> again, but writes the output to shuffleId #2.
>
> 3. Tasks in the downstream stage re-try by reading from shuffleId #2.
>
> Do you think this approach makes sense under the current architecture of
> Celeborn? If this approach is feasible, MR3 only needs to be notified of
> read failures due to lost data by Celeborn ShuffleClient. Or, we could
> just interpret IOException from Celeborn ShuffleClient as read failures,
> in which case we can implement stage recompute without requiring any
> extension of Celeborn. (However, it would be great if Celeborn
> ShuffleClient could announce lost data explicitly.)
>
> An industrial user of Hive-MR3-Celeborn is trying hard to save disk usage
> on Celeborn workers (which use SSDs of limited capacity), so stage
> recompute would be a great new feature to them.
>
> Thank you,
>
> --- Sungwoo
>
>
>

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-09-29 Thread Keyong Zhou

Since a partition split has a good chance of containing data from almost all
upstream mapper tasks, the cost of re-computing all upstream tasks may differ
little from the cost of re-computing only the affected mapper tasks in most
cases. Of course, this is not always true.

Changing a shuffle from 'complete' to 'incomplete' would also require
refactoring the Worker's logic, which currently assumes that the succeeded
attempts do not change after the final commit of files.

Thanks,
Keyong Zhou


On Fri, Sep 29, 2023 at 19:43, Keyong Zhou wrote:

> Hi Sungwoo,
>
> Thanks for your reply. For the required two features you mentioned, here is
> my understanding.
>
> 1. Currently Celeborn Worker will track map indices for each partition
> split file if `celeborn.client.shuffle.rangeReadFilter.enabled` is enabled.
> This config's original purpose is to filter out unnecessary partition
> splits in mapper range reading. The map indices of each partition split
> will be tracked and returned to LifecycleManager through
> `CommitFilesResponse#committedMapIdBitMap`.
> So, as long as the `CommitFiles` succeeds, LifecycleManager will know
> about the mapper ids for each partition split. However, if `CommitFiles`
> fails, we have no way to get the information.
> This feature is also reused in memory storage design[1].
>
> 2. Currently LifecycleManager does not have a mechanism to change a
> shuffle from 'complete' to 'incomplete', because the succeeded attempts
> for each map task may already be propagated to executors and for reading.
> Shuffle Client will filter out data from non-successful attempts to ensure
> correctness, so it's hard to dynamically change a subset of succeeded
> attempts. In Erik's proposal, the whole upstream stage will be rerun when
> data lost.
>
> We are really glad to know about your efforts and progress to integrate
> Celeborn with MR3, I think it's a good example showing that Celeborn is a
> general purpose remote shuffle service for various big data compute
> engines. In fact we already mentioned MR3 in Celeborn's website[2]. So
> when your blog is ready, I'm happy to reference it in our website.
>
>
> [1]
> https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit#heading=h.fudf3s3zacpr
> [2]
> https://celeborn.apache.org/docs/latest/developers/overview/#compute-engine-integration
>
> Thanks,
> Keyong Zhou
>
>  wrote on Thu, Sep 28, 2023 at 11:12:
>
>> Hello,
>>
>> As we are developing MR3 extension for Celeborn, I would like to add my
>> comments on stage re-run in the context of using Celeborn for MR3. I
>> don't know the internal details of Spark stage re-run very well, so my
>> apology if my comments are irrelevant to the proposal in the design
>> document.
>>
>> For Celeborn-MR3, we only need the following two features:
>>
>> 1. When mapper output is lost and read errors occur, CelebornIOException
>> from ShuffleClientImpl includes the task index of the mapper (or a set of
>> task indexes) whose output has been lost.
>>
>> 2. ShuffleClientImpl notifies LifeCycleManager so that
>> ShuffleClient.mapperEnd(shuffleId, mapper_task_index, ...) can be called
>> again. In other words, LifeCycleManager marks shuffleId from 'complete'
>> back to 'incomplete'.
>>
>> Then, the task re-execution mechanism of MR3 can take care of the rest, by
>> re-executing the mapper and calling ShuffleClient.mapperEnd() again.
>>
>> From the proposal (if I understood it correctly), however, it seems that
>> 1 is not easy to implement in the current architecture of Celeborn (???):
>>
>> Celeborn doesn't know which mapper tasks need to be recomputed, unless
>> the mapping of partitionId -> List is recorded and reported to
>> LifeCycleManager at committing time.
>>
>> By the way, we have finished the initial implementation of
>> Hive-MR3-Celeborn, and it works very reliably when tested with TPC-DS
>> 10TB and the performance is also good. A release candidate is currently
>> being tested in production by a third party. It could take a bit of time
>> to learn to use Hive-MR3-Celeborn, but Hive-MR3-Celeborn could be another
>> way to run stress tests on Celeborn. For example, we produced the
>> EOFException error when running stress tests by using speculative
>> execution a lot and intentionally giving heavy memory pressure. (We have
>> quick start guides for Hadoop, K8s, standalone mode, so it should take no
>> more than a couple of hours to learn to run Hive-MR3-Celeborn.) If you are
>> interested, please let me know. Thank you.
>>
>> Best,
>>
>> -- Sungwoo
>>
>> On Fri, 22 Sep 2023, Erik fang wrote:
>>
>> > Hi folks,
>> >
>> > I have a proposal to implement Spark stage resubmission to handle
>> > shuffle fetch failure in Celeborn
>> >
>> > https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
>> >
>> > please have a look and let me know what you think
>> >
>> > Regards,
>> > Erik
>> >
>>
>

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-09-29 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your reply. Here is my understanding of the two features you
mentioned.

1. Currently, Celeborn Worker tracks map indices for each partition split
file if `celeborn.client.shuffle.rangeReadFilter.enabled` is enabled. The
original purpose of this config is to filter out unnecessary partition
splits in mapper range reading. The map indices of each partition split are
tracked and returned to LifecycleManager through
`CommitFilesResponse#committedMapIdBitMap`.
So, as long as `CommitFiles` succeeds, LifecycleManager knows the mapper
ids for each partition split. However, if `CommitFiles` fails, we have no
way to get this information.
This feature is also reused in the memory storage design[1].

2. Currently, LifecycleManager does not have a mechanism to change a shuffle
from 'complete' to 'incomplete', because the succeeded attempts for each map
task may already have been propagated to executors for reading. Shuffle
Client filters out data from non-successful attempts to ensure correctness,
so it's hard to dynamically change a subset of succeeded attempts. In Erik's
proposal, the whole upstream stage will be rerun when data is lost.
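
The two mechanisms above can be illustrated with a small sketch. This is not
Celeborn's implementation: the function names, the use of a Python integer
as the bitmap, and the (map index, attempt id, payload) batch layout are all
assumptions made for the example.

```python
def record_write(bitmaps, split_id, map_index):
    """Worker side: remember that `map_index` wrote into partition split `split_id`."""
    bitmaps[split_id] = bitmaps.get(split_id, 0) | (1 << map_index)


def mappers_of(bitmaps, split_ids):
    """Return the sorted map indices that contributed to any of the given splits."""
    merged = 0
    for split_id in split_ids:
        merged |= bitmaps.get(split_id, 0)
    return [i for i in range(merged.bit_length()) if (merged >> i) & 1]


def filter_successful(batches, succeeded_attempt):
    """Client side: keep only payloads written by the succeeded attempt of each
    map task. `batches` holds (map_index, attempt_id, payload) tuples and
    `succeeded_attempt` maps map_index -> attempt_id."""
    return [payload for map_index, attempt_id, payload in batches
            if succeeded_attempt.get(map_index) == attempt_id]
```

With bitmaps like these, the mappers behind a lost split can be listed only
if the bitmaps were reported, which matches the point that nothing is known
when `CommitFiles` fails. The filter shows why the set of succeeded attempts
is hard to change once readers already rely on it.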

We are really glad to know about your efforts and progress in integrating
Celeborn with MR3. I think it's a good example showing that Celeborn is a
general-purpose remote shuffle service for various big data compute engines.
In fact, we already mention MR3 on Celeborn's website[2]. So when your blog
is ready, I'm happy to reference it on our website.

[1]
https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit#heading=h.fudf3s3zacpr
[2]
https://celeborn.apache.org/docs/latest/developers/overview/#compute-engine-integration

Thanks,
Keyong Zhou

 wrote on Thu, Sep 28, 2023 at 11:12:

> Hello,
>
> As we are developing MR3 extension for Celeborn, I would like to add my
> comments on stage re-run in the context of using Celeborn for MR3. I don't
> know the internal details of Spark stage re-run very well, so my apology
> if my comments are irrelevant to the proposal in the design document.
>
> For Celeborn-MR3, we only need the following two features:
>
> 1. When mapper output is lost and read errors occur, CelebornIOException from
> ShuffleClientImpl includes the task index of the mapper (or a set of task
> indexes) whose output has been lost.
>
> 2. ShuffleClientImpl notifies LifeCycleManager so that
> ShuffleClient.mapperEnd(shuffleId, mapper_task_index, ...) can be called
> again. In other words, LifeCycleManager marks shuffleId from 'complete'
> back to 'incomplete'.
>
> Then, the task re-execution mechanism of MR3 can take care of the rest, by
> re-executing the mapper and calling ShuffleClient.mapperEnd() again.
>
> From the proposal (if I understood it correctly), however, it seems that 1
> is not easy to implement in the current architecture of Celeborn (???):
>
> Celeborn doesn't know which mapper tasks need to be recomputed, unless the
> mapping of partitionId -> List is recorded and reported to
> LifeCycleManager at committing time.
>
> By the way, we have finished the initial implementation of
> Hive-MR3-Celeborn, and it works very reliably when tested with TPC-DS 10TB
> and the performance is also good. A release candidate is currently being
> tested in production by a third party. It could take a bit of time to
> learn to use Hive-MR3-Celeborn, but Hive-MR3-Celeborn could be another way
> to run stress tests on Celeborn. For example, we produced the EOFException
> error when running stress tests by using speculative execution a lot and
> intentionally giving heavy memory pressure. (We have quick start guides
> for Hadoop, K8s, standalone mode, so it should take no more than a couple
> of hours to learn to run Hive-MR3-Celeborn.) If you are interested, please
> let me know. Thank you.
>
> Best,
>
> -- Sungwoo
>
> On Fri, 22 Sep 2023, Erik fang wrote:
>
> > Hi folks,
> >
> > I have a proposal to implement Spark stage resubmission to handle shuffle
> > fetch failure in Celeborn
> >
> >
> https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
> >
> > please have a look and let me know what you think
> >
> > Regards,
> > Erik
> >
>

Re: [DISCUSS] Support authentication in Celeborn

2023-09-16 Thread Keyong Zhou

Hi Chandni,

Thanks for the explanation. I agree that ensuring the security of
the client jar and its distribution falls outside the scope of adding
authentication to Celeborn.

I'm OK with the design doc, thanks! Let's see if other developers
have further feedback.
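
For platforms that do not generate a secret themselves, the quoted message
below suggests generating one from a cryptographically secure random source.
A minimal sketch of that idea; the function name `generate_app_secret` and
the 32-byte default length are assumptions, not values taken from the design
doc:

```python
import secrets


def generate_app_secret(num_bytes: int = 32) -> str:
    """Return a hex-encoded random secret drawn from the OS CSPRNG."""
    return secrets.token_hex(num_bytes)


# Two applications get independent secrets; a collision is highly unlikely.
secret_a = generate_app_secret()
secret_b = generate_app_secret()
```

Python's `secrets` module plays the role that Java's SecureRandom plays in
the quoted description.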

Thanks,
Keyong Zhou

On Sun, Sep 17, 2023 at 11:04, Chandni Singh wrote:

> Hi Keyong,
>
> At present, Spark generates a unique secret for each application (code:
> <https://github.com/apache/spark/blob/804f741453fb146b5261084fa3baf26631badb79/core/src/main/scala/org/apache/spark/util/Utils.scala#L2887>),
> which is then shared with ESS via Yarn. We were considering sharing this
> secret with Celeborn for the Spark applications. Java's SecureRandom is
> used to generate the random bytes for each application's secret, making it
> highly unlikely for two applications to generate identical secrets. For
> platforms that don't generate secrets, the LifecycleManager could handle
> this task, using similar logic for generation.
>
> The chances of two applications generating the same secret would imply a
> malicious client jar, as you pointed out. However, ensuring the security of
> the client jar and its distribution falls outside the scope of adding
> authentication to Celeborn.
>
> Chandni
>
> On Sat, Sep 16, 2023 at 5:37 AM Keyong Zhou  wrote:
>
> > Hi Chandni,
> >
> > Thanks for the detailed explanation. For question 3 I still have
> > questions.
> > I do mean that two applications claim to be the same AppX, but only
> > the first app registers itself through steps 5-7; the second app only
> > goes through steps 5.a and 5.b. Seems the shared secret is generated by
> > LifecycleManager, so is it possible that the second app goes through
> > 5.a/b with the same
> > 
> > as the first app, then it generates the same shared secret? If so, the
> > second app can skip steps 6-7, and creates connection with servers
> > claiming it's also AppX.
> >
> > This problem has an assumption that the client jar is malicious. Maybe
> > we don't need to consider such situation for now, I'm just thinking
> > about the possibility.
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Sat, Sep 16, 2023 at 14:03, Chandni Singh wrote:
> >
> > > Hi Keyong,
> > > Thanks for reviewing the proposal.
> > > 1. Should we store the shared secrets in Ratis among masters since
> > > leader may change at any time?
> > > That's a good point. The secret should be stored in Ratis, as you've
> > > mentioned, because the leader can change at any time. Mridul and I
> > > discussed this but haven't yet included it in the document. We will
> > > need to enable TLS communication between the masters, which I believe
> > > Ratis supports. Ratis also maintains a local log where state
> > > information is persisted. Since we're dealing with secrets, encryption
> > > may be necessary, although that might be beyond the current scope. Once
> > > we add support for encryption at rest, we can then implement encrypted
> > > secret storage within Ratis.
> > >
> > > 2. In case of worker graceful restart, should the worker store the
> > > shared secret in leveldb before it stops,
> > > or ask the master after it restarts? (The latter seems to be
> > > necessary)
> > > In the preferred approach, where workers can retrieve the secret from
> > > the master, this method should suffice even after a worker restarts.
> > > Although this does increase the load on the master for sharing the
> > > secret, I don't believe it worsens the situation when employed during
> > > graceful restarts. Storing the information in LevelDB is also an
> > > option; however, since we're storing secrets, encryption would be
> > > advisable. As it stands, even ESS stores unencrypted secrets in
> > > LevelDB, which is unacceptable for applications with strict security
> > > requirements.
> > >
> > > 3. In 5.a/b, what happens if two applications send the same payload?
> > > Will they get the same shared secret?
> > > Do you mean that two applications claim to be, let's say, AppX? If so,
> > > the application that first registers with the master as AppX will
> > > proceed to communicate with the Celeborn service. Once an application
> > > has registered as AppX, the master will not permit any other
> > > application to register with the same identifier. The master will then
> > > terminate the connection with the
> > > second a

Re: [DISCUSS] Support authentication in Celeborn

2023-09-16 Thread Keyong Zhou

Hi Chandni,

Thanks for the detailed explanation. I still have a question about point 3.
I do mean that two applications claim to be the same AppX, but only the
first app registers itself through steps 5-7; the second app only goes through
steps 5.a and 5.b. It seems the shared secret is generated by LifecycleManager,
so is it possible that the second app goes through 5.a/b with the same

as the first app and then generates the same shared secret? If so, the
second app can skip steps 6-7 and create connections with servers, claiming
it is also AppX.

This problem assumes that the client jar is malicious. Maybe we don't need
to consider such a situation for now; I'm just thinking about the possibility.
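
To make the scenario concrete, here is a minimal sketch of a master that
accepts only the first registration for an application id, as described in
the quoted reply below. The class `Master` and its methods are hypothetical
and do not reflect the design doc's actual protocol steps.

```python
import hmac


class Master:
    """Keeps one secret per application id; the first registration wins."""

    def __init__(self):
        self._secrets = {}

    def register(self, app_id: str, secret: bytes) -> bool:
        """Register app_id with its secret; reject a second registration."""
        if app_id in self._secrets:
            return False
        self._secrets[app_id] = secret
        return True

    def authenticate(self, app_id: str, secret: bytes) -> bool:
        """Accept a connection only if the presented secret matches the registered one."""
        expected = self._secrets.get(app_id)
        return expected is not None and hmac.compare_digest(expected, secret)
```

In this sketch a second application claiming to be AppX is rejected at
registration and fails authentication with any other secret. It would pass
`authenticate` only if it somehow held the same secret as the first
application, which is the case raised in the question above.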

Thanks,
Keyong Zhou

On Sat, Sep 16, 2023 at 14:03, Chandni Singh wrote:

> Hi Keyong,
> Thanks for reviewing the proposal.
> 1. Should we store the shared secrets in Ratis among masters since leader
> may change at any time?
> That's a good point. The secret should be stored in Ratis, as you've
> mentioned, because the leader can change at any time. Mridul and I
> discussed this but haven't yet included it in the document. We will need to
> enable TLS communication between the masters, which I believe Ratis
> supports. Ratis also maintains a local log where state information is
> persisted. Since we're dealing with secrets, encryption may be necessary,
> although that might be beyond the current scope. Once we add support for
> encryption at rest, we can then implement encrypted secret storage within
> Ratis.
>
> 2. In case of worker graceful restart, should the worker store the shared
> secret in leveldb before it stops,
> or ask the master after it restarts? (The latter seems to be necessary)
> In the preferred approach, where workers can retrieve the secret from the
> master, this method should suffice even after a worker restarts. Although
> this does increase the load on the master for sharing the secret, I don't
> believe it worsens the situation when employed during graceful restarts.
> Storing the information in LevelDB is also an option; however, since we're
> storing secrets, encryption would be advisable. As it stands, even ESS
> stores unencrypted secrets in LevelDB, which is unacceptable for
> applications with strict security requirements.
>
> 3. In 5.a/b, what happens if two applications send the same payload? Will
> they get the same shared secret?
> Do you mean that two applications claim to be, let's say, AppX? If so, the
> application that first registers with the master as AppX will proceed to
> communicate with the Celeborn service. Once an application has registered
> as AppX, the master will not permit any other application to register with
> the same identifier. The master will then terminate the connection with the
> second application attempting to register as AppX. That application will
> not be able to connect to the service any longer, as its secret was never
> registered with the master. As of now, we don't have any plans to support
> TTL.
>
> 4. The doc says TTL is out of scope, is there a plan to support TTL in the
> future?
> No, we don't have any plans to support it as of now.
>
> I am going to incorporate some of these points in the doc as well.
>
> - Chandni
>
> On Fri, Sep 15, 2023 at 10:21 PM Keyong Zhou  wrote:
>
> > Hi Chandni & Mridul,
> >
> > Thanks for proposing this great feature! I've reviewed the design doc and
> > it LGTM overall. Still I have a few questions that
> > are not present in the proposal (maybe too detailed):
> >
> > 1. Should we store the shared secrets in Ratis among masters since leader
> > may change at any time?
> > 2. In case of worker graceful restart, should the worker store the shared
> > secret in leveldb before it stops,
> > or ask the master after it restarts? (The latter seems to be necessary)
> > 3. In 5.a/b, what happens if two applications send the same payload? Will
> > they get the same shared secret?
> > 4. The doc says TTL is out of scope, is there a plan to support TTL in
> > the future?
> >
> > Thanks,
> > Keyong Zhou
> >
> > On Fri, Sep 15, 2023 at 06:34, Chandni Singh wrote:
> >
> > > Hello Celeborn community,
> > >
> > > We have a proposal to add authentication to Celeborn:
> > >
> > >
> > > https://docs.google.com/document/d/1D1U2COYhS3ob7l0t2WghRhBk_Fci9RGx-2FBXA3nvXk/edit#heading=h.m97qw1fpl5kv
> > >
> > > Would really appreciate feedback from the community on this proposal.
> > >
> > > Please let me know if there is a particular format that the Celeborn
> > > community follows for proposals and I will convert it into that format.
> > >
> > > Thank you
> > > Chandni
> > >
> >
>

Re: [DISCUSS] Support authentication in Celeborn

2023-09-15 Thread Keyong Zhou

Hi Chandni & Mridul,

Thanks for proposing this great feature! I've reviewed the design doc and
it LGTM overall. Still, I have a few questions that are not covered in the
proposal (maybe too detailed):

1. Should we store the shared secrets in Ratis among masters, since the
leader may change at any time?
2. In case of worker graceful restart, should the worker store the shared
secret in leveldb before it stops, or ask the master after it restarts?
(The latter seems to be necessary.)
3. In 5.a/b, what happens if two applications send the same payload? Will
they get the same shared secret?
4. The doc says TTL is out of scope. Is there a plan to support TTL in the
future?

Thanks,
Keyong Zhou

On Fri, Sep 15, 2023 at 06:34, Chandni Singh wrote:

> Hello Celeborn community,
>
> We have a proposal to add authentication to Celeborn:
>
> https://docs.google.com/document/d/1D1U2COYhS3ob7l0t2WghRhBk_Fci9RGx-2FBXA3nvXk/edit#heading=h.m97qw1fpl5kv
>
> Would really appreciate feedback from the community on this proposal.
>
> Please let me know if there is a particular format that the Celeborn
> community follows for proposals and I will convert it into that format.
>
> Thank you
> Chandni
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc2

2023-09-11 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

On Mon, Sep 11, 2023 at 22:08, rexxiong wrote:

> +1 (binding)
> I checked
> - Download links are valid.
> - git commit hash is correct
> - Checksums and signatures are valid.
> - No binary files in the source release
> - Files have the word incubating in their name.
> - DISCLAIMER, LICENSE and NOTICE files exist.
> - Successfully built the binary from the source on macOS with command:
> ./build/make-distribution.sh --release
>
> Thanks,
> Jiashu Xiong
>
> On Mon, Sep 11, 2023 at 16:27, Cheng Pan wrote:
>
> > Hi Celeborn community,
> >
> > This is a call for a vote to release Apache Celeborn (Incubating)
> > 0.3.1-incubating-rc2
> >
> > The git tag to be voted upon:
> > https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc2
> >
> > The git commit hash:
> > 7ec5596748af49ef9cb429d08550e89d94d5cc74
> >
> > The source and binary artifacts can be found at:
> > https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc2
> >
> > The staging repo:
> > https://repository.apache.org/content/repositories/orgapacheceleborn-1039
> >
> > Fingerprint of the PGP key release artifacts are signed with:
> > 8FC8075E1FDC303276C676EE8001952629BCC75D
> >
> > My public key to verify signatures can be found in:
> > https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> >
> > The vote will be open for at least 72 hours or until the necessary
> > number of votes are reached.
> >
> > Please vote accordingly:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove (and the reason)
> >
> > Checklist for release:
> > https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> >
> > Steps to validate the release:
> > https://www.apache.org/info/verification.html
> >
> > Instructions for making binary artifacts from source:
> > build/make-distribution.sh --release
> >
> > Thanks,
> > Cheng Pan
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc1

2023-09-07 Thread Keyong Zhou

It seems the bugfix[1] is critical for supporting Flink, so I suggest
preparing rc2.

[1] https://github.com/apache/incubator-celeborn/pull/1881

Thanks,
Keyong Zhou

On Wed, Sep 6, 2023 at 21:13, Zhongqiang Chen wrote:

> -1
> I am so sorry. There is a bugfix for MapPartition Split. For more
> information, please see this PR:
> https://github.com/apache/incubator-celeborn/pull/1881
> Regards,
> ZhongqiangChen
>
> At 2023-09-06 15:52:16, "Cheng Pan"  wrote:
> >Hi Celeborn community,
> >
> >This is a call for a vote to release Apache Celeborn (Incubating)
> >0.3.1-incubating-rc1
> >
> >The git tag to be voted upon:
> >https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc1
> >
> >The git commit hash:
> >6814a634f2c21a95d3199c4c7d0bca8f1be55cc2
> >
> >The source and binary artifacts can be found at:
> >https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc1
> >
> >The staging repo:
> >https://repository.apache.org/content/repositories/orgapacheceleborn-1038
> >
> >Fingerprint of the PGP key release artifacts are signed with:
> >8FC8075E1FDC303276C676EE8001952629BCC75D
> >
> >My public key to verify signatures can be found in:
> >https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> >
> >The vote will be open for at least 72 hours or until the necessary
> >number of votes are reached.
> >
> >Please vote accordingly:
> >
> >[ ] +1 approve
> >[ ] +0 no opinion
> >[ ] -1 disapprove (and the reason)
> >
> >Checklist for release:
> >https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> >
> >Steps to validate the release:
> >https://www.apache.org/info/verification.html
> >
> >
> >Instructions for making binary artifacts from source:
> >build/make-distribution.sh --release
> >
> >Thanks,
> >Cheng Pan
> >
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc0

2023-09-04 Thread Keyong Zhou

I found a performance degradation when Spark's partition coalescing takes
effect. This PR[1] fixes it, and I tested the following with [1] applied:
1. 1T TPCDS with row shuffle, columnar shuffle, and columnar shuffle +
codegen: OK
2. 0.3.0-incubating client with 0.3.1 server: OK
3. graceful shutdown: OK
4. decommission through HTTP request: needs to be fixed by [2]

[1] https://github.com/apache/incubator-celeborn/pull/1876
[2] https://github.com/apache/incubator-celeborn/pull/1877

Cheng Pan wrote on Mon, Sep 4, 2023 at 21:08:

> This vote failed because there is one binding -1, will prepare the next RC
> in a few days.
>
> Thanks to all who participated in the voting to help with this release.
>
> Thanks,
> Cheng Pan
>
>
> > On Sep 1, 2023, at 15:38, Cheng Pan  wrote:
> >
> > Ethan, could you please create the corresponding JIRA ticket(s) and set
> the priority to blocker?
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >> On Sep 1, 2023, at 15:33, Ethan Feng  wrote:
> >>
> >> -1 (binding)
> >> The deploy doc in this RC needs to be updated.
> >>
> >> Regards,
> >> Ethan
> >>
> >> On Friday, September 1, 2023, Mridul Muralidharan wrote:
> >>
> >>> +1
> >>>
> >>> Signatures, digests, license, etc check out fine.
> >>> Checked out tag and build/tested with -Pspark-3.1
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>>
> >>> On Thu, Aug 31, 2023 at 11:35 AM Cheng Pan  wrote:
> >>>
> >>>> Hi Celeborn community,
> >>>>
> >>>> This is a call for a vote to release Apache Celeborn (Incubating)
> >>>> 0.3.1-incubating-rc0
> >>>>
> >>>> The git tag to be voted upon:
> >>>> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc0
> >>>>
> >>>> The git commit hash:
> >>>> b3992274e207959125d8784d9b61a6e8043612fc
> >>>>
> >>>> The source and binary artifacts can be found at:
> >>>> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc0
> >>>>
> >>>> The staging repo:
> >>>> https://repository.apache.org/content/repositories/orgapacheceleborn-1037
> >>>>
> >>>> Fingerprint of the PGP key release artifacts are signed with:
> >>>> 8FC8075E1FDC303276C676EE8001952629BCC75D
> >>>>
> >>>> My public key to verify signatures can be found in:
> >>>> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> >>>>
> >>>> The vote will be open for at least 72 hours or until the necessary
> >>>> number of votes are reached.
> >>>>
> >>>> Please vote accordingly:
> >>>>
> >>>> [ ] +1 approve
> >>>> [ ] +0 no opinion
> >>>> [ ] -1 disapprove (and the reason)
> >>>>
> >>>> Checklist for release:
> >>>> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> >>>> Steps to validate the release:
> >>>> https://www.apache.org/info/verification.html
> >>>>
> >>>> * Download links, checksums and PGP signatures are valid.
> >>>> * Source code distributions have correct names matching the current
> >>>> release.
> >>>> * Release files have the word incubating in their name.
> >>>> * DISCLAIMER, LICENSE and NOTICE files are correct.
> >>>> * All files have license headers if necessary.
> >>>> * No unlicensed compiled archives bundled in source archive.
> >>>> * The source tarball matches the git tag.
> >>>> * Build from source is successful.
> >>>>
> >>>> Thanks,
> >>>> Cheng Pan
> >>>>
> >>>
> >
>
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.1-incubating-rc0

2023-09-01 Thread Keyong Zhou

I will run a thorough test on Spark for this RC.

Thanks,
Keyong Zhou

Ethan Feng wrote on Fri, Sep 1, 2023 at 15:48:

> The Jira ticket is CELEBORN-941. I fixed it yesterday but I didn't
> realize that merge_pr.sh failed to copy it into branch-0.3.
>
> Regards,
> Ethan
>
> Cheng Pan wrote on Fri, Sep 1, 2023 at 15:39:
> >
> > Ethan, could you please create the corresponding JIRA ticket(s) and set
> the priority to blocker?
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Sep 1, 2023, at 15:33, Ethan Feng  wrote:
> > >
> > > -1 (binding)
> > > The deploy doc in this RC needs to be updated.
> > >
> > > Regards,
> > > Ethan
> > >
> > > On Friday, September 1, 2023, Mridul Muralidharan wrote:
> > >
> > >> +1
> > >>
> > >> Signatures, digests, license, etc check out fine.
> > >> Checked out tag and build/tested with -Pspark-3.1
> > >>
> > >> Regards,
> > >> Mridul
> > >>
> > >>
> > >> On Thu, Aug 31, 2023 at 11:35 AM Cheng Pan  wrote:
> > >>
> > >>> Hi Celeborn community,
> > >>>
> > >>> This is a call for a vote to release Apache Celeborn (Incubating)
> > >>> 0.3.1-incubating-rc0
> > >>>
> > >>> The git tag to be voted upon:
> > >>>
> > >>> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.1-incubating-rc0
> > >>>
> > >>> The git commit hash:
> > >>> b3992274e207959125d8784d9b61a6e8043612fc
> > >>>
> > >>> The source and binary artifacts can be found at:
> > >>> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.1-incubating-rc0
> > >>>
> > >>> The staging repo:
> > >>> https://repository.apache.org/content/repositories/orgapacheceleborn-1037
> > >>>
> > >>> Fingerprint of the PGP key release artifacts are signed with:
> > >>> 8FC8075E1FDC303276C676EE8001952629BCC75D
> > >>>
> > >>> My public key to verify signatures can be found in:
> > >>> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> > >>>
> > >>> The vote will be open for at least 72 hours or until the necessary
> > >>> number of votes are reached.
> > >>>
> > >>> Please vote accordingly:
> > >>>
> > >>> [ ] +1 approve
> > >>> [ ] +0 no opinion
> > >>> [ ] -1 disapprove (and the reason)
> > >>>
> > >>> Checklist for release:
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> > >>> Steps to validate the release:
> > >>> https://www.apache.org/info/verification.html
> > >>>
> > >>> * Download links, checksums and PGP signatures are valid.
> > >>> * Source code distributions have correct names matching the current
> > >>> release.
> > >>> * Release files have the word incubating in their name.
> > >>> * DISCLAIMER, LICENSE and NOTICE files are correct.
> > >>> * All files have license headers if necessary.
> > >>> * No unlicensed compiled archives bundled in source archive.
> > >>> * The source tarball matches the git tag.
> > >>> * Build from source is successful.
> > >>>
> > >>> Thanks,
> > >>> Cheng Pan
> > >>>
> > >>
> >
>

Re: [DISCUSS] Time for 0.3.1

2023-08-29 Thread Keyong Zhou

Hi Pan,

I think the PR[1] and the ISSUE[2] should be included in 0.3.1.
I expect both will either make the deadline or delay it by only a few days.

[1] https://github.com/apache/incubator-celeborn/pull/1550
[2] https://issues.apache.org/jira/browse/CELEBORN-928

Thanks,
Keyong Zhou

Cheng Pan wrote on Tue, Aug 29, 2023 at 14:26:

> Thanks, I plan to cut the 0.3.1-rc0 on Fri Aug 30.
>
> Please let me know if you have any PRs that want to be shipped.
>
> Thanks,
> Cheng Pan
>
> On Mon, Aug 28, 2023 at 2:08 PM Keyong Zhou  wrote:
> >
> > Thanks Pan for volunteering! I also think it's time to release 0.3.1.
> >
> > Thanks,
> > Keyong Zhou
> >
> > Binjie Yang wrote on Mon, Aug 28, 2023 at 10:53:
> >
> > > +1, thanks for driving this.
> > >
> > > Thanks,
> > > Binjie Yang
> > >
> > > On 2023/08/28 02:48:02 Cheng Pan wrote:
> > > > Hi, Celeborn community,
> > > >
> > > > It has been a while since the 0.3.0 release, and some critical
> > > > fixes have landed in branch-0.3. I think it’s time to prepare for
> > > > releasing 0.3.1.
> > > >
> > > > WDYT? I’m volunteering to be the release manager if no one has
> > > > applied.
> > > >
> > > > Thanks,
> > > > Cheng Pan
> > > >
> > > >
> > > >
> > >
>

Re: [DISCUSS] Time for 0.3.1

2023-08-28 Thread Keyong Zhou

Thanks Pan for volunteering! I also think it's time to release 0.3.1.

Thanks,
Keyong Zhou

Binjie Yang wrote on Mon, Aug 28, 2023 at 10:53:

> +1, thanks for driving this.
>
> Thanks,
> Binjie Yang
>
> On 2023/08/28 02:48:02 Cheng Pan wrote:
> > Hi, Celeborn community,
> >
> > It has been a while since the 0.3.0 release, and some critical fixes
> > have landed in branch-0.3. I think it’s time to prepare for releasing
> > 0.3.1.
> >
> > WDYT? I’m volunteering to be the release manager if no one has applied.
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >
>

Re: Question on speculative execution,

2023-08-20 Thread Keyong Zhou

Hi Sungwoo,

Are there any other exceptions when 'Premature EOF from inputStream' occurs?
Could you send the log file of the reduce task?

Thanks,
Keyong Zhou

Sungwoo Park wrote on Mon, Aug 21, 2023 at 12:32:

> Hi Keyong.
>
> Thanks for your reply. We call mapperEnd() in attempt #2 (which is
> followed by a call to ShuffleClient.cleanup()). Also, attempt #1 is killed
> after attempt #2 is finished.
>
> It looks like 'Premature EOF from inputStream' error occurs after a
> taskattempt is interrupted while it keeps printing error messages like:
>
> 2023-08-19 11:52:16,119 [celeborn-retry-sender-21] INFO
> org.apache.celeborn.client.ShuffleClientImpl [] - Revive for push data
> success, new location for shuffle 1005007 map 408 attempt 0 partition 0
> batch 1 is location PartitionLocation[ ... ].
>
> Do you have any comments about this? We call Celeborn-API in a standard
> way (using only pushData(), mapperEnd(), cleanup(), etc).
>
> Thanks,
>
> --- Sungwoo
>
>
> On Mon, Aug 21, 2023 at 12:18 PM Keyong Zhou  wrote:
>
>> Hi Sungwoo,
>>
>> Thanks for your mail! For your questions:
>>
>> 1. No, your implementation does not violate the usage of Celeborn-API,
>> and speculative execution
>> is supported. Do you call mapperEnd in attempt #2? I think you can
>> kill attempt #1 after the
>> invocation of mapperEnd in attempt #2 succeeds.
>>
>> 2. Since speculative execution is allowed, we can safely kill a task
>> attempt when another attempt
>> succeeds.
>>
>> Thanks,
>> Keyong Zhou
>>
>>  wrote on Sat, Aug 19, 2023 at 22:38:
>>
>>> Hello Celeborn team,
>>>
>>> We are quite close to completing our Celeborn-MR3 client, and I have a
>>> question on speculative execution in the context of using Celeborn.
>>>
>>> MR3 supports speculative execution which allows several task attempts to
>>> run concurrently. When a task attempt succeeds, all other concurrent task
>>> attempts are interrupted and killed, so that only one task attempt
>>> commits
>>> its output.
>>>
>>> When using Celeborn-MR3, speculative execution sometimes seems to corrupt
>>> data sent over to Celeborn. Below I describe a sequence of events that
>>> produce this error. shuffleId, mapId, and partitionId are all fixed,
>>> whereas attemptId can be either 0 or 1.
>>>
>>> 1. Task attempt #1 (with attemptId 0) starts, and calls
>>> ShuffleClient.pushData().
>>>
>>> 2. Task attempt #1 gets stuck at the call of mapperEnd() because
>>> ShuffleClient fails to send data to Celeborn for an unknown reason, while
>>> repeatedly producing INFO messages like:
>>>
>>> 2023-08-19 11:52:16,119 [celeborn-retry-sender-21] INFO
>>> org.apache.celeborn.client.ShuffleClientImpl [] - Revive for push data
>>> success, new location for shuffle 1005007 map 408 attempt 0 partition 0
>>> batch 1 is location PartitionLocation[
>>>id-epoch:0-4
>>>
>>> host-rpcPort-pushPort-fetchPort-replicatePort:192.168.10.103-39861-45968-46540-44091
>>>mode:PRIMARY
>>>peer:(empty)
>>>storage hint:StorageInfo{type=MEMORY, mountPoint='UNKNOWN_DISK',
>>> finalResult=false, filePath=}
>>>mapIdBitMap:null].
>>>
>>> 3. As task attempt #1 does not return for a long time, the speculative
>>> execution mechanism of MR3 kicks in and launches another task attempt
>>> #2 (with attemptId 1).
>>>
>>> 4. Task attempt #2 calls pushData() and succeeds. That is, task attempt
>>> #2
>>> successfully pushes data to Celeborn.
>>>
>>> 5. MR3 interrupts and kills task attempt #1. When this occurs,
>>> mapperEnd()
>>> gets interrupted and prints a message like the following:
>>>
>>> 2023-08-19 11:52:16,089 [DAG-1-5-1] WARN  RuntimeTask [] -
>>> LogicalOutput.close() fails on Reducer 12
>>> org.apache.celeborn.common.exception.CelebornIOException: sleep
>>> interrupted
>>>at
>>>
>>> org.apache.celeborn.common.write.InFlightRequestTracker.limitZeroInFlight(InFlightRequestTracker.java:155)
>>>at
>>>
>>> org.apache.celeborn.common.write.PushState.limitZeroInFlight(PushState.java:85)
>>>at
>>>
>>> org.apache.celeborn.client.ShuffleClientImpl.limitZeroInFlight(ShuffleClientImpl.java:611)
>>>at
>>>
>>> org.apache.celeborn.client.ShuffleClientImpl.mapEndInternal(ShuffleClientImpl.java:1494)
>>>at
>>>
>>> org.apache.celeborn.cl

Re: Question on speculative execution,

2023-08-20 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your mail! For your questions:

1. No, your implementation does not violate the usage of the Celeborn API,
and speculative execution is supported. Do you call mapperEnd in attempt #2?
I think you can kill attempt #1 after the invocation of mapperEnd in
attempt #2 succeeds.

2. Since speculative execution is allowed, we can safely kill a task attempt
when another attempt succeeds.
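
The ordering suggested in point 1 can be sketched roughly as follows. This is
only a toy model in Python, not Celeborn or MR3 code: the class and method
names (SpeculativeMapTask, on_mapper_end_succeeded) are hypothetical, and the
sketch assumes the scheduler is notified when an attempt's mapperEnd call
returns successfully.

```python
class SpeculativeMapTask:
    """Toy bookkeeping for the concurrent attempts of one map task (hypothetical)."""

    def __init__(self):
        self.running = set()   # attempt ids that are still running
        self.committed = None  # attempt id whose mapperEnd succeeded first

    def start(self, attempt_id):
        self.running.add(attempt_id)

    def on_mapper_end_succeeded(self, attempt_id):
        """Record the winner and return the attempt ids that may now be killed."""
        if self.committed is not None:
            # Another attempt already won; nothing more to kill.
            return []
        self.committed = attempt_id
        self.running.discard(attempt_id)
        to_kill = sorted(self.running)
        self.running.clear()
        return to_kill


task = SpeculativeMapTask()
task.start(0)  # attempt #1 (attemptId 0), stuck in mapperEnd
task.start(1)  # attempt #2 (attemptId 1), launched speculatively
to_kill = task.on_mapper_end_succeeded(1)
print(to_kill)
```

The point of the sketch is only the ordering: the slower attempt is killed
after the faster attempt's mapperEnd has succeeded, not after its last
pushData.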

Thanks,
Keyong Zhou

 wrote on Sat, Aug 19, 2023 at 22:38:

> Hello Celeborn team,
>
> We are quite close to completing our Celeborn-MR3 client, and I have a
> question on speculative execution in the context of using Celeborn.
>
> MR3 supports speculative execution which allows several task attempts to
> run concurrently. When a task attempt succeeds, all other concurrent task
> attempts are interrupted and killed, so that only one task attempt commits
> its output.
>
> When using Celeborn-MR3, speculative execution sometimes seems to corrupt
> data sent over to Celeborn. Below I describe a sequence of events that
> produce this error. shuffleId, mapId, and partitionId are all fixed,
> whereas attemptId can be either 0 or 1.
>
> 1. Task attempt #1 (with attemptId 0) starts, and calls
> ShuffleClient.pushData().
>
> 2. Task attempt #1 gets stuck at the call of mapperEnd() because
> ShuffleClient fails to send data to Celeborn for an unknown reason, while
> repeatedly producing INFO messages like:
>
> 2023-08-19 11:52:16,119 [celeborn-retry-sender-21] INFO
> org.apache.celeborn.client.ShuffleClientImpl [] - Revive for push data
> success, new location for shuffle 1005007 map 408 attempt 0 partition 0
> batch 1 is location PartitionLocation[
>id-epoch:0-4
>
> host-rpcPort-pushPort-fetchPort-replicatePort:192.168.10.103-39861-45968-46540-44091
>mode:PRIMARY
>peer:(empty)
>storage hint:StorageInfo{type=MEMORY, mountPoint='UNKNOWN_DISK',
> finalResult=false, filePath=}
>mapIdBitMap:null].
>
> 3. As task attempt #1 does not return for a long time, the speculative
> execution mechanism of MR3 kicks in and launches another task attempt
> #2 (with attemptId 1).
>
> 4. Task attempt #2 calls pushData() and succeeds. That is, task attempt #2
> successfully pushes data to Celeborn.
>
> 5. MR3 interrupts and kills task attempt #1. When this occurs, mapperEnd()
> gets interrupted and prints a message like the following:
>
> 2023-08-19 11:52:16,089 [DAG-1-5-1] WARN  RuntimeTask [] -
> LogicalOutput.close() fails on Reducer 12
> org.apache.celeborn.common.exception.CelebornIOException: sleep
> interrupted
>at
>
> org.apache.celeborn.common.write.InFlightRequestTracker.limitZeroInFlight(InFlightRequestTracker.java:155)
>at
>
> org.apache.celeborn.common.write.PushState.limitZeroInFlight(PushState.java:85)
>at
>
> org.apache.celeborn.client.ShuffleClientImpl.limitZeroInFlight(ShuffleClientImpl.java:611)
>at
>
> org.apache.celeborn.client.ShuffleClientImpl.mapEndInternal(ShuffleClientImpl.java:1494)
>at
>
> org.apache.celeborn.client.ShuffleClientImpl.mapperEnd(ShuffleClientImpl.java:1478)
>
> 6. Now, a consumer task attempt tries to read the data pushed by task
> attempt #2. However, it fails to read the data sent by task attempt #2,
> with the following error:
>
> java.io.IOException: Premature EOF from inputStream
>at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:212)
>at
> org.apache.tez.runtime.library.common.shuffle.RssShuffleUtils.shuffleToMemory(RssShuffleUtils.java:47)
>
> Our implementation is quite standard:
>
>inputStream = rssShuffleClient.readPartition(...);
>org.apache.hadoop.io.IOUtils.readFully(inputStream, ..., dataLength);
>
> We double-checked the parameter dataLength and found that it was correctly
> set to the size of data pushed by task attempt #2.
>
> I have two questions:
>
> 1) In the context of using Celeborn, does our implementation violate the
> usage of Celeborn-API? For example, should we prohibit speculative
> execution because Celeborn requires only one task attempt to call
> pushData() at any point of time?
>
> 2) If speculative execution is not allowed, how can we quickly fail
> a task attempt stuck at mapperEnd()? By default, it seems like
> ShuffleClient waits for 1200 seconds, not the default value of 120 seconds:
>
> 2023-08-19 11:51:32,159 [DAG-1-5-1] ERROR
> org.apache.celeborn.common.write.InFlightRequestTracker [] - After waiting
> for 120 ms, there are still 1 batches in flight for hostAndPushPort
> [192.168.10.106:38993], which exceeds the current limit 0.
>
> Thanks a lot,
>
> --- Sungwoo
>
>

Re: Question on ShuffleClient.readPartition()

2023-08-14 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your mail. For your questions:

1. ShuffleClient.readPartition() may not read chunks in the same order that
they are created.
2. There is a way to preserve the order of batches, at the cost of reduced
stability. Alternatively, you can increase the push buffer size to work
around the problem.

In more detail, Celeborn does not guarantee batch order. The reasons are as
follows.

First, the default number of netty connections between a ShuffleClient and a
Celeborn Worker is 2, so although ShuffleClient sends batch1, batch2 in
order, they can arrive at the Worker as batch2, batch1, because they might
be sent through different connections. Even if they arrive in the order
batch1, batch2, they are on different connections and are handled by
different netty threads in the Worker, so the order in which they are
written to file is random. Of course, we can set the number of connections
to 1 by changing celeborn..io.numConnectionsPerPeer, but we still can't
guarantee order because:

Second, when PushData fails, ShuffleClient asynchronously resends the data
to a newly chosen Worker (pair). Say ShuffleClient first sends batch1,
batch2, and both requests fail. ShuffleClient then resends the two batches
asynchronously in random order, and possibly to different workers, so it
can't guarantee their order.

Third, as you said, multiple chunks are fetched at once. But as before, if
we set the number of connections to 1, the order in which chunks are
received is preserved as long as no exception happens.

The situation becomes more complicated if replication is enabled.

Based on the above discussion, I think one way to preserve order is to:
1. disable data replication: celeborn.client.push.replicate.enabled=false
2. set the number of connections per peer to one:
celeborn.data.io.numConnectionsPerPeer=1
3. set the max retries when pushing data fails to one:
celeborn.client.push.revive.maxRetries=1
4. set the max retries when fetching data fails to one:
celeborn.client.fetch.maxRetriesForEachReplica=1
But setting these configs will decrease stability.
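
For convenience, the four settings above can be collected in one place. The
Python snippet below is only a sketch: the keys and values are exactly the
ones listed above, but the "key value" line layout it prints is an
assumption about the target config file, so adapt it to however your client
passes Celeborn settings.

```python
# Keys and values are the ones listed above; the "key value" line layout
# is an assumption, not something confirmed in this thread.
ORDER_PRESERVING_SETTINGS = {
    "celeborn.client.push.replicate.enabled": "false",
    "celeborn.data.io.numConnectionsPerPeer": "1",
    "celeborn.client.push.revive.maxRetries": "1",
    "celeborn.client.fetch.maxRetriesForEachReplica": "1",
}


def render_conf(settings):
    """Render one 'key value' line per setting, in insertion order."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


rendered = render_conf(ORDER_PRESERVING_SETTINGS)
print(rendered)
```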

Another alternative is to use a relatively large push buffer size, e.g. 16m:
celeborn.client.push.buffer.max.size=16m.
Data inside a batch is ordered, so reduce tasks can merge all batches. Since
each batch is relatively large, the cost of merging is acceptable.
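
The merge step on the reducer side can be sketched as a k-way merge. This is
a minimal Python illustration, not Celeborn code: it assumes each batch has
already been decoded into a list of records sorted by key, and the function
name merge_sorted_batches is hypothetical.

```python
import heapq


def merge_sorted_batches(batches):
    """Merge batches that are each sorted internally into one sorted list."""
    # heapq.merge does a lazy k-way merge and relies on every input
    # already being sorted; the arrival order of the batches does not matter.
    return list(heapq.merge(*batches))


# Batches as they might be returned by the reader, out of push order.
arrived = [[7, 8, 9], [1, 4, 6], [2, 3, 5]]
merged = merge_sorted_batches(arrived)
print(merged)
```

With fewer, larger batches there are fewer inputs to merge, which is why a
larger push buffer keeps the merge cost low.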

Thanks,
Keyong Zhou


 wrote on Mon, Aug 14, 2023 at 15:09:

> Hi Celeborn team,
>
> We are implementing a Celeborn-MR3 client, and have a question on the
> order of chunks returned by ShuffleClient.readPartition().
>
> --- Setup
>
> With shuffleId, mapId, attemptId, partitionId all fixed, suppose that a
> mapper with mapIndex M calls ShuffleClient.pushData() several times in the
> following order:
>
>ShuffleClient.pushData(shuffleId, mapId, attemptId, partitionId,
> batch1, ...)
>ShuffleClient.pushData(shuffleId, mapId, attemptId, partitionId,
> batch2, ...)
>ShuffleClient.pushData(shuffleId, mapId, attemptId, partitionId,
> batch3, ...)
>...
>ShuffleClient.pushData(shuffleId, mapId, attemptId, partitionId,
> batchN, ...)
>
> Then a reducer calls ShuffleClient.readPartition() as follows:
>
>ShuffleClient.readPartition(shuffleId, partitionId, attemptId, M, M + 1)
>
> --- Expectation and result
>
> ShuffleClient.readPartition() should read batches in the same order that
> they are written by calls to pushData(). That is, we expect
> ShuffleClient.readPartition() to return:
>
>batch1, batch2, batch3, ..., batchN
>
> However, the same order is not always guaranteed. For example,
> ShuffleClient.readPartition() may return:
>
>batch3, batch1, batch2, ..., batchN
>
> We find that each batch is written to a Celeborn chunk in its entirety,
> but ShuffleClient.readPartition() does not necessarily read Celeborn
> chunks in the same order that they are created.
>
> --- Problem
>
> For unordered data, the order of batches (or chunks) returned by
> ShuffleClient.readPartition() does not matter.
>
> For ordered data, however, the same order should be enforced because a
> mapper sorts output data before sending it to Celeborn in several batches.
> This is a requirement specific to Tez runtime. (I guess Spark does not
> depend on the order of batches because a reducer sorts all records.)
>
> --- Quick fix
>
> We can set celeborn.shuffle.chunk.size to a large value so that a single
> chunk can accommodate all batches. However, this is not a general solution
> because the max size of mapper output is unknown.
>
> --- Questions
>
> 1. Is it a normal behavior of Celeborn that ShuffleClient.readPartition()
> may not read chunks in the same order that they are created?
>
> We are confused because TransportClient.java says:
>
> * Multiple fetchChunk requests may

Re: Q. How to interrupt ShuffleClient and avoid revive requests due to HARD_SPLIT

2023-07-31 Thread Keyong Zhou

To be more detailed, ShuffleClient.pushData is a synchronous API: data is
sent to the wire before it returns, but callbacks are handled asynchronously
in netty's thread pool.

If the callback says that the push should be retried (for example,
HARD_SPLIT, or push failed), the callback thread first checks whether the
shuffle id that the data belongs to has already ended. If not, the callback
thread sends Revive to LifecycleManager, which prepares new workers for the
partition id.

After that, the callback thread submits a retry push task to
ShuffleClientImpl's pushDataRetryPool, which re-pushes the failed data to
the new worker.

So, IMO, it's normal behavior that the driver receives lots of Revive
requests. But if we add the shuffle id to ShuffleClientImpl's
stageEndShuffleSet, then ShuffleClient should not send the requests any more.
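
The callback flow described above can be modelled roughly like this. It is a
simplified, single-threaded Python sketch, not the ShuffleClientImpl code:
the class PushRetryModel and its methods are hypothetical, and the real
client does this work in netty callback threads and a retry thread pool.

```python
class PushRetryModel:
    """Toy model of the retry decision made in a push callback (hypothetical)."""

    def __init__(self):
        self.stage_end_shuffle_set = set()  # shuffle ids already marked as ended
        self.revive_requests = []           # Revive requests "sent" to the driver
        self.retry_queue = []               # retry tasks "submitted" to the pool

    def set_stage_end(self, shuffle_id):
        # Stands in for adding the shuffle id to stageEndShuffleSet.
        self.stage_end_shuffle_set.add(shuffle_id)

    def on_push_needs_retry(self, shuffle_id, partition_id, batch):
        """Called when a callback reports HARD_SPLIT or a failed push."""
        if shuffle_id in self.stage_end_shuffle_set:
            # The shuffle has ended: no Revive, no retry.
            return False
        self.revive_requests.append((shuffle_id, partition_id))
        self.retry_queue.append((shuffle_id, partition_id, batch))
        return True


model = PushRetryModel()
model.on_push_needs_retry(7, 0, b"batch1")  # revived and queued for retry
model.set_stage_end(7)
model.on_push_needs_retry(7, 0, b"batch2")  # dropped: shuffle 7 has ended
print(len(model.revive_requests))
```

In this model, marking the shuffle as ended is the only thing that stops
further Revive requests, which matches the behavior described above.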

Thanks,
Keyong Zhou

Keyong Zhou wrote on Tue, Aug 1, 2023 at 11:40:

> Hi Sungwoo,
>
> Thanks for your message, and apologies for the late reply :P
>
> For your questions:
>
> 1. No, we can't. The thread pool is stopped by calling
> ShuffleClient.shutdown(). In addition, the thread pool is shared by all
> shuffle ids; if you shut it down, the ShuffleClient will not work properly
> for other shuffles.
>
> 2. You can try adding the shuffle id to ShuffleClientImpl's
> stageEndShuffleSet. Currently ShuffleClient does not have an API like
> `setStageEnd`, but I think it's fine to add one. Let me know if you are
> interested in sending a PR :)
>
> BTW, if you are using v0.3.0-incubating, I recommend applying the
> following PR:
> https://github.com/apache/incubator-celeborn/pull/1755
> It's related to StageEnd logic.
>
> Thanks,
> Keyong Zhou
>
>
>
>  wrote on Mon, Jul 31, 2023 at 10:54:
>
>> Hi Celeborn team,
>>
>> We are implementing a Celeborn-MR3 client, and have a question on how to
>> properly unregister a shuffle ID via ShuffleClient. Here is a description
>> of the problem.
>>
>> 1. Suppose that several ShuffleClients are pushing data for a common
>> shuffle ID.
>>
>> 2. For some reason (e.g., a Hive query fails due to OutOfMemoryError or
>> some task fails after several attempts), we decide to interrupt all
>> ShuffleClients.
>>
>> 3. Inside the driver, we call ShuffleClient.unregisterShuffle() with
>> isDriver set to true. Inside MR3 workers, we call
>> ShuffleClient.unregisterShuffle()
>> with isDriver set to false, as well as ShuffleClient.cleanup().
>>
>> Outcome:
>>
>> Inside workers, data push threads continue to run. As a result, the
>> driver keeps receiving revive requests due to HARD_SPLIT.
>>
>> Question:
>>
>> 1. Can we stop data push threads (e.g., celeborn-retry-sender-6) when we
>> call ShuffleClient.unregisterShuffle()?
>>
>> 2. What is a correct way of stopping ShuffleClient for a given shuffle
>> ID? In our experiment, the driver prints thousands of revive requests,
>> and we are not sure if this is normal behavior.
>>
>> Any comment or suggestion will be appreciated very much. Thank you.
>>
>> --- Sungwoo
>>
>>

Re: Q. How to interrupt ShuffleClient and avoid revive requests due to HARD_SPLIT

2023-07-31 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your message, and apologies for the late reply :P

For your questions:

1. No, we can't. The thread pool is stopped by calling
ShuffleClient.shutdown(). In addition, the thread pool is shared by all
shuffle ids; if you shut it down, the ShuffleClient will not work properly
for other shuffles.

2. You can try adding the shuffle id to ShuffleClientImpl's
stageEndShuffleSet. Currently ShuffleClient does not have an API like
`setStageEnd`, but I think it's fine to add one. Let me know if you are
interested in sending a PR :)

BTW, if you are using v0.3.0-incubating, I recommend applying the
following PR:
https://github.com/apache/incubator-celeborn/pull/1755
It's related to StageEnd logic.

Thanks,
Keyong Zhou



 wrote on Mon, Jul 31, 2023 at 10:54:

> Hi Celeborn team,
>
> We are implementing a Celeborn-MR3 client, and have a question on how to
> properly unregister a shuffle ID via ShuffleClient. Here is a description
> of the problem.
>
> 1. Suppose that several ShuffleClients are pushing data for a common
> shuffle ID.
>
> 2. For some reason (e.g., a Hive query fails due to OutOfMemoryError or
> some task fails after several attempts), we decide to interrupt all
> ShuffleClients.
>
> 3. Inside the driver, we call ShuffleClient.unregisterShuffle() with
> isDriver set to true. Inside MR3 workers, we call
> ShuffleClient.unregisterShuffle()
> with isDriver set to false, as well as ShuffleClient.cleanup().
>
> Outcome:
>
> Inside workers, data push threads continue to run. As a result, the driver
> keeps receiving revive requests due to HARD_SPLIT.
>
> Question:
>
> 1. Can we stop data push threads (e.g., celeborn-retry-sender-6) when we
> call ShuffleClient.unregisterShuffle()?
>
> 2. What is a correct way of stopping ShuffleClient for a given shuffle ID?
> In our experiment, the driver prints thousands of revive requests, and we
> are not sure if this is a normal behavior.
>
> Any comment or suggestion will be appreciated very much. Thank you.
>
> --- Sungwoo
>
>

Re: Question of fetching mapper output

2023-07-24 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your update.

Yes, this mailing list is the right place to discuss Celeborn. Please feel
free to ask any questions.

Thanks,
Keyong Zhou

 wrote on Fri, Jul 21, 2023 at 13:54:

> Hi Keyong,
>
> Unlike Spark/Flink clients, we had to directly modify the MR3 runtime code
> to support Celeborn and thus don't add new code to Celeborn. We release
> the
> MR3 runtime code in Github, which could be used just as an example of
> exploiting Celeborn.
>
> The API is clean and the code is also clearly structured and easy to
> follow, but we still have a few questions as we test the MR3-Celeborn
> extension, especially for dealing with exceptions (e.g., Premature EOF
> from inputStream). I hope this mailing list is the right place to ask such
> questions.
>
> Best,
>
> --- Sungwoo
>
> On Mon, 17 Jul 2023, Keyong Zhou wrote:
>
> > Hi Sungwoo,
> >
> > It's really great to hear that! To be honest, we never expected such
> things
> > will happen.
> >
> > Just curious, is it possible that you contribute the integration with MR3
> > to Celeborn community?
> > It will be a great feature for Celeborn, also the community can work
> > together to better support MR3 (and Hive).
> >
> > Thanks,
> > Keyong Zhou
> >
> >
> >
> >  wrote on Sun, Jul 16, 2023 at 22:33:
> >
> >> We have extended the implementation of MR3 so that all partition
> >> inputs can be fetched with a single call, e.g.:
> >>
> >>rssShuffleClient.readPartition(..., 0, 100)
> >>
> >> Now, Hive-MR3 with Celeborn runs as fast as Hive-MR3 with its own
> shuffle
> >> handlers when tested with 10TB TPC-DS benchmark. For some queries, it is
> >> even noticeably faster.
> >>
> >> Thanks,
> >>
> >> --- Sungwoo
> >>
> >> On Thu, 13 Jul 2023, o...@pl.postech.ac.kr wrote:
> >>
> >>> Hi Team,
> >>>
> >>> I have a question on how a reducer should fetch the output of mappers.
> >>> As an example, consider this standard scenario:
> >>>
> >>> 1. There are 100 mappers and 50 reducers.
> >>> 2. Each mapper creates 50 partitions, each of which is to be fetched by
> >> the
> >>> corresponding reducer.
> >>> 3. Each reducer is responsible for a single partition and tries to
> fetch
> >> 100
> >>> partitions (one from each mapper).
> >>>
> >>> In our current implementation, a reducer calls
> >>> rssShuffleClient.readPartition() 100 times (one for each mapper):
> >>>
> >>>  rssShuffleClient.readPartition(..., mapIndex, mapIndex + 1)
> >>>
> >>> My question is: if reducers start after the completion of all mappers,
> >> can we
> >>> call (or should we try to call) rssShuffleClient.readPartition() only
> >> once,
> >>> as in?
> >>>
> >>>  rssShuffleClient.readPartition(..., 0, 100)
> >>>
> >>> My understanding of remote shuffle service (like Magnet for Spark) is
> >> that
> >>> all the partitions destined to the same reducer are automatically
> merged
> >> by
> >>> the shuffle service, so we thought that just a single call might be
> >> enough.
> >>>
> >>> Thanks,
> >>>
> >>> --- Sungwoo Park
> >>>
> >>
> >

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.0-incubating-rc2

2023-07-19 Thread Keyong Zhou

+1 (binding)

I checked
- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).
```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou
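
For anyone who wants to script the "hashes are correct" check, the following
is a minimal Python sketch using only the standard library. It assumes the
.sha512 file starts with the hex digest (the layout that sha512sum writes);
the file name example-artifact.tgz and both helper names are hypothetical,
and the demo creates its own temporary files instead of touching a real
release artifact.

```python
import hashlib
import os
import tempfile


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(artifact_path, checksum_path):
    """Compare an artifact against a checksum file whose first token is the digest."""
    with open(checksum_path, "r", encoding="utf-8") as f:
        expected = f.read().split()[0].lower()
    return sha512_of(artifact_path) == expected


# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "example-artifact.tgz")
    with open(artifact, "wb") as f:
        f.write(b"example release bytes")
    checksum = artifact + ".sha512"
    with open(checksum, "w", encoding="utf-8") as f:
        f.write(f"{sha512_of(artifact)}  example-artifact.tgz\n")
    demo_ok = matches_checksum_file(artifact, checksum)
print(demo_ok)
```

This only covers the checksum step; signature verification still needs gpg
and the project's KEYS file.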

Zhongqiang Chen wrote on Tue, Jul 18, 2023 at 18:34:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.3.0-incubating-rc2
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.3.0-incubating-rc2
>
> The git commit hash:
> 6c5e3f8e7021f409b44e6352142d13e7ac3ffe93
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.3.0-incubating-rc2
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1036
>
> Fingerprint of the PGP key release artifacts are signed with:
> 4A99BC4356A17B2FCFDBF7EE220A9E2898A6A6D4
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Zhongqiang Chen
>
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.3.0-incubating-rc1

2023-07-17 Thread Keyong Zhou

Hi,

I have to vote -1 on this RC because I found issues [1][2][3] when testing rc1,
and I think it's necessary to merge the improvements before we release.

[1] https://github.com/apache/incubator-celeborn/pull/1720
[2] https://github.com/apache/incubator-celeborn/pull/1722
[3] https://github.com/apache/incubator-celeborn/pull/1719

Thanks,
Keyong Zhou

Re: Question of fetching mapper output

2023-07-16 Thread Keyong Zhou

Hi Sungwoo,

It's really great to hear that! To be honest, we never expected this to
happen.

Just curious, would it be possible for you to contribute the MR3 integration
to the Celeborn community?
It would be a great feature for Celeborn, and the community could work
together to better support MR3 (and Hive).

Thanks,
Keyong Zhou



 wrote on Sun, Jul 16, 2023 at 22:33:

> We have extended the implementation of MR3 so that all partition
> inputs can be fetched with a single call, e.g.:
>
>rssShuffleClient.readPartition(..., 0, 100)
>
> Now, Hive-MR3 with Celeborn runs as fast as Hive-MR3 with its own shuffle
> handlers when tested with 10TB TPC-DS benchmark. For some queries, it is
> even noticeably faster.
>
> Thanks,
>
> --- Sungwoo
>
> On Thu, 13 Jul 2023, o...@pl.postech.ac.kr wrote:
>
> > Hi Team,
> >
> > I have a question on how a reducer should fetch the output of mappers.
> > As an example, consider this standard scenario:
> >
> > 1. There are 100 mappers and 50 reducers.
> > 2. Each mapper creates 50 partitions, each of which is to be fetched by
> > the corresponding reducer.
> > 3. Each reducer is responsible for a single partition and tries to fetch
> > 100 partitions (one from each mapper).
> >
> > In our current implementation, a reducer calls
> > rssShuffleClient.readPartition() 100 times (one for each mapper):
> >
> >  rssShuffleClient.readPartition(..., mapIndex, mapIndex + 1)
> >
> > My question is: if reducers start after the completion of all mappers,
> > can we call (or should we try to call) rssShuffleClient.readPartition()
> > only once, as in?
> >
> >  rssShuffleClient.readPartition(..., 0, 100)
> >
> > My understanding of remote shuffle service (like Magnet for Spark) is
> > that all the partitions destined to the same reducer are automatically
> > merged by the shuffle service, so we thought that just a single call
> > might be enough.
> >
> > Thanks,
> >
> > --- Sungwoo Park
> >
>
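
The per-mapper versus single-call fetch discussed in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: the names `read_partition`, `fetch_per_mapper`, and `fetch_once` are hypothetical, the in-memory dictionary stands in for data held by the shuffle service, and the map index range is assumed to be half-open (`[start, end)`), as the `mapIndex, mapIndex + 1` call suggests. It is not Celeborn's API.

```python
from typing import Dict, List

# Toy in-memory model (hypothetical, not Celeborn code): for one reducer
# partition, map each mapper index to the records that mapper pushed.
PartitionData = Dict[int, List[str]]


def read_partition(data: PartitionData, start_map: int, end_map: int) -> List[str]:
    """Return records pushed by mappers with index in [start_map, end_map)."""
    records: List[str] = []
    for map_index in range(start_map, end_map):
        records.extend(data.get(map_index, []))
    return records


def fetch_per_mapper(data: PartitionData, num_mappers: int) -> List[str]:
    """One call per mapper, mirroring readPartition(..., mapIndex, mapIndex + 1)."""
    records: List[str] = []
    for map_index in range(num_mappers):
        records.extend(read_partition(data, map_index, map_index + 1))
    return records


def fetch_once(data: PartitionData, num_mappers: int) -> List[str]:
    """A single call covering all mappers, mirroring readPartition(..., 0, 100)."""
    return read_partition(data, 0, num_mappers)
```

In this model both strategies return the same records; the single call simply avoids repeating the per-call overhead once for every mapper.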

Re: Question on implementing Celeborn client,

2023-07-13 Thread Keyong Zhou

Hi,

If you call endpoint.ask[CommitFilesResponse](message), you should wait for
the response. If the response is successful, you can be sure that the commit
of the files succeeded. Please refer
to CommitHandler.requestCommitFilesWithRetry.
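
As a rough illustration of "wait for the response, then check it", here is a minimal Python sketch. It is not the code of CommitHandler; `CommitResponse`, `commit_with_retry`, and the `ask` callable are hypothetical stand-ins, and `ask` is assumed to block until a response arrives or to raise on failure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CommitResponse:
    """Hypothetical stand-in for a commit response; only carries a success flag."""

    success: bool


def commit_with_retry(ask: Callable[[], CommitResponse], max_attempts: int = 3) -> bool:
    """Call `ask` (assumed to block until a response arrives) up to `max_attempts` times.

    Returns True as soon as one response reports success, False otherwise.
    """
    for _ in range(max_attempts):
        try:
            response = ask()
        except Exception:
            # Treat a failed or timed-out request like an unsuccessful response.
            continue
        if response.success:
            return True
    return False
```

Under this sketch, a caller would let reducers start reading only when `commit_with_retry` returns True.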

Thanks,
Keyong Zhou

 wrote on Thu, Jul 13, 2023 at 15:54:

> > Following are the main steps for a shuffle stage:
> > 1. LifecycleManager sends RequestSlots to Master to request slots for the
> > current shuffle;
> > 2. Master allocates slots among workers for the shuffle and
> > returns RequestSlotsResponse;
> > 3. LifecycleManager sends ReserveSlots to workers; workers do
> > initialization;
> > 4. ShuffleClient pushes data to workers;
> > 5. When map task ends, ShuffleClient sends MapperEnd to LifecycleManager;
> > 6. When all map tasks ended, LifecycleManager sends CommitFiles to
> > workers;
> > 7. When CommitFiles succeeds, reducer tasks can read data from workers.
>
> Hello,
>
> Is there some way to use Celeborn API to check if CommitFiles succeeds in
> step 6? Currently we are testing with TPC-DS 10TB data, and some heavy
> query (query 24) occasionally fails with:
>
>Caused by: java.io.IOException: Premature EOF from inputStream
>
> We are speculating that this error occurs because we miss the check in
> step 6.
>
> Thanks,
>
> --- Sungwoo
>
>

Re: Question on implementing Celeborn client,

2023-07-12 Thread Keyong Zhou

Hi Sungwoo,

Glad to know about your progress! For your questions,

1. In Celeborn's default implementation, ShuffleClient is a singleton in
the Executor and Driver processes; I suggest following this practice.
It's recommended to call ShuffleClient.cleanup(int shuffleId, int
mapId, int attemptId) after each writer finishes because it cleans up
the PushState related to the specific map attempt.

2. ShuffleClient.shutdown should be called inside ShuffleManager.stop.
Better to call it in both Executor and Driver. If Executor
is to exit, it's fine not to call this method.
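
The two points above (one shared client per process, plus a cleanup call per finished map attempt) can be sketched in Python as follows. Everything here is a toy model: `ToyShuffleClient`, `push`, and `tracked_attempts` are hypothetical names, and the dictionary only stands in for the per-attempt PushState; the real ShuffleClient is a Java class with a different interface.

```python
import threading
from typing import Dict, List, Optional, Tuple

AttemptKey = Tuple[int, int, int]  # (shuffle_id, map_id, attempt_id)


class ToyShuffleClient:
    """Hypothetical process-wide client; not the real ShuffleClient."""

    _instance: Optional["ToyShuffleClient"] = None
    _lock = threading.Lock()

    def __init__(self) -> None:
        # Per-attempt buffers, standing in for the PushState mentioned above.
        self._push_state: Dict[AttemptKey, List[bytes]] = {}

    @classmethod
    def get(cls) -> "ToyShuffleClient":
        """Return the single shared instance, creating it on first use."""
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def push(self, shuffle_id: int, map_id: int, attempt_id: int, data: bytes) -> None:
        """Record pushed data under the map attempt that produced it."""
        self._push_state.setdefault((shuffle_id, map_id, attempt_id), []).append(data)

    def cleanup(self, shuffle_id: int, map_id: int, attempt_id: int) -> None:
        """Drop the state of one finished map attempt; other attempts are untouched."""
        self._push_state.pop((shuffle_id, map_id, attempt_id), None)

    def tracked_attempts(self) -> int:
        """Number of map attempts that still hold state."""
        return len(self._push_state)
```

Sharing one instance means all writers in a process reuse the same connections and bookkeeping, which is why the per-attempt cleanup matters: without it, state for finished attempts would accumulate in the shared client.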

Thanks,
Keyong Zhou

 wrote on Wed, Jul 12, 2023 at 21:19:

> Hi Keyong,
>
> Thanks for your quick reply. We thought that Celeborn API was clean and
> very intuitive, and have not encountered serious problems yet for getting
> our system up and running. We are not sure about just a few points that
> are not immediately obvious from Celeborn API (e.g., whether or not
> reducers should wait until the completion of source mappers).
>
> I have a few more questions on when to create/destroy rssShuffleClient and
> will appreciate it very much if you could clarify these points.
>
> 1. When to create rssShuffleClient and when to call
> rssShuffleClient.cleanup()?
>
> In our implementation, a worker (similar to Executors for Spark) creates a
> new rssShuffleClient for each reader/writer, rather than reusing a common
> rssShuffleClient for all readers/writers. Is this the right way to create
> rssShuffleClient?
>
> Then, rssShuffleClient.cleanup() should be called after a reader/writer is
> finished?
>
> 2. When to call rssShuffleClient.shutdown()?
>
> Is it enough to call rssShuffleClient.shutdown() only once inside the
> master (similar to Driver for Spark)?
>
> Thanks,
>
> --- Sungwoo
>
> On Wed, 12 Jul 2023, Keyong Zhou wrote:
>
> > Hi Sungwoo,
> >
> > Thanks for your effort to integrating Celeborn into MR3!
> >
> > For your question, currently a reducer does wait until the completion of
> > all mappers
> > before starting to fetch shuffle data.
> >
> > Briefly speaking, Celeborn client contains two modules:
> > 1. ShuffleClient for push/fetch data, mainly used on Executors for Spark
> > and TaskManager for Flink.
> > 2. LifecycleManager for communicating with Celeborn cluster and managing
> > application-level shuffle meta,
> >mainly used in Driver for Spark or JobMaster for Flink.
> >
> > Following are the main steps for a shuffle stage:
> > 1. LifecycleManager sends RequestSlots to Master to request slots for the
> > current shuffle;
> > 2. Master allocates slots among workers for the shuffle and
> > returns RequestSlotsResponse;
> > 3. LifecycleManager sends ReserveSlots to workers; workers do
> > initialization;
> > 4. ShuffleClient pushes data to workers;
> > 5. When map task ends, ShuffleClient sends MapperEnd to LifecycleManager;
> > > 6. When all map tasks ended, LifecycleManager sends CommitFiles to
> > > workers;
> > 7. When CommitFiles succeeds, reducer tasks can read data from workers.
> >
> > We have to admit that although currently Celeborn supports both Flink and
> > Spark based on the same API, the
> > developer API is not that much clean. It will be very helpful if you send
> > PRs to improve Celeborn during your
> > integration with MR3.
> >
> > Thanks,
> > Keyong Zhou
> >
> >
> >  wrote on Wed, Jul 12, 2023 at 14:53:
> >
> >> Hi Team,
> >>
> >> We are currently implementing a Celeborn client for our application
> >> (called MR3 which is similar to Tez), and have a question on the
> >> internals of Celeborn.
> >>
> >> The question is whether a reducer should wait until the completion of
> >> all mappers before starting to fetch mapper output. From the Celeborn
> >> API, it seems like there is no need to wait until the completion of all
> >> mappers. In other words, after a certain mapper finishes writing all its
> >> output, a reducer can fetch the corresponding output from the mapper,
> >> regardless of the status of other mappers.
> >>
> >> On the other hand, we suspect that trying to fetch the output of a
> >> mapper before the completion of other mappers occasionally triggers
> >> Premature EOF Exception.
> >>
> >> Any comment on this problem will be appreciated very much.
> >>
> >> Thanks,
> >>
> >> --- Sungwoo Park
> >>
> >>
> >

Re: Question on implementing Celeborn client,

2023-07-12 Thread Keyong Zhou

Hi Sungwoo,

Thanks for your effort in integrating Celeborn into MR3!

To answer your question: currently a reducer does wait until the completion
of all mappers before starting to fetch shuffle data.

Briefly speaking, the Celeborn client contains two modules:
1. ShuffleClient for pushing/fetching data, mainly used on Executors for
Spark and TaskManagers for Flink.
2. LifecycleManager for communicating with the Celeborn cluster and managing
application-level shuffle metadata, mainly used in the Driver for Spark or
the JobMaster for Flink.

Following are the main steps for a shuffle stage:
1. LifecycleManager sends RequestSlots to Master to request slots for the
current shuffle;
2. Master allocates slots among workers for the shuffle and
returns RequestSlotsResponse;
3. LifecycleManager sends ReserveSlots to workers; workers do
initialization;
4. ShuffleClient pushes data to workers;
5. When map task ends, ShuffleClient sends MapperEnd to LifecycleManager;
6. When all map tasks ended, LifecycleManager sends CommitFiles to workers;
7. When CommitFiles succeeds, reducer tasks can read data from workers.
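
Steps 5 to 7 above amount to a gate: reads are allowed only after every map task has ended and CommitFiles has succeeded. A minimal Python sketch of that gate follows. `ShuffleStageTracker` and its `commit_files` callback are hypothetical names used for illustration; this is not how LifecycleManager is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class ShuffleStageTracker:
    """Toy driver-side tracker for one shuffle (hypothetical, not Celeborn code)."""

    num_mappers: int
    commit_files: Callable[[], bool]  # stand-in for sending CommitFiles to workers
    ended_mappers: Set[int] = field(default_factory=set)
    committed: bool = False

    def mapper_end(self, map_id: int) -> None:
        """Step 5: record a finished map task; step 6: commit once all have ended."""
        self.ended_mappers.add(map_id)
        if len(self.ended_mappers) == self.num_mappers and not self.committed:
            self.committed = self.commit_files()

    def can_read(self) -> bool:
        """Step 7: reducers may read only after CommitFiles has succeeded."""
        return self.committed
```

Using a set for finished mappers means a repeated MapperEnd for the same map task (for example from a retried attempt) does not trigger the commit early.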

We have to admit that although Celeborn currently supports both Flink and
Spark based on the same API, the developer API is not that clean. It would be
very helpful if you could send PRs to improve Celeborn during your
integration with MR3.

Thanks,
Keyong Zhou


 wrote on Wed, Jul 12, 2023 at 14:53:

> Hi Team,
>
> We are currently implementing a Celeborn client for our application
> (called MR3 which is similar to Tez), and have a question on the internals
> of Celeborn.
>
> The question is whether a reducer should wait until the completion of all
> mappers before starting to fetch mapper output. From the Celeborn API, it
> seems like there is no need to wait until the completion of all mappers.
> In other words, after a certain mapper finishes writing all its output, a
> reducer can fetch the corresponding output from the mapper, regardless of
> the status of other mappers.
>
> On the other hand, we suspect that trying to fetch the output of a mapper
> before the completion of other mappers occasionally triggers Premature EOF
> Exception.
>
> Any comment on this problem will be appreciated very much.
>
> Thanks,
>
> --- Sungwoo Park
>
>

Re: [DISCUSSION] Release Apache Celeborn(Incubating) 0.3.0-incubating-rc0

2023-07-02 Thread Keyong Zhou

Thanks Zhongqiang Chen for being our release manager for 0.3.0! I have no
problem with releasing this version.

I agree with Cheng Pan, we can prepare the release note first.

Thanks,
Keyong Zhou

Cheng Pan wrote on Fri, Jun 30, 2023 at 15:18:

> Thanks Zhongqiang for driving this release.
>
> I think it’s good to start the first RC next week, in the meanwhile, we
> can prepare the release note first, you may want to create a Google
> Docs/GitHub discussion or other online docs to make it easy for review and
> collaboration.
>
> Thanks,
> Cheng Pan
>
> > On Jun 30, 2023, at 12:28, Zhongqiang Chen wrote:
> >
> > Hi Celeborn community,
> >
> > I am pleased to announce that we are planning to release version
> > 0.3.0-incubating-rc0 of Apache Celeborn (Incubating)!
> >
> > Since version 0.2.1, 28+ contributors have submitted 440+ commits. The
> > most essential features are, first, a unified shuffle service that
> > supports both Spark and Flink and, second, significant upgrades in
> > stability and performance.
> >
> > Before we proceed with the release, I’d like to start a discussion to
> > ensure that everything is in order and to address any outstanding issues.
> >
> > Please take a moment to review the current state of the project and
> > share any concerns or questions you may have. Here are a few items that
> > we would like to discuss:
> >
> > 1. Are there any outstanding bugs that need to be addressed before the
> > release?
> >
> > 2. Are there any compatibility issues that we should be aware of?
> >
> >   We have prepared a migration doc from 0.2.1 to 0.3.0:
> >   https://github.com/apache/incubator-celeborn/blob/main/docs/migration.md
> >
> > 3. Are there any new features that should be included in this release?
> >
> > If you have any thoughts on these topics, or anything else related to
> > the release, please reply to this email with your feedback. Your input is
> > greatly appreciated!
> >
> > We plan to freeze code next week and then release version 0.3.0 of
> > Apache Celeborn (Incubating), so please provide your feedback as soon as
> > possible. Thank you for your contributions to this project, and I look
> > forward to hearing from you soon.
> >
> > Best regards,
> >
> > Zhongqiang Chen
>

Re: [DISCUSS] Allow external contributors to run CI without approval

2023-06-16 Thread Keyong Zhou

+1

Thanks,
Keyong Zhou

Ethan Feng wrote on Fri, Jun 16, 2023 at 16:27:

> Recent moves by Apache Infra have changed the policy on GitHub Actions from
> "Only requires approval first time" to "Requires approval every time".
>
> I think this is not friendly for getting folks involved in
> the project and this increased the cost for committers to process the
> pull requests.
>
> Please respond to this thread if you are in support of going back to
> "Only requires approval the first time" or if you don't believe this is a
> good idea please respond as well.
>
> Thanks,
> Ethan Feng
>

Re: Committers: Please use `dev/merge_pr.py` to merge new PRs

2023-06-02 Thread Keyong Zhou

Thanks @Cheng Pan for introducing this nice tool!

Keyong Zhou

Cheng Pan wrote on Fri, Jun 2, 2023 at 22:38:

> Hi Celeborn Committers,
>
> A PR merge tool `dev/merge_pr.py` was added to the Celeborn git
> repo[1][2], it aims to simplify the PR merge and backport process, and
> improve the git commit history.
>
> AFAIK, this tool is originally from Apache Parquet, then, it was borrowed
> and modified by Apache Spark, Apache Kyuubi.
>
> Compared with our current approach of using GitHub squash to merge PRs,
> this tool has the following pros:
>
> - it squashes all commits into one, the same as GitHub squash, but also
> fixes the PR title into the canonical format:
>   [CELEBORN-][LABEL] Title of the pull request
> - it’s an interactive script: the committer just needs to follow the
> prompts and answer y/n to complete the PR backport,
>   JIRA assignment, and JIRA fixed version update
> - it preserves the author and committer information in the git history;
> GitHub squash always marks the committer as
>   GitHub , which is really bad.
> - it preserves the PR description as the git commit message body; again,
> please fill in the PR description seriously :)
>
> Some setup procedures are required to use this merge tool:
>
> 1. invoking it at the celeborn git project root dir
> 2. naming the upstream repo as “apache”, you can check it by `git remote
> -v`, the desired output should be like
> apache-celeborn git:(main) git remote -v
> apache g...@github.com:apache/incubator-celeborn.git (fetch)
> apache g...@github.com:apache/incubator-celeborn.git (push)
> 3. export environment variables ASF_USERNAME ASF_PASSWORD which is used
> for JIRA authentication
> 4. export environment variable GITHUB_OAUTH_KEY which is used for GitHub
> authentication.
> You can create an OAuth key at https://github.com/settings/tokens
> 5. install python JIRA dependencies by `pip install jira` or `pip install
> -r requirements.txt`
>
> As an example, [3] is merged by this tool.
>
> Please let me know if you have any questions or concerns w/ this tool.
>
> [1] https://issues.apache.org/jira/browse/CELEBORN-623
> [2] https://github.com/apache/incubator-celeborn/pull/1539
> [3]
> https://github.com/apache/incubator-celeborn/commit/67762783d0d51acb147f35e9a59e2bfec48ec04b
>
> Thanks,
> Cheng Pan
>
>
>
>
>
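
The setup steps quoted above lend themselves to a small preflight check before running the merge tool. The sketch below is not part of `dev/merge_pr.py`; the helper names `missing_env_vars` and `has_apache_remote` are hypothetical, and only the three environment variable names and the remote name "apache" come from the message.

```python
import os
from typing import List, Mapping

# Variable names are taken from the setup steps in the message above.
REQUIRED_VARS = ("ASF_USERNAME", "ASF_PASSWORD", "GITHUB_OAUTH_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def has_apache_remote(remote_output: str) -> bool:
    """Check the text printed by `git remote -v` for a remote named 'apache'."""
    return any(line.split()[:1] == ["apache"] for line in remote_output.splitlines())
```

A caller could pass the captured output of `git remote -v` to `has_apache_remote` and refuse to continue while `missing_env_vars()` is non-empty.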

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.1-incubating-rc0

2023-03-20 Thread Keyong Zhou

+1 (binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).

```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

On 2023/03/17 09:17:25 rexxiong wrote:
> Hi Celeborn community,
> 
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.2.1-incubating-rc0
> 
> The git tag to be voted upon:
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.1-incubating-rc0
> 
> The git commit hash:
> 93898d0020899ba7e98ededaad14a6043cec07c9
> 
> The source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.1-incubating-rc0
> 
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1010
> 
> Fingerprint of the PGP key release artifacts are signed with:
> B4AF9302A52006F3711C784388BDDBB8C6724EC9
> 
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> 
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
> 
> Please vote accordingly:
> 
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
> 
> Checklist for release:
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
> 
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
> 
> Thanks,
> rexxiong
>

[ANNOUNCE] Add zhongqiangchen(Zhongqiang Chen) as new committer

2023-03-14 Thread Keyong Zhou

Hi Celeborn(-incubating) community,

I'm very excited to announce that recently we
added zhongqiangchen(Zhongqiang Chen) as our new committer!

zhongqiangchen has kept contributing to Celeborn for nearly five months,
mainly on the support of Flink.
We look forward to zhongqiangchen continuing to contribute to the
project, pushing Celeborn to the next level together with all contributors
of the community!

Also, we are looking forward to adding more and more committers to our project
:)
Thanks!
Keyong Zhou

[ANNOUNCE] Add rexxiong(Jiashu Xiong) as new committer

2023-03-14 Thread Keyong Zhou

Hi Celeborn(-incubating) community,

I'm very excited to announce that recently we added rexxiong(Jiashu Xiong)
as our new committer!

rexxiong has kept contributing to Celeborn for nearly five months, mainly on
the support of Flink.
We look forward to rexxiong continuing to contribute to the project,
pushing Celeborn to the next level together with all contributors of the
community!

Also, we are looking forward to adding more and more committers to our project
:)

Thanks!
Keyong Zhou

Re: [NOTICE] Fix solution about rare data loss in release 0.2.0.

2023-03-08 Thread keyong zhou

Hi Yu,

We do have a plan for a quick fix; before that, we'd like to do more tests
and collect more feedback for about a week.

Thanks,
Keyong Zhou

Yu Li wrote on Thu, Mar 9, 2023 at 13:48:

> Thanks for the note Ethan.
>
> I'm not sure but maybe it is worth a quick bug fix release, i.e. 0.2.1? Any
> plan for that?
>
> Best Regards,
> Yu
>
>
> On Wed, 8 Mar 2023 at 11:55, Ethan Feng wrote:
>
> > Hello users,
> > We regret to inform you that we found a bug[2] in release 0.2.0
> > yesterday. In rare cases, the bug[2] causes data loss when reading from
> > skewed partitions on a high-pressure cluster.
> > You need to apply this patch[1] to your Celeborn client jar. We'll
> > ship this patch in our next release.
> > Feel free to contact us if you encounter any other questions.
> >
> > Regards,
> > Ethan Feng
> >
> > ---
> > 1. https://github.com/apache/incubator-celeborn/pull/1315
> > 2. https://issues.apache.org/jira/browse/CELEBORN-383
> >
>

Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-03 Thread Keyong Zhou

Hi Yu,

Thanks for the reminder, we have already fixed it :)

https://github.com/apache/incubator-celeborn/commit/9aabb43699225d47c1470027b98a42210df914e8
https://github.com/apache/incubator-celeborn/commit/dcf1e018f6352a64250c64d64e21e3eae1f8fa14

Thanks,
Keyong Zhou

Yu Li wrote on Fri, Mar 3, 2023 at 18:03:

> Hi all,
>
> Since Spark has already taken action to fix the issue, do we have a PR to
> fix ours? Thanks.
>
> Some background for easier reference:
> https://lists.apache.org/thread/6mcwfb2gmxpj1r281pd13gs35wjqfxom
>
> Best Regards,
> Yu
>
>
> -- Forwarded message -
> From: Dongjoon Hyun 
> Date: Thu, 2 Mar 2023 at 16:14
> Subject: Re: [Question] LimitedInputStream license issue in Spark source.
> To: 
> Cc: incubator general apache , Cheng Pan <
> cheng...@apache.org>, Ethan Feng , Willem Jiang <
> ningji...@apache.org>, 
>
>
> Thank you. Here is the PR to fix that.
>
> https://github.com/apache/spark/pull/40249
> [SPARK-42649][CORE] Remove the standard Apache License header from the top
> of third-party source files
>
> Dongjoon.
>
>
> On Wed, Mar 1, 2023 at 11:53 PM  wrote:
>
> > Hi,
> >
> > See https://www.apache.org/legal/src-headers.html#3party - "Do not add
> > the standard Apache License header to the top of third-party source
> > files." and "Minor modifications/additions to third-party source files
> > should typically be licensed under the same terms as the rest of the
> > third-party source for convenience."
> >
> > Kind Regards,
> > Justin
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc5

2023-02-21 Thread Keyong Zhou

+1 (binding)

I checked

- links are valid.
- "incubating" is in the name.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.

- signatures are good.
```
gpg --verify apache-celeborn-0.2.0-incubating-source.tgz.asc
gpg --verify apache-celeborn-0.2.0-incubating-bin.tgz.asc
```

- checksums are good.
```
sha512sum --check apache-celeborn-0.2.0-incubating-bin.tgz.sha512
sha512sum --check apache-celeborn-0.2.0-incubating-source.tgz.sha512
```

- build success from source code (macOS).
```
build/make-distribution.sh --release
```
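
The checksum step above can also be reproduced without `sha512sum`. The following Python sketch assumes the `.sha512` file uses the GNU coreutils layout (`<hex digest>  <file name>` per line), which is what `sha512sum --check` expects; the helper names `sha512_of` and `check_sha512_file` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sha512_file(checksum_file: Path) -> bool:
    """Verify every '<hex digest>  <file name>' line in a checksum file.

    File names are resolved relative to the checksum file's directory.
    An empty checksum file is treated as a failure.
    """
    lines = [line for line in checksum_file.read_text().splitlines() if line.strip()]
    for line in lines:
        expected, name = line.split(maxsplit=1)
        # A leading '*' marks binary mode in coreutils output; drop it.
        target = checksum_file.parent / name.strip().lstrip("*")
        if sha512_of(target) != expected.lower():
            return False
    return bool(lines)
```

For example, `check_sha512_file(Path("apache-celeborn-0.2.0-incubating-bin.tgz.sha512"))` would return True only if the tarball next to it matches the recorded digest. This does not replace the `gpg --verify` step, which checks who signed the artifact rather than whether it is intact.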

Ethan Feng wrote on Mon, Feb 20, 2023 at 23:28:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.2.0-incubating-rc5
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc5
>
> The git commit hash:
> 52527756c3d2262cfd3e313272b82216c505c7e2
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc5/
>
> Fingerprint of the PGP key release artifacts are signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> Starting with my +1 (binding):
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Ethan Feng
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc4

2023-02-07 Thread Keyong Zhou

+1 (non-binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).

```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Ethan Feng wrote on Wed, Feb 8, 2023 at 10:37:

> Hello Incubator Community,
>
> This is a call for a vote to release Apache Celeborn(Incubating)
> version 0.2.0-incubating-rc4
>
> The Apache Celeborn community has voted on and approved a proposal to
> release
> Apache Celeborn(Incubating) version 0.2.0-incubating-rc4
>
> We now kindly request the Incubator PMC members review and vote on this
> incubator release.
>
> Celeborn community vote thread:
> • https://lists.apache.org/thread/3qv3byyy1rqv7l9qsx02gbto1n9ymd1h
>
> Vote result thread:
> • https://lists.apache.org/thread/k4jvqdd0dwk9dc7t6n80tgqv99fmwok7
>
> The release candidate:
> •
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc4
>
> Git tag for the release:
> •
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc4
>
> Public keys file:
> • https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The change log is available in:
> •
> https://github.com/apache/incubator-celeborn/compare/v0.1.4...v0.2.0-incubating-rc4
>
> The vote will be open for at least 72 hours or until the necessary number
> of votes are reached.
>
> Please vote accordingly:
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> For a more detailed checklist, please refer to:
> •
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
>
> For steps to validate the release, please refer to:
> • https://www.apache.org/info/verification.html
>
> Thanks,
> On behalf of Apache Celeborn(Incubating) community
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc4

2023-02-06 Thread Keyong Zhou

+1 (binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).

```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Ethan Feng wrote on Sat, Feb 4, 2023 at 21:29:

> Hi Celeborn community,
>
> This is a call for a vote to release Apache Celeborn (Incubating)
> 0.2.0-incubating-rc4
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc4
>
> The git commit hash:
> 282e0b0bbc76fa9339e931b0b252b6cbb16dddf5
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc4/
>
> Fingerprint of the PGP key release artifacts are signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> Starting with my +1 (binding):
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Ethan Feng
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc3

2023-01-18 Thread Keyong Zhou

+1 (binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- PGP keys are good.
- hashes are correct.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).

```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou


Ethan Feng wrote on Wed, Jan 18, 2023 at 22:01:

> Hi Celeborn community,
>
> This is a call for the vote to release Apache Celeborn (Incubating)
> 0.2.0-incubating-rc3
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc3
>
> The git commit hash:
> 98b356f599b9e8960d755cb6add8f6934345d2a8
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc3/
>
> The fingerprint of the PGP key release artifacts is signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> Starting with my +1 (binding):
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Ethan Feng
>

Call for UT

2023-01-11 Thread Keyong Zhou

Hi community,

Currently the code coverage is quite low, and I think it's time to boost the
UT coverage. Any effort will be appreciated, thanks!

Thanks,
Keyong Zhou

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc2

2023-01-08 Thread keyong zhou

+1 (non-binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build success from source code (macOS).

```
./build/make-distribution.sh --release
```

Thanks,
Keyong Zhou

Ethan Feng wrote on Mon, Jan 9, 2023 at 12:08:

> Hello Incubator Community,
>
> This is a call for a vote to release Apache Celeborn(Incubating) version
> 0.2.0-incubating-rc2
>
> The Apache Celeborn community has voted on and approved a proposal to
> release
> Apache Celeborn(Incubating) version 0.2.0-incubating-rc2
>
> We now kindly request the Incubator PMC members review and vote on this
> incubator release.
>
> Celeborn community vote thread:
> • https://lists.apache.org/thread/v4t7r6h8043s0hvhhvlzhb35nr9gvshr
>
> Vote result thread:
> • https://lists.apache.org/thread/7n0gw4dz8d5p852sfok9f61cqods6h76
>
> The release candidate:
> • https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc2
>
> Git tag for the release:
> • https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc2
>
> Public keys file:
> • https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The change log is available in:
> • https://github.com/apache/incubator-celeborn/compare/v0.1.4...v0.2.0-incubating-rc2
>
> The vote will be open for at least 72 hours or until the necessary number
> of votes are reached.
>
> Please vote accordingly:
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> For a more detailed checklist, please refer to:
> • https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
>
> For steps to validate the release, please refer to:
> • https://www.apache.org/info/verification.html
>
> Thanks,
> On behalf of Apache Celeborn(Incubating) community
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc2

2023-01-05 Thread Keyong Zhou

+1 (binding)

I checked

- git commit hash is correct.
- links are valid.
- "incubating" is in the name.
- LICENSE looks good.
- NOTICE looks good.
- DISCLAIMER exists.
- build from source code succeeds (macOS).

```
./build/make-distribution.sh --release
```
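
The "git commit hash is correct" check can be sketched as below. `tag_matches_commit` is a hypothetical helper name, and the sketch assumes a local clone in which the release-candidate tag has already been fetched (for example with `git fetch --tags`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if <ref> in the clone at <repo_dir>
# resolves to the commit <expected_hash>.
# Returns 0 on match, 1 on mismatch, 2 if the ref cannot be resolved.
tag_matches_commit() {
  local repo_dir="$1" ref="$2" expected="$3" actual
  # ^{commit} peels an annotated tag down to the commit it points at.
  actual="$(git -C "$repo_dir" rev-parse --verify --quiet "${ref}^{commit}")" || return 2
  [ "$actual" = "$expected" ]
}
```

With the tag and hash from the vote email quoted below, a call would look like `tag_matches_commit path/to/incubator-celeborn v0.2.0-incubating-rc2 fcd1e8ac8723859648e4597092328fa1ed3b96c2`, where the clone path is a placeholder.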

Thanks,
Keyong Zhou

On Thu, Jan 5, 2023 at 16:08 Cheng Pan wrote:

>  +1 (binding)
>
> I checked
>
> - links are valid.
> - "incubating" is in the name.
> - LICENSE looks good.
> - NOTICE looks good.
> - DISCLAIMER exists.
> - signatures are good.
> ```
> wget https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> gpg --import KEYS
> gpg --verify apache-celeborn-0.2.0-incubating-source.tgz.asc
> gpg --verify apache-celeborn-0.2.0-incubating-bin.tgz.asc
> ```
> - checksums are good.
> ```
> sha512sum --check apache-celeborn-0.2.0-incubating-bin.tgz.sha512
> sha512sum --check apache-celeborn-0.2.0-incubating-source.tgz.sha512
> ```
>
> - build from source code succeeds (openjdk-8, macOS aarch64).
>
> ```
> build/make-distribution.sh --release
> ```
>
> Thanks,
> Cheng Pan
>
>
> On Jan 5, 2023 at 16:04:00, Ethan Feng  wrote:
>
> > Hi Celeborn community,
> >
> > This is a call for vote to release Apache Celeborn (Incubating)
> > 0.2.0-incubating-rc2
> >
> > The git tag to be voted upon:
> >
> > https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc2
> >
> > The git commit hash:
> > fcd1e8ac8723859648e4597092328fa1ed3b96c2
> >
> > The source and binary artifacts can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc2/
> >
> > Fingerprint of the PGP key that the release artifacts are signed with:
> > FCF20BB29C7BEFDF58F998F76392F71F37356FA0
> >
> > My public key to verify signatures can be found in:
> > https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
> >
> > The vote will be open for at least 72 hours or until the necessary
> > number of votes are reached.
> >
> > Please vote accordingly:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove (and the reason)
> >
> > Checklist for release:
> >
> > https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> > Steps to validate the release:
> > https://www.apache.org/info/verification.html
> >
> > Starting with my +1 (binding):
> >
> > * Download links, checksums and PGP signatures are valid.
> > * Source code distributions have correct names matching the current
> > release.
> > * Release files have the word incubating in their name.
> > * DISCLAIMER, LICENSE and NOTICE files are correct.
> > * All files have license headers if necessary.
> > * No unlicensed compiled archives bundled in source archive.
> > * The source tarball matches the git tag.
> > * Build from source is successful.
> >
> > Thanks,
> > Ethan Feng
> >
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc1

2023-01-03 Thread Keyong Zhou

Also, I think we should not include images in the source tarball.

Thanks,
Keyong Zhou

On Wed, Jan 4, 2023 at 11:30 Keyong Zhou wrote:

> Hi Feng,
>
> When I tried to decompress the apache-celeborn-0.2.0-incubating-bin.tgz on
> CentOS I got the following error:
>
> ._apache-celeborn-0.2.0-incubating-bin
> tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’
> apache-celeborn-0.2.0-incubating-bin/
> apache-celeborn-0.2.0-incubating-bin/._jars
> tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’
> apache-celeborn-0.2.0-incubating-bin/jars/
> apache-celeborn-0.2.0-incubating-bin/._docker
> tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’
>
>
> I think we should re-build the tarball and restart the vote.
>
> Thanks,
> Keyong Zhou
>
> On Tue, Jan 3, 2023 at 21:00 Ethan Feng wrote:
>
>> Hi Celeborn community,
>>
>> This is a call for vote to release Apache Celeborn (Incubating)
>> 0.2.0-incubating-rc1
>>
>> The git tag to be voted upon:
>>
>> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc1
>>
>> The git commit hash:
>> 7c0664ccd0c296cb14b63d2f0c9733c0deb538b9
>>
>> The source and binary artifacts can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc1/
>>
>> Fingerprint of the PGP key that the release artifacts are signed with:
>> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>>
>> My public key to verify signatures can be found in:
>> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>>
>> The vote will be open for at least 72 hours or until the necessary
>> number of votes are reached.
>>
>> Please vote accordingly:
>>
>> [ ] +1 approve
>> [ ] +0 no opinion
>> [ ] -1 disapprove (and the reason)
>>
>> Checklist for release:
>>
>> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
>> Steps to validate the release:
>> https://www.apache.org/info/verification.html
>>
>> Starting with my +1 (binding):
>>
>> * Download links, checksums and PGP signatures are valid.
>> * Source code distributions have correct names matching the current
>> release.
>> * Release files have the word incubating in their name.
>> * DISCLAIMER, LICENSE and NOTICE files are correct.
>> * All files have license headers if necessary.
>> * No unlicensed compiled archives bundled in source archive.
>> * The source tarball matches the git tag.
>> * Build from source is successful.
>>
>> Thanks,
>> Ethan Feng
>>
>

Re: [VOTE] Release Apache Celeborn(Incubating) 0.2.0-incubating-rc1

2023-01-03 Thread Keyong Zhou

Hi Feng,

When I tried to decompress the apache-celeborn-0.2.0-incubating-bin.tgz on
CentOS I got the following error:

._apache-celeborn-0.2.0-incubating-bin
tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’
apache-celeborn-0.2.0-incubating-bin/
apache-celeborn-0.2.0-incubating-bin/._jars
tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’
apache-celeborn-0.2.0-incubating-bin/jars/
apache-celeborn-0.2.0-incubating-bin/._docker
tar: Ignoring unknown extended header keyword ‘LIBARCHIVE.xattr.com.apple.provenance’


I think we should re-build the tarball and restart the vote.
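
A rebuilt tarball could be screened for the `._*` entries shown above before it is uploaded. The following is a minimal sketch; `list_appledouble_entries` is a hypothetical helper name, and it assumes a gzip-compressed tarball and a `tar` that accepts `-tzf`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list AppleDouble ("._*") entries in a gzipped tarball.
# Exit status 0 means at least one such entry was found.
list_appledouble_entries() {
  local tarball="$1"
  # Warnings about unknown extended header keywords go to stderr; drop them.
  # The pattern matches "._" at the start of a path or right after a "/".
  tar -tzf "$tarball" 2>/dev/null | grep -E '(^|/)\._'
}
```

A clean archive produces no output and a non-zero exit status. On macOS, setting `COPYFILE_DISABLE=1` for the `tar` invocation is commonly reported to keep these entries out of the archive, though that is worth confirming against the local `tar` man page.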

Thanks,
Keyong Zhou

On Tue, Jan 3, 2023 at 21:00 Ethan Feng wrote:

> Hi Celeborn community,
>
> This is a call for vote to release Apache Celeborn (Incubating)
> 0.2.0-incubating-rc1
>
> The git tag to be voted upon:
>
> https://github.com/apache/incubator-celeborn/releases/tag/v0.2.0-incubating-rc1
>
> The git commit hash:
> 7c0664ccd0c296cb14b63d2f0c9733c0deb538b9
>
> The source and binary artifacts can be found at:
>
> https://dist.apache.org/repos/dist/dev/incubator/celeborn/v0.2.0-incubating-rc1/
>
> Fingerprint of the PGP key that the release artifacts are signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/incubator/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes are reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Checklist for release:
>
> https://cwiki.apache.org/confluence/display/INCUBATOR/Incubator+Release+Checklist
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> Starting with my +1 (binding):
>
> * Download links, checksums and PGP signatures are valid.
> * Source code distributions have correct names matching the current
> release.
> * Release files have the word incubating in their name.
> * DISCLAIMER, LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> Thanks,
> Ethan Feng
>
