[VOTE] Release Apache Celeborn 0.5.0-rc2

2024-06-11 Thread Ethan Feng
Hello, Celeborn community,

This is a call for a vote to release Apache Celeborn
0.5.0-rc2

The git tag to be voted upon:
https://github.com/apache/celeborn/releases/tag/v0.5.0-rc2

Source and binary artifacts can be found at:
https://dist.apache.org/repos/dist/dev/celeborn/v0.5.0-rc2

The git commit hash:
68c503eb0023e274f8ae09bf4c2687f6a0c01a25

The staging repo:
https://repository.apache.org/content/repositories/orgapacheceleborn-1075/

The fingerprint of the PGP key release artifacts is signed with:
FCF20BB29C7BEFDF58F998F76392F71F37356FA0

My public key to verify signatures can be found in:
https://dist.apache.org/repos/dist/release/celeborn/KEYS

The vote will be open for at least 72 hours or until the necessary
number of votes is reached.

Please vote accordingly:

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and the reason)
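
For readers tallying the thread, a minimal sketch of the pass condition follows. It assumes the usual ASF release-vote rule (at least three binding +1 votes and more binding +1 than -1 votes), which this mail does not spell out; the function name is hypothetical.

```python
def release_vote_passes(binding_plus_one: int, binding_minus_one: int) -> bool:
    """Return True if a release vote passes under the assumed ASF rule.

    Assumption: at least three binding +1 votes and strictly more binding +1
    votes than binding -1 votes. +0 votes do not affect the outcome.
    """
    return binding_plus_one >= 3 and binding_plus_one > binding_minus_one


if __name__ == "__main__":
    # Three binding +1 votes and no -1 votes: the vote passes.
    print(release_vote_passes(3, 0))
    # Only two binding +1 votes: not enough yet, keep the vote open.
    print(release_vote_passes(2, 0))
```

The 72-hour minimum voting window is a separate condition and is not modeled here.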

Steps to validate the release:
https://www.apache.org/info/verification.html

* Download links, checksums, and PGP signatures are valid.
* Source code distributions have correct names matching the current release.
* LICENSE and NOTICE files are correct.
* All files have license headers if necessary.
* No unlicensed compiled archives bundled in the source archive.
* The source tarball matches the git tag.
* Build from source is successful.
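
The checksum item in the checklist above can be sketched in Python as follows. This is only an illustration: it assumes the checksum file starts with the hex SHA-512 digest as its first whitespace-separated token, and the file names in the usage note are hypothetical. PGP signature verification still needs gpg and the KEYS file and is not covered here.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare an artifact's SHA-512 with the digest stored in a checksum file.

    Assumption: the first whitespace-separated token of the checksum file is
    the hex digest. Adjust the parsing if the release uses another layout.
    """
    tokens = checksum_file.read_text().split()
    if not tokens:
        return False
    return sha512_of(artifact) == tokens[0].lower()
```

Usage would look like `checksum_matches(Path("artifact.tgz"), Path("artifact.tgz.sha512"))`, with both files downloaded from the dist URL above.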

There are additional tests:
* Performance test: no regression
1 TB TPC-DS, 0.5.0 vs 0.4.1: 2042 s vs 2050 s
1.1 TB pure shuffle, 0.5.0 vs 0.4.1: 11.8 min vs 11.8 min

* Result correctness test passed
1TB TPC-DS runs concurrently, the results are identical.

* Usability test passed
Rolling upgrade from version 0.4.1 to 0.5.0 succeeded.
The metrics system works as expected.

* Stability test passed
Random worker failures, Celeborn works as expected.
Random master failures, Celeborn works as expected.
Master meta corrupted, Celeborn works as expected.

* Compatibility test passed
Celeborn server 0.5.0 works fine with Celeborn client 0.4.1.
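
To put the performance figures above in relative terms, here is a small worked calculation. The helper name is hypothetical; the inputs are the numbers reported in this mail.

```python
def relative_change(new: float, old: float) -> float:
    """Return (new - old) / old; negative means the new value is smaller."""
    return (new - old) / old


# 1 TB TPC-DS: 2042 s on 0.5.0 versus 2050 s on 0.4.1.
print(f"TPC-DS: {relative_change(2042, 2050):+.2%}")  # prints "TPC-DS: -0.39%"

# 1.1 TB pure shuffle: 11.8 min on both versions.
print(f"Pure shuffle: {relative_change(11.8, 11.8):+.2%}")  # prints "Pure shuffle: +0.00%"
```

A difference of about 0.4% on a single run is small, which is consistent with the "no regression" conclusion.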


Regards,
Ethan Feng


Re: Re: [VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Keyong Zhou
+1

Thanks for the proposal!

Regards,
Keyong Zhou

On Wed, Jun 12, 2024 at 1:02 PM Nicholas Jiang wrote:

> +1. Looking forward to Celeborn CLI.
>
>
>
>
> Regards,
>
> Nicholas Jiang
>
>
> At 2024-06-12 12:26:34, "Aravind Patnam"  wrote:
> >Hi all,
> >
> >Sorry, this is the correct link to the Celeborn CLI CIP
> ><
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>
> >.
> >
> >Thanks,
> >Aravind
> >
> >On Tue, Jun 11, 2024 at 9:24 PM Aravind Patnam 
> wrote:
> >
> >> Hi all,
> >>
> >> This is a call to vote to contribute the Celeborn CLI CIP
> >> <
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals>
> to
> >> Apache Celeborn.
> >>
> >> Please do vote accordingly:
> >> [ ] +1 approve
> >> [ ] +0 no opinion
> >> [ ] -1 disapprove (and the reason)
> >>
> >> Thanks once again!!
> >>
> >> Aravind
> >>
> >
> >
> >--
> >Aravind K. Patnam
>


Re: [DISCUSSION] CIP-6: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-11 Thread Nicholas Jiang
Hi Yuxin,

Thanks for explaining the questions above. IMO, the design details of the
Celeborn implementation should be provided in the CIP rather than the FLIP so
that community developers can review them. Likewise, although some public
configurations are set in Flink, those that affect the Celeborn Flink client
should also be listed in the CIP so that reviewers do not have to spend time
reading the FLIP. Anyway, your detailed reply answered my questions. Thanks.

+1 from me. Looking forward to this integration. I would like to consider using
this feature once it is ready.

Regards,
Nicholas Jiang

On 2024/06/11 08:09:06 Yuxin Tan wrote:
> Hi Nicholas,
> 
> Thanks for the valuable feedback.
> 
> > 1. Could you describe in detail what functions the relevant components
> > mentioned in Proposed Changes
>
> These components are only the pluggable implementations of the Celeborn tier.
> The details and the mechanisms of switching between tiers are in the previous
> FLIP[1]. Celeborn is added to hybrid shuffle as a new tier and shares
> similarities with the existing tiers, such as the Memory tier and the Disk
> tier. In this tiered storage, agents serve as the entry points of interaction
> between the framework and the different tiers. For instance,
> CelebornProducerAgent acts as the entry point for producers to emit data into
> the tier. If you still have similar questions after reading that FLIP, please
> feel free to let me know.
> 
> > 2. Can you briefly introduce how to guarantee compatibility with Celeborn’s
> > existing features such as partition splitting?
>
> This integration is a new way to make Celeborn work with Flink, so the
> compatibility of the old shuffle service mode is not affected. The new
> integration will also support the features of the old mode. For example,
> partition split will be supported by trying to open the stream from the next
> partition once the previous partition has been read completely. Since these
> features are all implementation details, I initially left them out of the CIP
> to keep it focused, simple, and easy to understand. Following your question, I
> have added some feature details to it.
> 
> > 3. Is there any public configuration of integration with Hybrid Shuffle and
> > Flink client?
>
> Yes, there is an added Flink configuration, which is described in the FLIP[2].
> 
> 
> > 4. How does the server side guarantee the accuracy and recoverability of
> > Segment information?
>
> Like other write-related information, the segment info is added to FileInfo,
> and a lock protects it to guarantee accuracy. Recoverability is achieved by
> serialization and deserialization, the same as for the other fields.
> 
> > 5. Should Celeborn wait until FLIP-459 is released before releasing this
> > integration? Which Flink version will release FLIP-459?
>
> Celeborn's integration should wait for FLIP-459 to be released, because the
> feature relies on both CIP-6 and FLIP-459 to function correctly. If all goes
> well, FLIP-459 could be part of Flink's next release, Flink 1.20.
> 
> 
> Hi, Keyong,
> 
> Thanks for the reminder and the interest in the Reduce Partition. After the Map
> Partition part is finished, we will continue to work on it as soon as possible.
> 
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> 
> Best,
> Yuxin
> 
> 
> On Sat, Jun 8, 2024 at 1:00 PM Keyong Zhou wrote:
> 
> > Hi Yuxin and Xintong,
> >
> > Really excited to see Flink and Celeborn communities collaborate
> > more on shuffle component! I believe this will inspire more for both sides
> > :)
> >
> > +1 for this proposal. Looking forward to seeing this feature make progress.
> >
> > Also I'm very interested in integrating Flink Hybrid Shuffle with
> > Celeborn's
> > Reduce Partition as mentioned in the doc in the future, which I believe
> > will
> > benefit more for very large shuffle operators :)
> >
> > Regards,
> > Keyong Zhou
> >
> > On Thu, Jun 6, 2024 at 1:25 PM Nicholas Jiang wrote:
> >
> > > Hi Yuxin,
> > >
> > > Thanks for driving this CIP about integration with Hybrid Shuffle. I have
> > > some comments on this CIP:
> > >
> > > 1. Could you describe in detail what functions the relevant components
> > > mentioned in Proposed Changes, including CelebornProducerAgent,
> > > CelebornConsumerAgent, CelebornMasterAgent, etc., support? In the design
> > > document, these components are only mentioned, without any details of the
> > > changes.
> > >
> > > 2. Can you briefly introduce how to guarantee compatibility with
> > > Celeborn’s existing features such as partition splitting? IMO, the
> > > compatibility introduction should be mentioned in Proposed Changes to
> > help
> > > community developers understand.
> > >
> > > 3. There are no changes on public interfaces. Is 

Re: Re: [VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Nicholas Jiang
+1. Looking forward to Celeborn CLI.




Regards,

Nicholas Jiang


At 2024-06-12 12:26:34, "Aravind Patnam"  wrote:
>Hi all,
>
>Sorry, this is the correct link to the Celeborn CLI CIP
><https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>.
>
>Thanks,
>Aravind
>
>On Tue, Jun 11, 2024 at 9:24 PM Aravind Patnam  wrote:
>
>> Hi all,
>>
>> This is a call to vote to contribute the Celeborn CLI CIP
>> <https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals>
>> to Apache Celeborn.
>>
>> Please do vote accordingly:
>> [ ] +1 approve
>> [ ] +0 no opinion
>> [ ] -1 disapprove (and the reason)
>>
>> Thanks once again!!
>>
>> Aravind
>>
>
>
>-- 
>Aravind K. Patnam


Re: [VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Aravind Patnam
Hi all,

Sorry, this is the correct link to the Celeborn CLI CIP
<https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>.

Thanks,
Aravind

On Tue, Jun 11, 2024 at 9:24 PM Aravind Patnam  wrote:

> Hi all,
>
> This is a call to vote to contribute the Celeborn CLI CIP
> <https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals>
> to Apache Celeborn.
>
> Please do vote accordingly:
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Thanks once again!!
>
> Aravind
>


-- 
Aravind K. Patnam


[VOTE] Contribute Apache Celeborn CLI

2024-06-11 Thread Aravind Patnam
Hi all,

This is a call to vote to contribute the Celeborn CLI CIP
<https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals>
to Apache Celeborn.

Please do vote accordingly:
[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and the reason)

Thanks once again!!

Aravind


Re: Re: [Discussion] Proposal Management in Celeborn Community

2024-06-11 Thread Mridul Muralidharan
  This sounds great, thanks !
We should definitely archive all proposals which were voted on - and track
them for posterity.

Regards,
Mridul


On Tue, Jun 11, 2024 at 5:04 AM rexxiong  wrote:

> Thank you, everyone. From our discussions, it appears that there is a
> general consensus on centralizing the archiving of CIPs within Confluence
> for efficient management.
> However, opinions diverge on the approach to commenting and discussing
> these CIPs. Xintong has shared valuable insights from existing Apache
> projects, which integrate Confluence usage with email lists for efficient
> tracking of discussions. I believe this approach suits us as well.
>
> However, a problem emerged when someone without a Confluence account tried
> to create a CIP within the Celeborn namespace on Confluence (thanks to
> Aravind for pointing out this problem).
> The issue stems from cwiki.apache.org's current policy that restricts new
> user registrations and permits access solely to Apache committers.
> Consequently, relying on Confluence for CIP drafting proves inconvenient
> for all contributors.
>
> In light of this, after discussions among PMC members, we have adjusted our CIP
> process. Our new plan involves utilizing alternative documentation tools,
> such as Google Docs, for drafting CIPs.
> Subsequently, discussions relevant to these CIPs will take place via our
> mailing lists.
> Finally, the responsibility falls to Celeborn's PMC members and Committers
> to ensure the appropriate archiving of the finalized CIPs within
> Confluence. More details about the CIP process can be found in CIP[1].
>
> [1]
>
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
>
>
> Thanks,
> Jiashu Xiong
>
> On Thu, May 30, 2024 at 1:07 PM Xintong Song wrote:
>
> > In fact, Confluence does support inline comments.
> >
> > However, AFAIK communities that adopt Confluence-based proposal
> management
> > (e.g., Flink[1] / Paimon[2] / Kafka[3]) usually encourage discussions to
> > happen on the mailing list.
> >
> > IMHO, discussions in mailing lists are easier to track compared to inline
> > comments. People don't need to subscribe to notifications of individual
> > documents in order to receive updates on changes. For people who joined
> the
> > discussion late or revisit the discussion later, the mailing thread also
> > makes it easy to understand how the entire conversation has taken place.
> > Most importantly, discussions are better kept in one place rather than
> > separated in multiple places.
> >
> > Best,
> >
> > Xintong
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/PAIMON/Paimon+Improvement+Proposals
> >
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, May 30, 2024 at 12:33 PM Nicholas  wrote:
> >
> > > Hi Jiashu,
> > >
> > >
> > >
> > >
> > > +1 for me. In my experience in the Flink community, proposals are discussed
> > > on the dev mailing list rather than in Confluence comments.
> > >
> > >
> > >
> > >
> > > Anyway, a CIP is required to introduce a new feature or major change.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Nicholas Jiang
> > >
> > >
> > >
> > >
> > > At 2024-05-30 01:29:58, "Mridul Muralidharan" 
> wrote:
> > > >  Inline comments, discussions are invaluable for design docs - this
> is
> > > not
> > > >yet supported in confluence right ?
> > > >Another option would be to iterate and discuss through other means
> (like
> > > >google docs), and before vote, move it to the wiki - so that the
> > community
> > > >is deciding/voting on artifacts which are on the wiki.
> > > >This would also help in case proposals do not end up making it to the
> > vote
> > > >stage, but go through brainstorming/discussion - and evolve into
> > something
> > > >new (or get merged with others).
> > > >
> > > >Regards,
> > > >Mridul
> > > >
> > > >
> > > >On Wed, May 29, 2024 at 10:42 AM Keyong Zhou 
> wrote:
> > > >
> > > >> +1 for me.
> > > >>
> > > >> About the comments by Cheng, IMHO discussing in maillist is also
> > > acceptable
> > > >> (and even better)
> > > >>
> > > >> Regards,
> > > >> Keyong Zhou
> > > >>
> > > >> On Wed, May 29, 2024 at 2:32 PM Cheng Pan wrote:
> > > >>
> > > >> > +1 for archiving proposals on confluence.
> > > >> >
> > > >> > Does Confluence support inline comments like Google Docs does? I
> > think
> > > >> > it’s a convincing functionality for the discussion period.
> > > >> >
> > > >> > Thanks,
> > > >> > Cheng Pan
> > > >> >
> > > >> >
> > > >> > > On May 29, 2024, at 11:19, rexxiong 
> wrote:
> > > >> > >
> > > >> > > Hello, Celeborn community,
> > > >> > >
> > > >> > > In the past, when Celeborn introduced new major features or
> > > significant
> > > >> > changes, we typically used Google Docs to launch proposals.
> > However, a
> > > >> > major 

Re: [VOTE] Release Apache Celeborn 0.5.0-rc1

2024-06-11 Thread Ethan Feng
This vote is canceled due to newly found license issues.

On Tue, Jun 11, 2024 at 5:48 PM Ethan Feng wrote:
>
> Hello, Celeborn community,
>
> This is a call for a vote to release Apache Celeborn
> 0.5.0-rc1
>
> The git tag to be voted upon:
> https://github.com/apache/celeborn/releases/tag/v0.5.0-rc1
>
> Source and binary artifacts can be found at:
> https://dist.apache.org/repos/dist/dev/celeborn/v0.5.0-rc1
>
> The git commit hash:
> 734c42a81c7ed9ebe0c7dbe389a85f31cba116d2
>
> The staging repo:
> https://repository.apache.org/content/repositories/orgapacheceleborn-1067/
>
> The fingerprint of the PGP key release artifacts is signed with:
> FCF20BB29C7BEFDF58F998F76392F71F37356FA0
>
> My public key to verify signatures can be found in:
> https://dist.apache.org/repos/dist/release/celeborn/KEYS
>
> The vote will be open for at least 72 hours or until the necessary
> number of votes is reached.
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove (and the reason)
>
> Steps to validate the release:
> https://www.apache.org/info/verification.html
>
> * Download links, checksums, and PGP signatures are valid.
> * Source code distributions have correct names matching the current release.
> * LICENSE and NOTICE files are correct.
> * All files have license headers if necessary.
> * No unlicensed compiled archives bundled in the source archive.
> * The source tarball matches the git tag.
> * Build from source is successful.
>
> There are additional tests:
> * Performance test: no regression
> 1 TB TPC-DS, 0.5.0 vs 0.4.1: 2042 s vs 2050 s
> 1.1 TB pure shuffle, 0.5.0 vs 0.4.1: 11.8 min vs 11.8 min
>
> * Result correctness test passed
> 1TB TPC-DS runs concurrently, the results are identical.
>
> * Usability test passed
> Rolling upgrade from version 0.4.1 to 0.5.0 succeeded.
> The metrics system works as expected.
>
> * Stability test passed
> Random worker failures, Celeborn works as expected.
> Random master failures, Celeborn works as expected.
> Master meta corrupted, Celeborn works as expected.
>
> * Compatibility test passed
> Celeborn server 0.5.0 works fine with Celeborn client 0.4.1.
>
>
> Regards,
> Ethan Feng


Re: Re: [Discussion] Proposal Management in Celeborn Community

2024-06-11 Thread rexxiong
Thank you, everyone. From our discussions, it appears that there is a
general consensus on centralizing the archiving of CIPs within Confluence
for efficient management.
However, opinions diverge on the approach to commenting and discussing
these CIPs. Xintong has shared valuable insights from existing Apache
projects, which integrate Confluence usage with email lists for efficient
tracking of discussions. I believe this approach suits us as well.

However, a problem emerged when someone without a Confluence account tried
to create a CIP within the Celeborn namespace on Confluence (thanks to
Aravind for pointing out this problem).
The issue stems from cwiki.apache.org's current policy that restricts new
user registrations and permits access solely to Apache committers.
Consequently, relying on Confluence for CIP drafting proves inconvenient
for all contributors.

In light of this, after discussions among PMC members, we have adjusted our CIP
process. Our new plan involves utilizing alternative documentation tools,
such as Google Docs, for drafting CIPs.
Subsequently, discussions relevant to these CIPs will take place via our
mailing lists.
Finally, the responsibility falls to Celeborn's PMC members and Committers
to ensure the appropriate archiving of the finalized CIPs within
Confluence. More details about the CIP process can be found in CIP[1].

[1]
https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals


Thanks,
Jiashu Xiong

On Thu, May 30, 2024 at 1:07 PM Xintong Song wrote:

> In fact, Confluence does support inline comments.
>
> However, AFAIK communities that adopt Confluence-based proposal management
> (e.g., Flink[1] / Paimon[2] / Kafka[3]) usually encourage discussions to
> happen on the mailing list.
>
> IMHO, discussions in mailing lists are easier to track compared to inline
> comments. People don't need to subscribe to notifications of individual
> documents in order to receive updates on changes. For people who joined the
> discussion late or revisit the discussion later, the mailing thread also
> makes it easy to understand how the entire conversation has taken place.
> Most importantly, discussions are better kept in one place rather than
> separated in multiple places.
>
> Best,
>
> Xintong
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> [2]
>
> https://cwiki.apache.org/confluence/display/PAIMON/Paimon+Improvement+Proposals
>
> [3]
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>
>
> Best,
>
> Xintong
>
>
>
> On Thu, May 30, 2024 at 12:33 PM Nicholas  wrote:
>
> > Hi Jiashu,
> >
> >
> >
> >
> > +1 for me. In my experience in the Flink community, proposals are discussed
> > on the dev mailing list rather than in Confluence comments.
> >
> >
> >
> >
> > Anyway, a CIP is required to introduce a new feature or major change.
> >
> >
> >
> >
> > Regards,
> >
> > Nicholas Jiang
> >
> >
> >
> >
> > At 2024-05-30 01:29:58, "Mridul Muralidharan"  wrote:
> > >  Inline comments, discussions are invaluable for design docs - this is
> > not
> > >yet supported in confluence right ?
> > >Another option would be to iterate and discuss through other means (like
> > >google docs), and before vote, move it to the wiki - so that the
> community
> > >is deciding/voting on artifacts which are on the wiki.
> > >This would also help in case proposals do not end up making it to the
> vote
> > >stage, but go through brainstorming/discussion - and evolve into
> something
> > >new (or get merged with others).
> > >
> > >Regards,
> > >Mridul
> > >
> > >
> > >On Wed, May 29, 2024 at 10:42 AM Keyong Zhou  wrote:
> > >
> > >> +1 for me.
> > >>
> > >> About the comments by Cheng, IMHO discussing in maillist is also
> > acceptable
> > >> (and even better)
> > >>
> > >> Regards,
> > >> Keyong Zhou
> > >>
> > >> On Wed, May 29, 2024 at 2:32 PM Cheng Pan wrote:
> > >>
> > >> > +1 for archiving proposals on confluence.
> > >> >
> > >> > Does Confluence support inline comments like Google Docs does? I
> think
> > >> > it’s a convincing functionality for the discussion period.
> > >> >
> > >> > Thanks,
> > >> > Cheng Pan
> > >> >
> > >> >
> > >> > > On May 29, 2024, at 11:19, rexxiong  wrote:
> > >> > >
> > >> > > Hello, Celeborn community,
> > >> > >
> > >> > > In the past, when Celeborn introduced new major features or
> > significant
> > >> > changes, we typically used Google Docs to launch proposals.
> However, a
> > >> > major issue with Google Docs is the difficulty in centrally managing
> > >> these
> > >> > proposals. Therefore, after referring to other communities and based
> > on
> > >> > discussions with several PMCs offline, it appears that Apache
> > Confluence
> > >> > could be a viable alternative for our needs. With that in mind, I
> > would
> > >> > like to invite all of you to share your thoughts, experiences, and
> > >> > preferences regarding the use of Apache Confluence versus Google
> Docs
> > for
> > >> > 

[VOTE] Release Apache Celeborn 0.5.0-rc1

2024-06-11 Thread Ethan Feng
Hello, Celeborn community,

This is a call for a vote to release Apache Celeborn
0.5.0-rc1

The git tag to be voted upon:
https://github.com/apache/celeborn/releases/tag/v0.5.0-rc1

Source and binary artifacts can be found at:
https://dist.apache.org/repos/dist/dev/celeborn/v0.5.0-rc1

The git commit hash:
734c42a81c7ed9ebe0c7dbe389a85f31cba116d2

The staging repo:
https://repository.apache.org/content/repositories/orgapacheceleborn-1067/

The fingerprint of the PGP key release artifacts is signed with:
FCF20BB29C7BEFDF58F998F76392F71F37356FA0

My public key to verify signatures can be found in:
https://dist.apache.org/repos/dist/release/celeborn/KEYS

The vote will be open for at least 72 hours or until the necessary
number of votes is reached.

Please vote accordingly:

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and the reason)

Steps to validate the release:
https://www.apache.org/info/verification.html

* Download links, checksums, and PGP signatures are valid.
* Source code distributions have correct names matching the current release.
* LICENSE and NOTICE files are correct.
* All files have license headers if necessary.
* No unlicensed compiled archives bundled in the source archive.
* The source tarball matches the git tag.
* Build from source is successful.

There are additional tests:
* Performance test: no regression
1 TB TPC-DS, 0.5.0 vs 0.4.1: 2042 s vs 2050 s
1.1 TB pure shuffle, 0.5.0 vs 0.4.1: 11.8 min vs 11.8 min

* Result correctness test passed
1TB TPC-DS runs concurrently, the results are identical.

* Usability test passed
Rolling upgrade from version 0.4.1 to 0.5.0 succeeded.
The metrics system works as expected.

* Stability test passed
Random worker failures, Celeborn works as expected.
Random master failures, Celeborn works as expected.
Master meta corrupted, Celeborn works as expected.

* Compatibility test passed
Celeborn server 0.5.0 works fine with Celeborn client 0.4.1.


Regards,
Ethan Feng


Re: [DISCUSS] Celeborn CLI Proposal

2024-06-11 Thread Aravind Patnam
Sounds good!! I will start a vote thread.

Thanks,
Aravind

Aravind K. Patnam


On Tue, Jun 11, 2024 at 1:01 AM rexxiong  wrote:

> Hi Aravind,
>
> Thank you for the proposal. It looks good to me, and I agree with Yu that there
> is no blocker. Please go ahead and initiate a formal vote for this CIP; see
> the process section in the CIP guidelines
> <
> https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> >
> .
>
>
> Thanks,
> Jiashu Xiong
>
> On Tue, Jun 11, 2024 at 3:26 PM Yu Li wrote:
>
> > IIUC, we need a formal vote on this CIP and make sure there are (at
> > least one) committer(s) to help review the coming up PRs.
> >
> > From the existing discussions, I believe there's no blocker but just a
> > normal process for this CIP to move on (smile).
> >
> > And thanks for the proposal, Aravind!
> >
> > Best Regards,
> > Yu
> >
> > On Tue, 11 Jun 2024 at 13:10, Aravind Patnam 
> wrote:
> > >
> > > Hi all,
> > >
> > > Thanks for the reviews everyone!!
> > >
> > > Is there any other process required (such as a vote thread), or can I
> > start
> > > contributing the CLI now?
> > >
> > > Thanks,
> > > Aravind
> > >
> > > Aravind K. Patnam
> > >
> > >
> > > > On Mon, Jun 10, 2024 at 11:29 AM Mridul Muralidharan wrote:
> > >
> > > > Hi,
> > > >
> > > >   Looks good to me as well, I had reviewed this proposal internally
> > already
> > > > :-)
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > >
> > > > On Fri, Jun 7, 2024 at 11:32 PM Keyong Zhou 
> wrote:
> > > >
> > > > > Hi Aravind,
> > > > >
> > > > > Thanks for the proposal! The proposal LGTM, I think it's very
> > valuable.
> > > > >
> > > > > Regards,
> > > > > Keyong Zhou
> > > > >
> > > > > > On Fri, Jun 7, 2024 at 12:47 PM Aravind Patnam wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks Nicholas for the comments!
> > > > > >
> > > > > > I now got access to put the proposal in Confluence in the form of
> > CIP,
> > > > > here
> > > > > > <
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI
> > > > > > >
> > > > > > it is.
> > > > > >
> > > > > > Regarding your questions:
> > > > > >
> > > > > > > 1. From a user's perspective, the CLI is more used for some
> > > > maintenance
> > > > > > operations such as online and offline of server, rescaling of
> > cluster
> > > > > etc,
> > > > > > not only based on the REST API. What CLI interfaces are there
> that
> > the
> > > > > REST
> > > > > > API doesn’t have for maintenance?
> > > > > > This is highly dependent on what the user is leveraging to manage
> > their
> > > > > > cluster. For example, in k8s, you would be using k8s APIs to
> > achieve
> > > > > this.
> > > > > > We can probably add a generic interface API for it that provides
> > basic
> > > > > > operations that users can implement themselves for their cluster
> > > > > management
> > > > > > logic based on what cluster managers they are using. Although, I
> > think
> > > > > this
> > > > > > will likely be a later evolution of the CLI, once basic REST API
> > > > > operations
> > > > > > are implemented in the CLI. WDYT?
> > > > > >
> > > > > > > > 2. MASTER and WORKER share some sub-commands. Why do these
> > > > > > > > sub-commands not belong to BOTH?
> > > > > > Agreed - this was a formatting mistake. I fixed it now, thanks
> for
> > > > > pointing
> > > > > > that out.
> > > > > >
> > > > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the
> > CLI
> > > > > works
> > > > > > well no matter the server is alive.
> > > > > > Yes, I agree. I think for this we would have to talk to the
> cluster
> > > > > > manager, similar to my response to #1. We would have to query the
> > > > > specific
> > > > > > cluster manager to get details if the Celeborn servers are dead,
> > since
> > > > > the
> > > > > > Celeborn REST API would not work then. We can add a generic API
> > that
> > > > > users
> > > > > > can implement based on their own environment.
> > > > > >
> > > > > > Thanks,
> > > > > > Aravind
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 5, 2024 at 10:43 PM Nicholas Jiang <
> > > > nicholasji...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Aravind,
> > > > > > >
> > > > > > > Thanks for driving this CIP about Celeborn CLI. I have some
> > comments
> > > > on
> > > > > > > this CIP:
> > > > > > >
> > > > > > > 1. From a user's perspective, the CLI is more used for some
> > > > maintenance
> > > > > > > operations such as online and offline of server, rescaling of
> > cluster
> > > > > > etc,
> > > > > > > not only based on the REST API. What CLI interfaces are there
> > that
> > > > the
> > > > > > REST
> > > > > > > API doesn’t have for maintenance?
> > > > > > >
> > > > > > > > 2. MASTER and WORKER share some sub-commands. Why do these
> > > > > > > > sub-commands not belong to BOTH?
> > > > > > >
> > > > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the
> > CLI
> > > > > 

Re: [DISCUSSION] CIP-6: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-11 Thread Yuxin Tan
Hi Nicholas,

Thanks for the valuable feedback.

> 1. Could you describe in detail what functions the relevant components
> mentioned in Proposed Changes

These components are only the pluggable implementations of the Celeborn tier.
The details and the mechanisms of switching between tiers are in the previous
FLIP[1]. Celeborn is added to hybrid shuffle as a new tier and shares
similarities with the existing tiers, such as the Memory tier and the Disk
tier. In this tiered storage, agents serve as the entry points of interaction
between the framework and the different tiers. For instance,
CelebornProducerAgent acts as the entry point for producers to emit data into
the tier. If you still have similar questions after reading that FLIP, please
feel free to let me know.
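
To illustrate the idea of agents as per-tier entry points, here is a rough Python sketch. It is not the Flink or Celeborn API: the class names, the `try_write` method, and the "first tier that accepts wins" selection are all assumptions made for illustration only. The actual tier switching mechanism is the one described in FLIP[1].

```python
from abc import ABC, abstractmethod
from typing import List


class ProducerAgent(ABC):
    """Hypothetical entry point through which a producer writes into one tier."""

    @abstractmethod
    def try_write(self, subpartition: int, data: bytes) -> bool:
        """Return True if this tier accepted the data."""


class MemoryTierAgent(ProducerAgent):
    """Toy memory tier with a fixed byte budget."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0

    def try_write(self, subpartition: int, data: bytes) -> bool:
        if self.used_bytes + len(data) > self.capacity_bytes:
            return False  # tier is full, let the framework try the next tier
        self.used_bytes += len(data)
        return True


class RemoteTierAgent(ProducerAgent):
    """Stand-in for a remote tier such as Celeborn; always accepts in this sketch."""

    def __init__(self) -> None:
        self.written = []

    def try_write(self, subpartition: int, data: bytes) -> bool:
        self.written.append((subpartition, data))
        return True


def emit(agents: List[ProducerAgent], subpartition: int, data: bytes) -> str:
    """Offer the data to each tier in priority order; return the accepting tier's name."""
    for agent in agents:
        if agent.try_write(subpartition, data):
            return type(agent).__name__
    raise RuntimeError("no tier accepted the data")
```

With `agents = [MemoryTierAgent(4), RemoteTierAgent()]`, a 4-byte write lands in the memory tier and the next write falls through to the remote tier.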

> 2. Can you briefly introduce how to guarantee compatibility with Celeborn’s
> existing features such as partition splitting?

This integration is a new way to make Celeborn work with Flink, so the
compatibility of the old shuffle service mode is not affected. The new
integration will also support the features of the old mode. For example,
partition split will be supported by trying to open the stream from the next
partition once the previous partition has been read completely. Since these
features are all implementation details, I initially left them out of the CIP
to keep it focused, simple, and easy to understand. Following your question, I
have added some feature details to it.
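
The read path for split partitions described above can be sketched as a generator. This is a simplified illustration under assumptions: `open_split` is a hypothetical callback that opens the stream for one split, and the splits are assumed to be known in order. It is not the actual client code.

```python
from typing import Callable, Iterable, Iterator


def read_split_partition(
    open_split: Callable[[int], Iterable[bytes]],
    split_count: int,
) -> Iterator[bytes]:
    """Yield all chunks of a partition that was split into several parts.

    The stream for split i + 1 is opened only after split i has been read
    completely, mirroring the behavior described in the reply above.
    """
    for split_index in range(split_count):
        # The generator is lazy, so this call happens only when the previous
        # split has been exhausted by the consumer.
        for chunk in open_split(split_index):
            yield chunk
```

Because the generator is lazy, a consumer that stops early never opens the later splits.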

> 3. Is there any public configuration of integration with Hybrid Shuffle and
> Flink client?

Yes, there is an added Flink configuration, which is described in the FLIP[2].


> 4. How does the server side guarantee the accuracy and recoverability of
> Segment information?

Like other write-related information, the segment info is added to FileInfo,
and a lock protects it to guarantee accuracy. Recoverability is achieved by
serialization and deserialization, the same as for the other fields.
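
As a rough picture of the answer above, the following Python sketch keeps segment info behind a lock and restores it through serialization. The class name, the field layout (subpartition id to segment id to first buffer index), and the JSON format are assumptions for illustration; they are not Celeborn's actual FileInfo class or wire format.

```python
import json
import threading
from typing import Dict


class FileInfoSketch:
    """Hypothetical stand-in for per-file write metadata with segment info."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # subpartition id -> {segment id: first buffer index} (assumed layout)
        self._segments: Dict[int, Dict[int, int]] = {}

    def add_segment(self, subpartition: int, segment_id: int, first_buffer_index: int) -> None:
        # The lock keeps concurrent writers from corrupting the mapping.
        with self._lock:
            self._segments.setdefault(subpartition, {})[segment_id] = first_buffer_index

    def segments(self, subpartition: int) -> Dict[int, int]:
        with self._lock:
            return dict(self._segments.get(subpartition, {}))

    def to_json(self) -> str:
        # JSON object keys must be strings, so the integer ids are converted.
        with self._lock:
            payload = {
                str(sub): {str(seg): idx for seg, idx in segs.items()}
                for sub, segs in self._segments.items()
            }
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FileInfoSketch":
        restored = cls()
        for sub, segs in json.loads(text).items():
            for seg, idx in segs.items():
                restored.add_segment(int(sub), int(seg), idx)
        return restored
```

After a restart, `FileInfoSketch.from_json(saved_text)` would rebuild the same segment mapping that was persisted with `to_json()`.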

> 5. Should Celeborn wait until FLIP-459 is released before releasing this
> integration? Which Flink version will release FLIP-459?

Celeborn's integration should wait for FLIP-459 to be released, because the
feature relies on both CIP-6 and FLIP-459 to function correctly. If all goes
well, FLIP-459 could be part of Flink's next release, Flink 1.20.


Hi, Keyong,

Thanks for the reminder and the interest in the Reduce Partition. After the Map
Partition part is finished, we will continue to work on it as soon as possible.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn

Best,
Yuxin


On Sat, Jun 8, 2024 at 13:00, Keyong Zhou wrote:

> Hi Yuxin and Xintong,
>
> Really excited to see the Flink and Celeborn communities collaborate more
> on the shuffle component! I believe this will inspire more on both sides :)
>
> +1 for this proposal, looking forward to seeing this feature make progress.
>
> Also, I'm very interested in integrating Flink Hybrid Shuffle with
> Celeborn's Reduce Partition in the future, as mentioned in the doc, which I
> believe will bring more benefit for very large shuffle operators :)
>
> Regards,
> Keyong Zhou
>
> On Thu, Jun 6, 2024 at 13:25, Nicholas Jiang wrote:
>
> > Hi Yuxin,
> >
> > Thanks for driving this CIP about integration with Hybrid Shuffle. I have
> > some comments on this CIP:
> >
> > 1. Could you describe in detail what functions the relevant components
> > mentioned in Proposed Changes, including CelebornProducerAgent,
> > CelebornConsumerAgent, CelebornMasterAgent, etc., support? In the design
> > document, these components are only mentioned, without any details of the
> > changes.
> >
> > 2. Can you briefly introduce how to guarantee compatibility with
> > Celeborn’s existing features such as partition splitting? IMO, the
> > compatibility introduction should be mentioned in Proposed Changes to
> > help community developers understand.
> >
> > 3. There are no changes on public interfaces. Is there any public
> > configuration of integration with Hybrid Shuffle and Flink client?
> >
> > 4. The server side must store Segment information for each subpartition.
> > How does the server side guarantee the accuracy and recoverability of
> > Segment information?
> >
> > 5. Should Celeborn wait until FLIP-459 is released before releasing this
> > integration? Which Flink version will release FLIP-459?
> >
> > Regards,
> > Nicholas Jiang
> >
> > On 2024/05/28 12:51:32 Yuxin Tan wrote:
> > > Hi all,
> > >
> > > I would like to start a discussion on CIP-6 Support Flink hybrid shuffle
> > > integration with Apache Celeborn[1]. Celeborn provides a stable,
> > > performant, scalable remote shuffle service. Concurrently, Flink hybrid
> > > shuffle supports transitions between memory, disk, and remote storage to
> > > improve performance and job stability. This integration
> > 

Re: [DISCUSS] Celeborn CLI Proposal

2024-06-11 Thread rexxiong
Hi Aravind,

Thank you for the proposal. It looks good to me, and I agree with Yu that
there is no blocker. Please go ahead and initiate a formal vote for this CIP;
see the process section in the CIP guidelines.


Thanks,
Jiashu Xiong

On Tue, Jun 11, 2024 at 15:26, Yu Li wrote:

> IIUC, we need a formal vote on this CIP and make sure there are (at
> least one) committer(s) to help review the upcoming PRs.
>
> From the existing discussions, I believe there's no blocker but just a
> normal process for this CIP to move on (smile).
>
> And thanks for the proposal, Aravind!
>
> Best Regards,
> Yu
>
> On Tue, 11 Jun 2024 at 13:10, Aravind Patnam  wrote:
> >
> > Hi all,
> >
> > Thanks for the reviews everyone!!
> >
> > Is there any other process required (such as a vote thread), or can I
> > start contributing the CLI now?
> >
> > Thanks,
> > Aravind
> >
> > Aravind K. Patnam
> >
> >
> > On Mon, Jun 10, 2024 at 11:29 AM Mridul Muralidharan wrote:
> >
> > > Hi,
> > >
> > > Looks good to me as well, I had reviewed this proposal internally
> > > already :-)
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Fri, Jun 7, 2024 at 11:32 PM Keyong Zhou  wrote:
> > >
> > > > Hi Aravind,
> > > >
> > > > Thanks for the proposal! The proposal LGTM, I think it's very
> > > > valuable.
> > > >
> > > > Regards,
> > > > Keyong Zhou
> > > >
> > > > On Fri, Jun 7, 2024 at 12:47, Aravind Patnam wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Thanks Nicholas for the comments!
> > > > >
> > > > > I now got access to put the proposal in Confluence in the form of
> > > > > CIP, here
> > > > > <https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>
> > > > > it is.
> > > > >
> > > > > Regarding your questions:
> > > > >
> > > > > > 1. From a user's perspective, the CLI is more used for some
> > > > > > maintenance operations such as online and offline of server,
> > > > > > rescaling of cluster etc, not only based on the REST API. What CLI
> > > > > > interfaces are there that the REST API doesn’t have for
> > > > > > maintenance?
> > > > > This is highly dependent on what the user is leveraging to manage
> > > > > their cluster. For example, in k8s, you would be using k8s APIs to
> > > > > achieve this. We can probably add a generic interface API for it
> > > > > that provides basic operations that users can implement themselves
> > > > > for their cluster management logic based on what cluster managers
> > > > > they are using. Although, I think this will likely be a later
> > > > > evolution of the CLI, once basic REST API operations are
> > > > > implemented in the CLI. WDYT?
> > > > >
> > > > > > 2. There are same sub-commands between MASTER and WORKER. Why not
> > > > > > these sub-commands belong to BOTH?
> > > > > Agreed - this was a formatting mistake. I fixed it now, thanks for
> > > > > pointing that out.
> > > > >
> > > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the
> > > > > > CLI works well no matter the server is alive.
> > > > > Yes, I agree. I think for this we would have to talk to the cluster
> > > > > manager, similar to my response to #1. We would have to query the
> > > > > specific cluster manager to get details if the Celeborn servers are
> > > > > dead, since the Celeborn REST API would not work then. We can add a
> > > > > generic API that users can implement based on their own environment.
> > > > >
> > > > > Thanks,
> > > > > Aravind
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 5, 2024 at 10:43 PM Nicholas Jiang
> > > > > <nicholasji...@apache.org> wrote:
> > > > >
> > > > > > Hi Aravind,
> > > > > >
> > > > > > Thanks for driving this CIP about Celeborn CLI. I have some
> > > > > > comments on this CIP:
> > > > > >
> > > > > > 1. From a user's perspective, the CLI is more used for some
> > > > > > maintenance operations such as online and offline of server,
> > > > > > rescaling of cluster etc, not only based on the REST API. What CLI
> > > > > > interfaces are there that the REST API doesn’t have for
> > > > > > maintenance?
> > > > > >
> > > > > > 2. There are same sub-commands between MASTER and WORKER. Why not
> > > > > > these sub-commands belong to BOTH?
> > > > > >
> > > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the
> > > > > > CLI works well no matter the server is alive.
> > > > > >
> > > > > > BTW, could this design doc of proposal follow the template of
> > > > > > CIP[1]?
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> > > > > >
> > > > > > Regards,
> > > > > > Nicholas Jiang
> > > > > >
> > > > > > On 2024/06/05 23:33:02 Aravind Patnam wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > 

Re: [DISCUSS] Celeborn CLI Proposal

2024-06-11 Thread Yu Li
IIUC, we need a formal vote on this CIP and make sure there are (at
least one) committer(s) to help review the upcoming PRs.

From the existing discussions, I believe there's no blocker but just a
normal process for this CIP to move on (smile).

And thanks for the proposal, Aravind!

Best Regards,
Yu

On Tue, 11 Jun 2024 at 13:10, Aravind Patnam  wrote:
>
> Hi all,
>
> Thanks for the reviews everyone!!
>
> Is there any other process required (such as a vote thread), or can I start
> contributing the CLI now?
>
> Thanks,
> Aravind
>
> Aravind K. Patnam
>
>
> On Mon, Jun 10, 2024 at 11:29 AM Mridul Muralidharan wrote:
>
> > Hi,
> >
> > Looks good to me as well, I had reviewed this proposal internally already
> > :-)
> >
> > Regards,
> > Mridul
> >
> >
> > On Fri, Jun 7, 2024 at 11:32 PM Keyong Zhou  wrote:
> >
> > > Hi Aravind,
> > >
> > > Thanks for the proposal! The proposal LGTM, I think it's very valuable.
> > >
> > > Regards,
> > > Keyong Zhou
> > >
> > > On Fri, Jun 7, 2024 at 12:47, Aravind Patnam wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks Nicholas for the comments!
> > > >
> > > > I now got access to put the proposal in Confluence in the form of CIP,
> > > > here
> > > > <https://cwiki.apache.org/confluence/display/CELEBORN/CIP+7+-+Celeborn+CLI>
> > > > it is.
> > > >
> > > > Regarding your questions:
> > > >
> > > > > 1. From a user's perspective, the CLI is more used for some
> > > > > maintenance operations such as online and offline of server,
> > > > > rescaling of cluster etc, not only based on the REST API. What CLI
> > > > > interfaces are there that the REST API doesn’t have for maintenance?
> > > > This is highly dependent on what the user is leveraging to manage
> > > > their cluster. For example, in k8s, you would be using k8s APIs to
> > > > achieve this. We can probably add a generic interface API for it that
> > > > provides basic operations that users can implement themselves for
> > > > their cluster management logic based on what cluster managers they are
> > > > using. Although, I think this will likely be a later evolution of the
> > > > CLI, once basic REST API operations are implemented in the CLI. WDYT?
> > > >
> > > > > 2. There are same sub-commands between MASTER and WORKER. Why not
> > > > > these sub-commands belong to BOTH?
> > > > Agreed - this was a formatting mistake. I fixed it now, thanks for
> > > > pointing that out.
> > > >
> > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the CLI
> > > > > works well no matter the server is alive.
> > > > Yes, I agree. I think for this we would have to talk to the cluster
> > > > manager, similar to my response to #1. We would have to query the
> > > > specific cluster manager to get details if the Celeborn servers are
> > > > dead, since the Celeborn REST API would not work then. We can add a
> > > > generic API that users can implement based on their own environment.
> > > >
> > > > Thanks,
> > > > Aravind
> > > >
> > > >
> > > >
> > > > On Wed, Jun 5, 2024 at 10:43 PM Nicholas Jiang
> > > > <nicholasji...@apache.org> wrote:
> > > >
> > > > > Hi Aravind,
> > > > >
> > > > > Thanks for driving this CIP about Celeborn CLI. I have some comments
> > > > > on this CIP:
> > > > >
> > > > > 1. From a user's perspective, the CLI is more used for some
> > > > > maintenance operations such as online and offline of server,
> > > > > rescaling of cluster etc, not only based on the REST API. What CLI
> > > > > interfaces are there that the REST API doesn’t have for maintenance?
> > > > >
> > > > > 2. There are same sub-commands between MASTER and WORKER. Why not
> > > > > these sub-commands belong to BOTH?
> > > > >
> > > > > 3. Does the implementation of CLI invoke the REST API? IMO, the CLI
> > > > > works well no matter the server is alive.
> > > > >
> > > > > BTW, could this design doc of proposal follow the template of CIP[1]?
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/CELEBORN/Celeborn+Improvement+Proposals
> > > > >
> > > > > Regards,
> > > > > Nicholas Jiang
> > > > >
> > > > > On 2024/06/05 23:33:02 Aravind Patnam wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I have written up a proposal about introducing a CLI for Celeborn.
> > > > > > You can find the proposal
> > > > > > <https://docs.google.com/document/d/1j9wKFSR_ychYDF0NU5YN67WCCtNAgYTbN5CN8V3SOnk/edit?usp=sharing>
> > > > > > here.
> > > > > > Please let me know if you have any comments or questions.
> > > > > >
> > > > > > TLDR by introducing a CLI, it would complement the existing
> > > > > > dashboard and would benefit us internally. We rely on CLI tools
> > > > > > internally a lot for automation and other operations.
> > > > > >
> > > > > > FYI, I was not able to access the cwiki page to put