Re: EMR + celeborn
Hi Rahul,

Are you referring to AWS EMR or Aliyun EMR? You are right that the Spark driver and executors talk directly to the Celeborn Master and Workers; as long as the network is connected, there should be no issue.

For Aliyun EMR, if the EMR cluster and Celeborn are in the same VPC, they can talk to each other directly without any extra setup. I'm not that familiar with AWS EMR, but I think it should be similar. I'll ask some of the Celeborn users to see if they can provide more information.

BTW, if you are using AWS EMR with Celeborn, please make sure the following config is set to false:

    spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false

Regards,
Keyong Zhou

rahul gidwani wrote on Tue, Jun 18, 2024 at 04:53:
> Hello,
>
> I was wondering if anyone here has tried running Celeborn in k8s for Spark
> on EMR? Our Spark clusters currently run in EMR and we want to use
> Celeborn as our shuffle service, but I was wondering if there is any tricky
> networking setup needed to get this working, since I assume that any node
> in EMR would eventually talk directly to a worker node.
>
> Thank you
Necessary config for AWS Spark
Dear Celeborn users,

If you are running Spark on AWS with Celeborn, please make sure the following Spark config is added, or there might be undefined behavior:

    spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false

Regards,
Keyong Zhou
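For instance, the setting can be applied per job on the command line, or persisted cluster-wide. Both forms below are illustrative sketches (the jar name is a placeholder, not part of the original advice):

```shell
# Per job: pass the config to spark-submit
spark-submit \
  --conf spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false \
  your-job.jar

# Cluster-wide: persist it in spark-defaults.conf
echo "spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled false" \
  >> "$SPARK_HOME/conf/spark-defaults.conf"
```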
[ANNOUNCE] Add Mridul Muralidharan as new committer
Hi Celeborn Community,

The Project Management Committee (PMC) for Apache Celeborn has invited Mridul Muralidharan to become a committer and we are pleased to announce that he has accepted.

Being a committer enables easier contribution to the project since there is no need to go via the patch submission process. This should enable better productivity. A PMC member helps manage and guide the direction of the project.

Please join me in congratulating Mridul Muralidharan!

Regards,
Keyong Zhou
Re: How does Celeborn handles FetchFailed exception
Hi Sanskar,

Thanks for your interest in Celeborn! Here is the design doc[1], and this thread[2] has extensive discussion, which I believe will be helpful.

For your question: LifecycleManager registers a shuffle tracker callback after creation, which calls unregisterAllMapOutput:

    lifecycleManager.registerShuffleTrackerCallback(
        shuffleId -> SparkUtils.unregisterAllMapOutput(mapOutputTracker, shuffleId));

The callback is invoked when LifecycleManager handles a shuffle fetch failure:

    private def handleReportShuffleFetchFailure(
        context: RpcCallContext,
        appShuffleId: Int,
        shuffleId: Int): Unit = {
      ...
      appShuffleTrackerCallback match {
        case Some(callback) =>
          try {
            callback.accept(appShuffleId)
      ...
    }

And the ShuffleClient inside the executors reports a shuffle fetch failure whenever one is encountered:

    public boolean reportShuffleFetchFailure(int appShuffleId, int shuffleId) {
      ...
      PbReportShuffleFetchFailureResponse pbReportShuffleFetchFailureResponse =
          lifecycleManagerRef.askSync(
              pbReportShuffleFetchFailure,
              conf.clientRpcRegisterShuffleRpcAskTimeout(),
              ClassTag$.MODULE$.apply(PbReportShuffleFetchFailureResponse.class));
      ...
    }

cc Erik, Mridul

[1] https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
[2] https://lists.apache.org/thread/nogsgc61qh2zomdvhfbs4b0y88s3qtqg

Regards,
Keyong Zhou

Sanskar Modi wrote on Wed, Apr 3, 2024 at 18:35:
> Hi Celeborn community,
>
> I wanted to better understand how Celeborn handles FetchFailed
> exceptions. In Spark's `DAGScheduler` fetch failure handling, the code
> tries to unregister the map output for the fetch-failed mapIndex:
>
>     } else if (mapIndex != -1) {
>       // Mark the map whose fetch failed as broken in the map stage
>       mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)
>     }
>
> But in the Celeborn case mapIndex will always be -1, so how does the
> shuffle output get cleared?
>
> Ideally mapOutputTracker.unregisterAllMapAndMergeOutput(shuffleId) should
> be called for the fetch-failed stage, but I'm not able to find that code
> piece.
>
> Can someone help me understand this? I might be missing something basic
> here.
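To make the control flow concrete, here is a minimal, self-contained Java sketch of the callback pattern described in the reply above. The class and method names below (MiniLifecycleManager, MiniMapOutputTracker, FetchFailureSketch) are hypothetical simplifications for illustration, not Celeborn's or Spark's actual API: a lifecycle manager holds an optional callback, a fetch-failure report triggers it, and the callback unregisters all map output for that shuffle, since mapIndex is always -1 with Celeborn.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.IntConsumer;

// Hypothetical simplification of LifecycleManager's callback wiring.
class MiniLifecycleManager {
    private Optional<IntConsumer> trackerCallback = Optional.empty();

    // Mirrors lifecycleManager.registerShuffleTrackerCallback(...)
    void registerShuffleTrackerCallback(IntConsumer callback) {
        this.trackerCallback = Optional.of(callback);
    }

    // Mirrors handleReportShuffleFetchFailure: invoke the callback if present.
    void handleReportShuffleFetchFailure(int appShuffleId) {
        trackerCallback.ifPresent(cb -> cb.accept(appShuffleId));
    }
}

// Hypothetical stand-in for Spark's MapOutputTracker.
class MiniMapOutputTracker {
    final Set<Integer> registeredShuffles = new HashSet<>();

    void registerShuffle(int shuffleId) { registeredShuffles.add(shuffleId); }

    // Mirrors SparkUtils.unregisterAllMapOutput(tracker, shuffleId):
    // drop ALL map output for the shuffle rather than a single mapIndex.
    void unregisterAllMapOutput(int shuffleId) { registeredShuffles.remove(shuffleId); }
}

public class FetchFailureSketch {
    public static void main(String[] args) {
        MiniMapOutputTracker tracker = new MiniMapOutputTracker();
        MiniLifecycleManager manager = new MiniLifecycleManager();

        // Wiring done once after creation, as in the snippet above.
        manager.registerShuffleTrackerCallback(tracker::unregisterAllMapOutput);

        tracker.registerShuffle(7);
        // An executor's ShuffleClient reporting a fetch failure for shuffle 7...
        manager.handleReportShuffleFetchFailure(7);

        // ...causes the whole shuffle's map output to be unregistered.
        System.out.println(tracker.registeredShuffles.contains(7)); // prints "false"
    }
}
```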
[ANNOUNCE] Add Chandni Singh as new committer
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn has invited Chandni Singh to become a committer and we are pleased to announce that she has accepted.

Being a committer enables easier contribution to the project since there is no need to go via the patch submission process. This should enable better productivity. A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Chandni Singh!

Thanks,
Keyong Zhou
Re: Celeborn for Spark 3.2 + JDK17
I think so. Do you mind sending a PR to Celeborn? :)

Regards,
Keyong Zhou

Curtis Howard wrote on Tue, Mar 12, 2024 at 22:58:
> Thanks Keyong,
>
> I was able to build the 'client' JARs for Spark 3.2 / JDK21 and run
> simple tests; however, I did face a runtime error similar to this:
>
>     Caused by: java.lang.ExceptionInInitializerError: Exception
>     java.lang.IllegalStateException: java.lang.NoSuchMethodException:
>     java.nio.DirectByteBuffer.<init>(long, int) [in thread "Executor task
>     launch worker for task 0.0 in stage 0.0 (TID 0)"]
>       at org.apache.celeborn.common.unsafe.Platform.<clinit>(Platform.java:135)
>       ... 16 more
>
> I used the following patch (borrowed from the almost identical upstream
> Spark patch for SPARK-42369 <https://github.com/apache/spark/pull/39909>,
> which was required for JDK21 as a result of JDK-8303083
> <https://bugs.openjdk.org/browse/JDK-8303083>) to work around this
> successfully. I think the same change may be required in Celeborn's
> Platform.java as well, for JDK21:
>
>     diff --git a/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java b/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
>     index ec541a77..218d517b 100644
>     --- a/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
>     +++ b/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
>     @@ -90,7 +90,15 @@ public final class Platform {
>        }
>        try {
>          Class<?> cls = Class.forName("java.nio.DirectByteBuffer");
>     -    Constructor<?> constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
>     +    Constructor<?> constructor;
>     +    try {
>     +      constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
>     +    } catch (NoSuchMethodException e) {
>     +      // DirectByteBuffer(long, int) was removed in
>     +      // https://github.com/openjdk/jdk/commit/a56598f5a534cc9223367e7faa8433ea38661db9
>     +      constructor = cls.getDeclaredConstructor(Long.TYPE, Long.TYPE);
>     +    }
>     +
>          Field cleanerField = cls.getDeclaredField("cleaner");
>          try {
>            constructor.setAccessible(true);
>
> Curtis
>
> On Tue, Mar 12, 2024 at 10:32 AM Keyong Zhou wrote:
>
>> Hi Curtis,
>>
>> With this PR https://github.com/apache/incubator-celeborn/pull/2385 you
>> can compile with JDK21 using the following command:
>> ./build/make-distribution.sh -Pspark-3.5 -Pjdk-21
>>
>> Regards,
>> Keyong Zhou
>>
>> Curtis Howard wrote on Sat, Mar 9, 2024 at 02:38:
>>
>>> Thank you Keyong!
>>>
>>> Related to this, has testing started for Celeborn with JDK21? (Any
>>> anticipated concerns there, based on what you know so far?)
>>> We will be migrating to JDK21 shortly, which is why I ask.
>>>
>>> Thanks again
>>> Curtis
>>>
>>> On Fri, Mar 8, 2024 at 11:05 AM Keyong Zhou wrote:
>>>
>>>> Hi Curtis,
>>>>
>>>> Thanks for reaching out!
>>>>
>>>> No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I
>>>> think there is a good chance that it can be used successfully.
>>>>
>>>> If you hit any problems in your tests, feel free to let us know :)
>>>>
>>>> Regards,
>>>> Keyong Zhou
>>>>
>>>> Curtis Howard wrote on Fri, Mar 8, 2024 at 22:23:
>>>>
>>>>> Hi,
>>>>>
>>>>> We would like to confirm the reason for Celeborn not being listed with
>>>>> Spark 3.2 and JDK17 in the compatibility matrix here:
>>>>> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>>>>>
>>>>> Is the only reason that the Apache Spark 3.2 release does not
>>>>> officially support JDK17 (as covered in
>>>>> https://issues.apache.org/jira/browse/SPARK-33772), or have other
>>>>> Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
>>>>> combination?
>>>>>
>>>>> We currently build Spark ourselves with custom dependencies and are
>>>>> successfully using Spark 3.2 with JDK17. Understanding that this
>>>>> combination has likely not been tested with Celeborn, we are wondering
>>>>> whether there are any known blockers for Celeborn + Spark 3.2 + JDK17,
>>>>> or whether there is a good chance that it could still be used
>>>>> successfully.
>>>>>
>>>>> Thank you!
>>>>> Curtis
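The workaround in this thread boils down to a reflective constructor lookup with a fallback. Here is a small, self-contained sketch of that idea (not Celeborn's actual Platform.java; the class and method names are illustrative): try the pre-JDK21 DirectByteBuffer(long, int) constructor first, and fall back to the (long, long) variant that replaced it after JDK-8303083. Only the lookup is shown; actually invoking the constructor on JDK 9+ additionally requires opening java.nio (e.g. --add-opens).

```java
import java.lang.reflect.Constructor;

public class DirectBufferCtorLookup {
    // Returns the DirectByteBuffer constructor available on this JDK:
    // (long, int) before JDK 21, (long, long) from JDK 21 onwards.
    static Constructor<?> findConstructor() throws ReflectiveOperationException {
        Class<?> cls = Class.forName("java.nio.DirectByteBuffer");
        try {
            return cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
        } catch (NoSuchMethodException e) {
            // Removed by JDK-8303083; the capacity parameter widened to long.
            return cls.getDeclaredConstructor(Long.TYPE, Long.TYPE);
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Constructor<?> ctor = findConstructor();
        // Either overload takes exactly two parameters.
        System.out.println(ctor.getParameterCount()); // prints "2"
    }
}
```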
Re: Celeborn for Spark 3.2 + JDK17
Hi Curtis,

With this PR https://github.com/apache/incubator-celeborn/pull/2385 you can compile with JDK21 using the following command:

    ./build/make-distribution.sh -Pspark-3.5 -Pjdk-21

Regards,
Keyong Zhou

Curtis Howard wrote on Sat, Mar 9, 2024 at 02:38:
> Thank you Keyong!
>
> Related to this, has testing started for Celeborn with JDK21? (Any
> anticipated concerns there, based on what you know so far?)
> We will be migrating to JDK21 shortly, which is why I ask.
>
> Thanks again
> Curtis
>
> On Fri, Mar 8, 2024 at 11:05 AM Keyong Zhou wrote:
>
>> Hi Curtis,
>>
>> Thanks for reaching out!
>>
>> No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I
>> think there is a good chance that it can be used successfully.
>>
>> If you hit any problems in your tests, feel free to let us know :)
>>
>> Regards,
>> Keyong Zhou
>>
>> Curtis Howard wrote on Fri, Mar 8, 2024 at 22:23:
>>
>>> Hi,
>>>
>>> We would like to confirm the reason for Celeborn not being listed with
>>> Spark 3.2 and JDK17 in the compatibility matrix here:
>>> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>>>
>>> Is the only reason that the Apache Spark 3.2 release does not
>>> officially support JDK17 (as covered in
>>> https://issues.apache.org/jira/browse/SPARK-33772), or have other
>>> Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
>>> combination?
>>>
>>> We currently build Spark ourselves with custom dependencies and are
>>> successfully using Spark 3.2 with JDK17. Understanding that this
>>> combination has likely not been tested with Celeborn, we are wondering
>>> whether there are any known blockers for Celeborn + Spark 3.2 + JDK17,
>>> or whether there is a good chance that it could still be used
>>> successfully.
>>>
>>> Thank you!
>>> Curtis
Re: Celeborn for Spark 3.2 + JDK17
Hi Curtis,

Thanks for reaching out!

No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I think there is a good chance that it can be used successfully.

If you hit any problems in your tests, feel free to let us know :)

Regards,
Keyong Zhou

Curtis Howard wrote on Fri, Mar 8, 2024 at 22:23:
> Hi,
>
> We would like to confirm the reason for Celeborn not being listed with
> Spark 3.2 and JDK17 in the compatibility matrix here:
> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>
> Is the only reason that the Apache Spark 3.2 release does not officially
> support JDK17 (as covered in
> https://issues.apache.org/jira/browse/SPARK-33772), or have other
> Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
> combination?
>
> We currently build Spark ourselves with custom dependencies and are
> successfully using Spark 3.2 with JDK17. Understanding that this
> combination has likely not been tested with Celeborn, we are wondering
> whether there are any known blockers for Celeborn + Spark 3.2 + JDK17, or
> whether there is a good chance that it could still be used successfully.
>
> Thank you!
> Curtis
Re: Celeborn Spark dynamic allocation support (in Spark versions < 3.5)
Hi Curtis,

Thanks for reaching out!

You are right: after applying the corresponding patch[1], dynamic allocation will function as expected for Spark versions older than 3.5.0.

[1] https://github.com/apache/incubator-celeborn?tab=readme-ov-file#support-spark-dynamic-allocation

Regards,
Keyong Zhou

Curtis Howard wrote on Fri, Mar 8, 2024 at 22:32:
> Hi,
>
> The documentation mentions that Spark 3.5 or greater is required for Spark
> dynamic allocation to be supported through Celeborn:
> https://celeborn.apache.org/docs/latest/deploy/#spark-configuration
>
>     # Support Spark Dynamic Resource Allocation
>     # Required Spark version >= 3.5.0
>
> Can I confirm that the only reason for this prerequisite (Spark 3.5 or
> greater) is that versions below 3.5 require the Spark patches described
> here to be integrated first?
> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#support-spark-dynamic-allocation
>
> For example, once patched and rebuilt, dynamic allocation with, say,
> Spark 3.2 and Celeborn should function as expected.
>
> Thank you
> Curtis
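For readers setting this up, a minimal sketch of the relevant spark-defaults.conf entries after patching an older Spark (the endpoint value is a placeholder, and your deployment may need additional settings; consult the Celeborn deployment docs for the authoritative list):

```properties
spark.shuffle.manager                 org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints       <celeborn-master-host>:9097
spark.dynamicAllocation.enabled       true
# Celeborn replaces the external shuffle service, so it stays disabled.
spark.shuffle.service.enabled         false
```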
[ANNOUNCE] Add Xiaofeng Jiang as new committer
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn has invited Xiaofeng Jiang to become a committer and we are pleased to announce that he has accepted.

Being a committer enables easier contribution to the project since there is no need to go via the patch submission process. This should enable better productivity. A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Xiaofeng Jiang!

Thanks,
Keyong Zhou
[ANNOUNCE] Add Yihe Li as new committer
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn has invited Yihe Li to become a committer and we are pleased to announce that he has accepted.

Being a committer enables easier contribution to the project since there is no need to go via the patch submission process. This should enable better productivity. A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Yihe Li!

Thanks,
Keyong Zhou
[ANNOUNCE] Add zhongqiangchen(Zhongqiang Chen) as new committer
Hi Celeborn(-incubating) community,

I'm very excited to announce that we have recently added zhongqiangchen (Zhongqiang Chen) as our new committer!

zhongqiangchen has been contributing to Celeborn for nearly five months, mainly on Flink support. We look forward to zhongqiangchen continuing to contribute to the project and pushing Celeborn to the next level together with all the contributors in the community!

We also look forward to adding more and more committers to our project :)

Thanks!
Keyong Zhou