Re: EMR + celeborn

2024-06-17 Thread Keyong Zhou
Hi rahul,

Are you referring to AWS EMR or Aliyun EMR? You are right that the Spark
driver and executors talk directly to the Celeborn Master and Workers; as
long as the networks are connected, there should be no issue.

For Aliyun EMR, if the EMR cluster and Celeborn are in the same VPC, they can
talk to each other directly without additional setup. I'm not that familiar
with AWS EMR, but I think it should be similar. I'll ask some of the Celeborn
users to see if they can provide more information.

BTW, if you are using AWS EMR with Celeborn, please make sure the following
config is set to false:
spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false
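
For example, a minimal sketch of passing it via spark-submit (the shuffle
manager class and master endpoint follow the Celeborn docs; the host below is
just a placeholder, adjust to your deployment):

spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
  --conf spark.celeborn.master.endpoints=<celeborn-master-host>:9097 \
  --conf spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false \
  ...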

Regards,
Keyong Zhou

On Tue, Jun 18, 2024 at 4:53 AM rahul gidwani wrote:

> Hello,
>
> I was wondering if anyone here has tried running Celeborn in k8s for Spark
> on EMR?  Our Spark clusters currently run in EMR and we want to utilize
> Celeborn for shuffle, but I was wondering whether there is any tricky
> networking setup needed to get this working, since I assume that any node in
> EMR would eventually talk directly to a Celeborn worker node.
>
> Thank you
>
>


Necessary config for AWS Spark

2024-05-15 Thread Keyong Zhou
Dear Celeborn users,

If you are running AWS Spark with Celeborn, please make sure the following
Spark config is added; otherwise there might be undefined behavior:

spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled=false
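
For instance, it can go in $SPARK_HOME/conf/spark-defaults.conf next to your
other Celeborn client settings:

spark.celeborn.client.spark.push.unsafeRow.fastWrite.enabled  false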

Regards,
Keyong Zhou


[ANNOUNCE] Add Mridul Muralidharan as new committer

2024-04-28 Thread Keyong Zhou
Hi Celeborn Community,

The Project Management Committee (PMC) for Apache Celeborn
has invited Mridul Muralidharan to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A PMC member helps manage and guide the direction of the project.

Please join me in congratulating Mridul Muralidharan!

Regards,
Keyong Zhou


Re: How does Celeborn handles FetchFailed exception

2024-04-03 Thread Keyong Zhou
Hi Sanskar,

Thanks for your interest in Celeborn! Here is the design doc[1], and this
thread[2] has extensive discussion, which I believe can be helpful.

For your question, LifecycleManager registers shuffle tracker callback
after creation, which calls unregisterAllMapAndMergeOutput:

lifecycleManager.registerShuffleTrackerCallback(
    shuffleId -> SparkUtils.unregisterAllMapOutput(mapOutputTracker, shuffleId));


The callback will be called when LifecycleManager handles shuffle fetch
failure:

  private def handleReportShuffleFetchFailure(
      context: RpcCallContext,
      appShuffleId: Int,
      shuffleId: Int): Unit = {
    ...
    appShuffleTrackerCallback match {
      case Some(callback) =>
        try {
          callback.accept(appShuffleId)
        ...
  }

And the ShuffleClient inside executors reports a shuffle fetch failure
whenever one is encountered:

  public boolean reportShuffleFetchFailure(int appShuffleId, int shuffleId) {
    ...
    PbReportShuffleFetchFailureResponse pbReportShuffleFetchFailureResponse =
        lifecycleManagerRef.askSync(
            pbReportShuffleFetchFailure,
            conf.clientRpcRegisterShuffleRpcAskTimeout(),
            ClassTag$.MODULE$.apply(PbReportShuffleFetchFailureResponse.class));
    ...
  }
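
So the chain is ShuffleClient -> LifecycleManager -> callback ->
SparkUtils.unregisterAllMapOutput. A rough sketch of what that last step
boils down to (based on the discussion above, not the exact Celeborn source;
the signature is illustrative):

// Hypothetical sketch: drop every map (and merge) output of the shuffle so
// Spark recomputes the whole map stage, since Celeborn reports mapIndex = -1.
def unregisterAllMapOutput(mapOutputTracker: MapOutputTrackerMaster, shuffleId: Int): Unit = {
  mapOutputTracker.unregisterAllMapAndMergeOutput(shuffleId)
}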

cc Erik, Mridul

[1]
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
[2] https://lists.apache.org/thread/nogsgc61qh2zomdvhfbs4b0y88s3qtqg

Regards,
Keyong Zhou

On Wed, Apr 3, 2024 at 6:35 PM Sanskar Modi wrote:

> Hi Celeborn community,
>
> I wanted to better understand how Celeborn handles FetchFailed
> exceptions. In Spark's `DAGScheduler` fetch failure handling, the code tries
> to unregister the map output for the fetch-failed mapIndex:
>
> } else if (mapIndex != -1) {
>   // Mark the map whose fetch failed as broken in the map stage
>   mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)
> }
>
> But in the Celeborn case mapIndex will always be -1, so how does the
> shuffle output get cleared in that case?
> Ideally mapOutputTracker.unregisterAllMapAndMergeOutput(shuffleId) should
> be called for the fetch-failed stage, but I'm not able to find that code.
>
> Can someone help me understand this? I might be missing something basic
> here.
>


[ANNOUNCE] Add Chandni Singh as new committer

2024-03-21 Thread Keyong Zhou
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Chandni Singh to become a committer and we are pleased
to announce that she has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Chandni Singh!

Thanks,
Keyong Zhou


Re: Celeborn for Spark 3.2 + JDK17

2024-03-12 Thread Keyong Zhou
I think so. Do you mind sending a PR to Celeborn? :)

Regards,
Keyong Zhou

On Tue, Mar 12, 2024 at 10:58 PM Curtis Howard wrote:

> Thanks Keyong,
>
> I was able to build the 'client' JARs for Spark 3.2 / JDK21, and run
> simple tests; however I did face a runtime error similar to this:
>
> Caused by: java.lang.ExceptionInInitializerError: Exception
> java.lang.IllegalStateException: java.lang.NoSuchMethodException:
> java.nio.DirectByteBuffer.<init>(long, int) [in thread "Executor task
> launch worker for task 0.0 in stage 0.0 (TID 0)"]
>
> at org.apache.celeborn.common.unsafe.Platform.<clinit>(Platform.java:135)
>
> ... 16 more
>
> I used the following patch (borrowed from the almost identical upstream
> Spark project patch for SPARK-42369
> <https://github.com/apache/spark/pull/39909> that was required for JDK21,
> as a result of JDK-8303083 <https://bugs.openjdk.org/browse/JDK-8303083>),
> to work around this successfully (I think it may be required in Celeborn
> Platform.java as well, for JDK21?):
>
> diff --git a/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java b/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
> index ec541a77..218d517b 100644
> --- a/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
> +++ b/common/src/main/java/org/apache/celeborn/common/unsafe/Platform.java
> @@ -90,7 +90,15 @@ public final class Platform {
>      }
>      try {
>        Class<?> cls = Class.forName("java.nio.DirectByteBuffer");
> -      Constructor<?> constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
> +      Constructor<?> constructor;
> +      try {
> +        constructor = cls.getDeclaredConstructor(Long.TYPE, Integer.TYPE);
> +      } catch (NoSuchMethodException e) {
> +        // DirectByteBuffer(long, int) was removed in
> +        // https://github.com/openjdk/jdk/commit/a56598f5a534cc9223367e7faa8433ea38661db9
> +        constructor = cls.getDeclaredConstructor(Long.TYPE, Long.TYPE);
> +      }
> +
>        Field cleanerField = cls.getDeclaredField("cleaner");
>        try {
>          constructor.setAccessible(true);
>
> Curtis
>
>
>
> On Tue, Mar 12, 2024 at 10:32 AM Keyong Zhou  wrote:
>
>> Hi Curtis,
>>
>> With this PR https://github.com/apache/incubator-celeborn/pull/2385 you
>> can compile with JDK21 using the following command:
>>  ./build/make-distribution.sh -Pspark-3.5 -Pjdk-21
>>
>> Regards,
>> Keyong Zhou
>>
>> On Sat, Mar 9, 2024 at 2:38 AM Curtis Howard wrote:
>>
>>> Thank you Keyong!
>>>
>>> Related to this, has testing started for Celeborn with JDK21?  (any
>>> anticipated concerns there, based on what you know so far?).
>>> We will be migrating to JDK21 shortly, which is why I ask.
>>>
>>> Thanks again
>>> Curtis
>>>
>>> On Fri, Mar 8, 2024 at 11:05 AM Keyong Zhou  wrote:
>>>
>>>> Hi Curtis,
>>>>
>>>> Thanks for reaching out!
>>>>
>>>> No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I
>>>> think there is a good chance that it could be used successfully.
>>>>
>>>> Any problems with your test, feel free to let us know :)
>>>>
>>>> Regards,
>>>> Keyong Zhou
>>>>
>>>> On Fri, Mar 8, 2024 at 10:23 PM Curtis Howard wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We would like to confirm the reason for Celeborn not being listed with
>>>>> Spark 3.2 and JDK17, as shown in the compatibility matrix here:
>>>>> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>>>>>
>>>>> Is the reason for this only because the Apache Spark 3.2 release does
>>>>> not officially support JDK17 (as covered in
>>>>> https://issues.apache.org/jira/browse/SPARK-33772), or have
>>>>> other Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
>>>>> version combination?
>>>>>
>>>>> We currently build Spark ourselves with custom dependencies, and are
>>>>> successfully using Spark 3.2 with JDK17.  Understanding that this
>>>>> combination has likely not been tested with Celeborn, we are wondering if
>>>>> there are any known blockers for Celeborn + Spark 3.2 + JDK17, or if there
>>>>> is a good chance that it could still be used successfully.
>>>>>
>>>>> Thank you!
>>>>> Curtis
>>>>>
>>>>


Re: Celeborn for Spark 3.2 + JDK17

2024-03-12 Thread Keyong Zhou
Hi Curtis,

With this PR https://github.com/apache/incubator-celeborn/pull/2385 you can
compile with JDK21 using the following command:
 ./build/make-distribution.sh -Pspark-3.5 -Pjdk-21

Regards,
Keyong Zhou

On Sat, Mar 9, 2024 at 2:38 AM Curtis Howard wrote:

> Thank you Keyong!
>
> Related to this, has testing started for Celeborn with JDK21?  (any
> anticipated concerns there, based on what you know so far?).
> We will be migrating to JDK21 shortly, which is why I ask.
>
> Thanks again
> Curtis
>
> On Fri, Mar 8, 2024 at 11:05 AM Keyong Zhou  wrote:
>
>> Hi Curtis,
>>
>> Thanks for reaching out!
>>
>> No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I
>> think there is a good chance that it could be used successfully.
>>
>> Any problems with your test, feel free to let us know :)
>>
>> Regards,
>> Keyong Zhou
>>
>> On Fri, Mar 8, 2024 at 10:23 PM Curtis Howard wrote:
>>
>>> Hi,
>>>
>>> We would like to confirm the reason for Celeborn not being listed with
>>> Spark 3.2 and JDK17, as shown in the compatibility matrix here:
>>> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>>>
>>> Is the reason for this only because the Apache Spark 3.2 release does
>>> not officially support JDK17 (as covered in
>>> https://issues.apache.org/jira/browse/SPARK-33772), or have
>>> other Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
>>> version combination?
>>>
>>> We currently build Spark ourselves with custom dependencies, and are
>>> successfully using Spark 3.2 with JDK17.  Understanding that this
>>> combination has likely not been tested with Celeborn, we are wondering if
>>> there are any known blockers for Celeborn + Spark 3.2 + JDK17, or if there
>>> is a good chance that it could still be used successfully.
>>>
>>> Thank you!
>>> Curtis
>>>
>>


Re: Celeborn for Spark 3.2 + JDK17

2024-03-08 Thread Keyong Zhou
Hi Curtis,

Thanks for reaching out!

No, there are no known blockers for Celeborn + Spark 3.2 + JDK17, and I think
there is a good chance that it could be used successfully.

Any problems with your test, feel free to let us know :)

Regards,
Keyong Zhou

On Fri, Mar 8, 2024 at 10:23 PM Curtis Howard wrote:

> Hi,
>
> We would like to confirm the reason for Celeborn not being listed with
> Spark 3.2 and JDK17, as shown in the compatibility matrix here:
> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#build
>
> Is the reason for this only because the Apache Spark 3.2 release does not
> officially support JDK17 (as covered in
> https://issues.apache.org/jira/browse/SPARK-33772), or have
> other Celeborn-specific conflicts been found with the Spark 3.2 + JDK17
> version combination?
>
> We currently build Spark ourselves with custom dependencies, and are
> successfully using Spark 3.2 with JDK17.  Understanding that this
> combination has likely not been tested with Celeborn, we are wondering if
> there are any known blockers for Celeborn + Spark 3.2 + JDK17, or if there
> is a good chance that it could still be used successfully.
>
> Thank you!
> Curtis
>


Re: Celeborn Spark dynamic allocation support (in Spark versions < 3.5)

2024-03-08 Thread Keyong Zhou
Hi Curtis,

Thanks for reaching out! You are right: after applying the corresponding
patch[1], dynamic allocation will function as expected for Spark versions
older than 3.5.0.

[1]
https://github.com/apache/incubator-celeborn?tab=readme-ov-file#support-spark-dynamic-allocation
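
As a rough sketch (assuming the standard SparkShuffleManager setup from the
README; values are illustrative), a patched Spark < 3.5 job would run with
settings along these lines:

spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.dynamicAllocation.enabled=true
# no external shuffle service is needed once Celeborn serves the shuffle data
spark.shuffle.service.enabled=false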

Regards,
Keyong Zhou

On Fri, Mar 8, 2024 at 10:32 PM Curtis Howard wrote:

> Hi,
>
> The documentation mentions Spark 3.5 or greater is required for Spark
> dynamic allocation to be supported through Celeborn:
> https://celeborn.apache.org/docs/latest/deploy/#spark-configuration
> # Support Spark Dynamic Resource Allocation
> # Required Spark version >= 3.5.0
>
> Can I confirm that the only reason for this (Spark v3.5 or greater
> prerequisite) is that versions less than 3.5 require the Spark patches
> described here to be integrated first?
>
> https://github.com/apache/incubator-celeborn?tab=readme-ov-file#support-spark-dynamic-allocation
> For example, once patched and rebuilt, Spark dynamic allocation with, say,
> Spark v3.2 and Celeborn should function as expected.
>
> Thank you
> Curtis
>


[ANNOUNCE] Add Xiaofeng Jiang as new committer

2024-01-11 Thread Keyong Zhou
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Xiaofeng Jiang to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Xiaofeng Jiang!

Thanks,
Keyong Zhou


[ANNOUNCE] Add Yihe Li as new committer

2023-11-16 Thread Keyong Zhou
Hi Celeborn Community,

The Podling Project Management Committee (PPMC) for Apache Celeborn
has invited Yihe Li to become a committer and we are pleased
to announce that he has accepted.

Being a committer enables easier contribution to the
project since there is no need to go via the patch
submission process. This should enable better productivity.
A (P)PMC member helps manage and guide the direction of the project.

Please join me in congratulating Yihe Li!

Thanks,
Keyong Zhou


[ANNOUNCE] Add zhongqiangchen(Zhongqiang Chen) as new committer

2023-03-14 Thread Keyong Zhou
Hi Celeborn(-incubating) community,

I'm very excited to announce that we recently added
zhongqiangchen (Zhongqiang Chen) as our new committer!

zhongqiangchen has been contributing to Celeborn for nearly five months,
mainly on Flink support.
We look forward to zhongqiangchen continuing to contribute to the project,
pushing Celeborn to the next level together with all contributors in the
community!

Also, we look forward to adding more and more committers to our project :)

Thanks!
Keyong Zhou