Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
I'd like to start a discussion on enabling push-based shuffle in Spark. This is meant to address issues with existing shuffle inefficiency in a large-scale Spark compute infra deployment. Facebook's previous talks on SOS shuffle and

Re: Correctness and data loss issues

2020-01-21 Thread Wenchen Fan
I think we need to go through them during the 3.0 QA period, and try to fix the valid ones. For example, the first ticket should be fixed already in https://issues.apache.org/jira/browse/SPARK-28344 On Mon, Jan 20, 2020 at 2:07 PM Dongjoon Hyun wrote: > Hi, All. > > According to our policy,

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Hyukjin Kwon
+1. If it becomes a problem for any reason, we can consider another option (https://github.com/apache/spark/pull/27307#issuecomment-576951473) later. On Wed, Jan 22, 2020 at 8:23 AM, Dongjoon Hyun wrote: > +1, I'm supporting the following proposal. > > > this mirror as the primary repo in the build,

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread Reynold Xin
Thanks for writing this up. Usually when people talk about push-based shuffle, they are motivating it primarily to reduce the latency of short queries, by pipelining the map phase, shuffle phase, and the reduce phase (which this design isn't going to address). It's interesting you are

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-01-21 Thread Reynold Xin
If your UDF itself is very CPU intensive, it probably won't make that much of a difference, because the UDF itself will dwarf the serialization/deserialization overhead. If your UDF is cheap, it will help tremendously. On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote: > > > > Hi,

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Reynold Xin
This seems reasonable! On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > +1, I'm supporting the following proposal. > > > > this mirror as the primary repo in the build, falling back to Central if > needed. > > > Thanks, > Dongjoon. > > > > On Tue,

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread mshen
Hi Reynold, Thanks for the comments. Although in the SPIP doc, a big portion of the problem motivation is around optimizing small random reads for shuffle, I believe the benefit of this design is beyond that. In terms of the approach we take, it is true that the map phase would still need to

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-01-21 Thread Walaa Eldin Moustafa
Hi, At LinkedIn, we have some benchmarks that show that UDFs in the Expression API are more performant than Hive Generic UDFs (I am not sure which APIs you used to implement your baseline, but I expect Scala UDFs or Hive Generic UDFs). In fact, we have built a full-fledged UDF API (scalar for

Re: Correctness and data loss issues

2020-01-21 Thread Dongjoon Hyun
Thank you for checking, Wenchen! Sure, we need to do that. Another question is "What can we do for the 2.4.5 release?" Some of the fixes cannot be backported due to technical difficulty, such as the following. 1. https://issues.apache.org/jira/browse/SPARK-26154 Stream-stream joins -

Call for presentations for ApacheCon North America 2020 now open

2020-01-21 Thread Rich Bowen
Dear Apache enthusiast, (You’re receiving this message because you are subscribed to one or more project mailing lists at the Apache Software Foundation.) The call for presentations for ApacheCon North America 2020 is now open at https://apachecon.com/acna2020/cfp ApacheCon will be held at

Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Sean Owen
See https://github.com/apache/spark/pull/27307 for some context. We've had to add, in at least one place, some settings to resolve artifacts from a mirror besides Maven Central to work around some build problems. Now, we find it might be simpler to just use this mirror as the primary repo in the

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Dongjoon Hyun
+1, I'm supporting the following proposal. > this mirror as the primary repo in the build, falling back to Central if needed. Thanks, Dongjoon. On Tue, Jan 21, 2020 at 14:37 Sean Owen wrote: > See https://github.com/apache/spark/pull/27307 for some context. We've > had to add, in at least