Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Not that I know of. We did do some work to make it work faster in the case of lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949 On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote: > > BTW, if this is known, is there an existing JIRA I should link

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
BTW, if this is known, is there an existing JIRA I should link to? On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson wrote: > > At a high level, some candidate strategies are: > 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF > trait itself) so that the update method can

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
They are unfortunately all pretty substantial (which is why this problem exists) ... On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote: > > At a high level, some candidate strategies are: > > 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
At a high level, some candidate strategies are: 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF trait itself) so that the update method can do the right thing. 2. Expose TypedImperativeAggregate to users for defining their own, since it already does the right thing. 3.

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Yes, this is a known performance issue. Do you have any thoughts on how to fix this? On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote: > I describe some of the details here: > https://issues.apache.org/jira/browse/SPARK-27296 > > The short version of the story is that aggregating

UDAFs have an inefficiency problem

2019-03-27 Thread Erik Erlandson
I describe some of the details here: https://issues.apache.org/jira/browse/SPARK-27296 The short version of the story is that aggregating data structures (UDTs) used by UDAFs are serialized to a Row object, and de-serialized, for every row in a data frame. Cheers, Erik
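
The cost pattern described above can be sketched in a few lines of plain Python. This is an illustration only, not Spark code: the word-count buffer, the JSON encoding, and the function names `aggregate_per_row` and `aggregate_once` are all assumptions made for the example, standing in for a UDT-backed aggregation buffer and its Row conversion.

```python
import json


def serialize(counts):
    # Hypothetical stand-in for converting the aggregation buffer to a Row.
    return json.dumps(counts).encode("utf-8")


def deserialize(blob):
    # Hypothetical stand-in for rebuilding the buffer from a Row.
    return json.loads(blob.decode("utf-8"))


def aggregate_per_row(rows):
    """Round-trip the buffer through its serialized form on every input row."""
    stored = serialize({})
    round_trips = 0
    for word in rows:
        counts = deserialize(stored)          # de-serialize for this row
        counts[word] = counts.get(word, 0) + 1
        stored = serialize(counts)            # serialize again right away
        round_trips += 1
    return deserialize(stored), round_trips


def aggregate_once(rows):
    """Keep the live buffer across rows and serialize only at the end."""
    counts = {}
    for word in rows:
        counts[word] = counts.get(word, 0) + 1
    stored = serialize(counts)
    return deserialize(stored), 1
```

Both functions produce the same counts. The first pays one serialize/de-serialize pair per input row, and each pair costs more as the buffer grows; the second pays for a single pair regardless of the number of rows.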

Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread Sean Owen
+1 from me - same as last time. On Wed, Mar 27, 2019 at 1:31 PM DB Tsai wrote: > > Please vote on releasing the following candidate as Apache Spark version > 2.4.1. > > The vote is open until March 30 PST and passes if a majority +1 PMC votes are > cast, with > a minimum of 3 +1 votes. > > [ ]

Spark benchmarks

2019-03-27 Thread Michael Mior
I'm looking for recommendations on benchmarks for Spark. I'm familiar with spark-bench[0], but I haven't found much else that suits my needs. The main property I'm looking for is that the workload of the benchmark should benefit significantly from non-trivial use of Spark's caching mechanism since

Re: [VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread Wenchen Fan
+1, all the known blockers are resolved. Thanks for driving this! On Wed, Mar 27, 2019 at 11:31 AM DB Tsai wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.1. > > The vote is open until March 30 PST and passes if a majority +1 PMC votes > are cast, with >

[VOTE] Release Apache Spark 2.4.1 (RC9)

2019-03-27 Thread DB Tsai
Please vote on releasing the following candidate as Apache Spark version 2.4.1. The vote is open until March 30 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 2.4.1 [ ] -1 Do not release this package because ... To

Re: [DISCUSS] Spark Columnar Processing

2019-03-27 Thread Bobby Evans
Kazuaki Ishizaki, Yes, ColumnarBatchScan does provide a framework for doing code generation for the processing of columnar data. I have to admit that I don't have a deep understanding of the code generation piece, so if I get something wrong please correct me. From what I had seen only input

Re:RE: How to build single jar for single project in spark

2019-03-27 Thread zhangliyun
Thanks a lot, it seems to work now and saves a lot of time. Best Regards, Zhang, Liyun/Kelly Zhang At 2019-03-26 17:49:03, "Ajith shetty" wrote: You can try using the -pl Maven option for this: > mvn clean install -pl :spark-core_2.11 From: Qiu, Gerry To: zhangliyun; dev@spark.apache.org

[GraphX] How to Control Partition Placement?

2019-03-27 Thread Top Trending
Hi, I want to control the placement of the partitions of the Property Graph across my cluster nodes. As I understand, in order to specify the preferred locations for a partition of an RDD, one will need to create a subclass that overrides the getPreferredLocations() function. For example the