DataFrame distinct vs RDD distinct

2015-05-07 Thread Olivier Girardot
Hi everyone, there seem to be different implementations of the "distinct" feature in DataFrames and RDDs, and a performance issue with the DataFrame distinct API. In RDD.scala: def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { map(x => (x, null)).reduceBy
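
[Editor's sketch] The quoted RDD.scala snippet is cut off in this archive, so the following is only a minimal illustration of the pattern it starts with: pair each element with a placeholder value, then collapse the pairs by key. It uses plain Scala collections rather than Spark. The names DistinctSketch and distinctByKey are hypothetical, and groupBy stands in for the by-key shuffle as an assumption. This is not Spark's actual implementation.

```scala
// Sketch only: plain Scala collections, not Spark's RDD API.
object DistinctSketch {
  // Hypothetical helper: deduplicate by pairing each element with a
  // placeholder, grouping the pairs by key, and keeping each key once.
  def distinctByKey[T](xs: Seq[T]): Seq[T] =
    xs.map(x => (x, null)) // pair every element with a dummy value
      .groupBy(_._1)       // assumption: stands in for the by-key shuffle
      .keys                // one key survives per group of duplicates
      .toSeq

  def main(args: Array[String]): Unit = {
    val input = Seq(3, 1, 3, 2, 1)
    // Group order is unspecified, so sort before printing.
    println(distinctByKey(input).sorted.mkString(", ")) // prints "1, 2, 3"
  }
}
```

The placeholder value carries no information; only the keys matter, which is why each group of duplicates can be collapsed to a single key.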

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Reynold Xin
In 1.5, we will most likely just rewrite distinct in SQL to either use the Aggregate operator, which will benefit from all the Tungsten optimizations, or have a Tungsten version of distinct for SQL/DataFrame. On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote:

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Olivier Girardot
OK, but for the moment, this seems to be killing performance on some computations... I'll try to give you precise figures on this comparing RDD and DataFrame. Olivier. On Thu, May 7, 2015 at 10:08, Reynold Xin wrote: > In 1.5, we will most likely just rewrite distinct in SQL to either use the >

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Michael Armbrust
I'd happily merge a PR that changes the distinct implementation to be more like Spark core, assuming it includes benchmarks that show better performance for both the "fits in memory case" and the "too big for memory case". On Thu, May 7, 2015 at 2:23 AM, Olivier Girardot < o.girar...@lateral-thoug

Re: DataFrame distinct vs RDD distinct

2015-05-08 Thread Olivier Girardot
I'll try to reproduce what has been reported to me first :) and I'll let you know. Thanks! On Thu, May 7, 2015 at 21:16, Michael Armbrust wrote: > I'd happily merge a PR that changes the distinct implementation to be more > like Spark core, assuming it includes benchmarks that show better > pe

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
I'll try to reproduce what has been reported to me first :) and I'll let you know. Thanks! On Thu, May 7, 2015 at 21:16, Michael Armbrust wrote: > I'd happily merge a PR that changes the distinct implementation to be > more like Spark core, assum

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
s1, s2) => s1 ++= s2) Best regards, Alexander -Original Message- From: Ulanov, Alexander Sent: Monday, May 11, 2015 11:59 AM To: Olivier Girardot; Michael Armbrust Cc: Reynold Xin; dev@spark.apache.org Subject: RE: DataFrame distinct vs RDD distinct Hi, Could you suggest alternative way