Re: Performance of loading parquet files into case classes in Spark

2016-08-30 Thread Steve Loughran
On 29 Aug 2016, at 20:58, Julien Dumazert wrote: Hi Maciek, I followed your recommendation and benchmarked DataFrame aggregations on Datasets. Here is what I got: // Dataset with RDD-style code // 34.223s
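Wall-clock figures like the ones quoted above would typically come from wrapping each aggregation (a Spark action) in a simple timer. A minimal sketch of such a harness follows; the Parquet path, the case class A, and the helper name "time" are illustrative assumptions, not taken from the thread.

import org.apache.spark.sql.SparkSession

// Hypothetical case class standing in for the thread's A.
case class A(fieldToSum: Long)

object AggregationTiming {
  // Crude wall-clock timer; sufficient for coarse comparisons like those quoted above.
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("agg-timing").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/path/to/data") // hypothetical path

    // "Dataset with RDD-style code" from the quoted results
    time("map + reduce")(df.as[A].map(_.fieldToSum).reduce(_ + _))

    spark.stop()
  }
}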

Re: Performance of loading parquet files into case classes in Spark

2016-08-29 Thread Julien Dumazert
Hi Maciek, I followed your recommendation and benchmarked DataFrame aggregations on Datasets. Here is what I got:
// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)
// Dataset with map and DataFrame sum
// 35.372s
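The code for the second variant is cut off in the archive preview, so the following is only an illustration of what a "map then DataFrame sum" aggregation might look like, assuming the same case class A(fieldToSum: Long), a SparkSession with spark.implicits._ imported, and a DataFrame df read from Parquet:

import org.apache.spark.sql.functions.sum

// Dataset with RDD-style code (34.223s in the quoted results)
val totalReduce: Long = df.as[A].map(_.fieldToSum).reduce(_ + _)

// A guess at the "map and DataFrame sum" variant (35.372s quoted):
// map to the field in typed code, then hand the sum to the untyped API.
val totalSum: Long = df.as[A].map(_.fieldToSum)
  .toDF("fieldToSum")
  .agg(sum("fieldToSum"))
  .first()
  .getLong(0)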

Re: Performance of loading parquet files into case classes in Spark

2016-08-28 Thread Julien Dumazert
Hi Maciek, I've tested several variants for summing "fieldToSum". First, RDD-style code:
df.as[A].map(_.fieldToSum).reduce(_ + _)
df.as[A].rdd.map(_.fieldToSum).sum()
df.as[A].map(_.fieldToSum).rdd.sum()
All around 30 seconds. "reduce" and "sum" seem to have the same performance, for this use
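For anyone reproducing these three variants, note that they do not all return the same type: Dataset.reduce keeps the Long, while RDD.sum() goes through Spark's numeric-RDD helpers and returns a Double. A compilable, spark-shell-style sketch, where the Parquet path and the case class are assumptions:

import org.apache.spark.sql.SparkSession

case class A(fieldToSum: Long)

val spark = SparkSession.builder().appName("sum-variants").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/path/to/data") // hypothetical path

// 1. Typed map, then Dataset.reduce: stays a Long end to end.
val s1: Long = df.as[A].map(_.fieldToSum).reduce(_ + _)

// 2. Drop to the RDD first, then RDD.sum(): result comes back as a Double.
val s2: Double = df.as[A].rdd.map(_.fieldToSum).sum()

// 3. Map on the Dataset, then sum its RDD: also a Double.
val s3: Double = df.as[A].map(_.fieldToSum).rdd.sum()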

Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej BryƄski
2016-08-27 15:27 GMT+02:00 Julien Dumazert:
> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
I think reduce and sum have very different performance. Did you try sql.functions.sum? Or if you want to benchmark access to the Row object, then the count() function
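For comparison, the sql.functions.sum route keeps the aggregation inside the SQL engine rather than running a Scala lambda per row. A minimal sketch of both suggestions, with a hypothetical path and the column name from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("sql-sum").getOrCreate()
val df = spark.read.parquet("/path/to/data") // hypothetical path

// sql.functions.sum: the aggregation is planned and executed by Catalyst/Tungsten.
val total: Long = df.agg(sum("fieldToSum")).first().getLong(0)

// count(): one reading of the truncated suggestion is to use it as a baseline
// for how long merely touching every row takes, independent of any field access.
val rows: Long = df.count()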

Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Julien Dumazert
Hi all, I'm forwarding a question I recently asked on Stack Overflow about benchmarking Spark performance when working with case classes stored in Parquet files. I am assessing the
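For context, the setup being benchmarked is roughly "a Dataset of a case class persisted as Parquet and read back"; the names and path below are illustrative assumptions, not taken from the Stack Overflow post:

import org.apache.spark.sql.SparkSession

case class A(fieldToSum: Long)

val spark = SparkSession.builder().appName("parquet-case-class").getOrCreate()
import spark.implicits._

// Write a small sample of case-class records as Parquet...
(1L to 1000L).map(A.apply).toDS().write.mode("overwrite").parquet("/tmp/a_parquet")

// ...then read them back, either untyped (DataFrame) or typed (Dataset[A]).
val df = spark.read.parquet("/tmp/a_parquet")
val ds = df.as[A]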