On 29 Aug 2016, at 20:58, Julien Dumazert <julien.dumaz...@gmail.com> wrote:
> Hi Maciek,
>
> I followed your recommendation and benchmarked DataFrame aggregations on a Dataset. Here is what I got:
>
> // Dataset with RDD-style code
> // 34.223s
> df.as[A].map(_.fieldToSum).reduce(_ + _)
>
> // Dataset with map and DataFrame sum
> // 35.372s
> df.as[A].map(_.fieldToSum).agg(sum("value")).collect().head.getAs[Long](0)
>
> Not much of a difference. It seems that as soon as you access data as in RDDs, you force the full decoding of the object into a case class, which is super costly. I find this behaviour quite normal: as soon as you let the user pass a black-box function, anything can happen, so you have to load the whole object. On the other hand, when using SQL-style functions only, everything is "white box", so Spark understands what you want to do and can optimise.

SQL and DataFrame code where you ask for a specific field can be handled by the file format itself, optimising the operation. If you ask for only one column of Parquet or ORC data, then only that column's data should be loaded, and because those formats store columns together, you save all the IO needed to read the discarded columns. Add even more selectiveness (such as ranges on values) and you can get "predicate pushdown", where whole blocks of the file are skipped if the input format reader can determine that none of the rows there match the predicate's conditions.

You should be able to get away with something like df.select("field")... to project down to the fields you want first, then stay in code rather than SQL.

Anyway, experiment: it's always more accurate than the opinions of others, especially when applied to your own datasets.
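A minimal sketch of that select-first approach, assuming a hypothetical Parquet path and a numeric column named fieldToSum (swap in your own schema):

```scala
import org.apache.spark.sql.SparkSession

object SelectFirstExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("select-first-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input path; any columnar source (Parquet/ORC) benefits.
    val df = spark.read.parquet("data/events.parquet")

    // Project the single column BEFORE going typed, so the Parquet reader
    // only materialises that one column instead of decoding whole rows
    // into a case class.
    val total = df.select($"fieldToSum".as[Long])
      .reduce(_ + _)

    println(s"total = $total")
    spark.stop()
  }
}
```

Because the projection happens on the untyped DataFrame, Catalyst can push the column pruning down to the file format; the subsequent reduce then runs over a narrow Dataset[Long] rather than fully decoded objects.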