On 29 Aug 2016, at 20:58, Julien Dumazert <julien.dumaz...@gmail.com> wrote:

Hi Maciek,

I followed your recommendation and benchmarked DataFrame aggregations on Datasets. Here is what I got:


// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with map and DataFrame sum
// 35.372s
df.as[A].map(_.fieldToSum).agg(sum("value")).collect().head.getAs[Long](0)

Not much of a difference. It seems that as soon as you access the data as you would with RDDs, you force the full decoding of each object into a case class, which is super costly.

I find this behavior quite normal: as soon as you give the user the ability to pass a black-box function, anything can happen, so Spark has to load the whole object. On the other hand, when using SQL-style functions only, everything is "white box", so Spark understands what you want to do and can optimize.
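
For comparison, here is a minimal sketch of the fully "white box" version, using only DataFrame operations and assuming the underlying column is also named fieldToSum as in the snippets above (no decoding into A at all):

// Pure DataFrame aggregation: the optimizer sees the whole plan, so only
// the fieldToSum column needs to be materialised.
import org.apache.spark.sql.functions.sum

val total = df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)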



SQL and the DataFrame code where you ask for a specific field can be handled by the file format itself, which optimises the operation. If you ask for only one column of Parquet or ORC data, then only that column's data should be loaded, and because those formats store columns together, you save all the IO needed to read the discarded columns. Add even more selectiveness (such as ranges on values) and you can even get "predicate pushdown", where blocks of the file are skipped if the input format reader can determine that none of the rows there can match the predicate's conditions.
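
A rough sketch of what that looks like in code, with a hypothetical Parquet path and threshold (only the referenced column should be read, and row groups whose statistics rule out the predicate can be skipped):

// Column pruning + predicate pushdown against a columnar source.
// Assumes a SparkSession named spark, e.g. in spark-shell.
import org.apache.spark.sql.functions.{col, sum}

val events = spark.read.parquet("/path/to/events")   // hypothetical path
events
  .select("fieldToSum")              // only this column's data is read
  .where(col("fieldToSum") > 0)      // predicate pushed down to the Parquet reader
  .agg(sum("fieldToSum"))
  .show()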

You should be able to get away with something like df.select("field").... to project down to just the fields you want first, then stay in code rather than SQL.
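
A rough sketch of that suggestion, assuming fieldToSum is a Long as in the snippets above: project the single column first, then drop back into typed code for the reduction.

import spark.implicits._   // for the Long encoder

val total = df.select("fieldToSum").as[Long].reduce(_ + _)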

Anyway, experiment: it's always more accurate than the opinions of others, especially when applied to your own datasets.
