I started playing around with Datasets on Spark 2.0 this morning and I'm 
surprised by the significant performance difference I'm seeing between an RDD 
and a Dataset for a very basic example.


I've defined a simple case class called AnnotationText that has a handful of 
fields.


I create a Dataset[AnnotationText] from my data, repartition it into 4 
partitions on one of the columns, and cache the resulting dataset as ds 
(forcing the cache by executing a count action).  Everything looks good and I 
have more than 10M records in my dataset ds.
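
For reference, the setup looks roughly like this (the field names in 
AnnotationText and the input path are placeholders; the real class has a 
handful of fields):

case class AnnotationText(docId: Long, text: String)  // placeholder fields

import spark.implicits._

// Read the source data (path is illustrative), repartition into 4
// partitions on a column, and cache
val ds = spark.read.parquet("/data/annotations")
  .as[AnnotationText]
  .repartition(4, $"docId")
  .cache()

ds.count()  // action to force materialization of the cache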

When I execute the following:

ds.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count

It consistently finishes in just under 3 seconds.  One of the things I notice 
is that it has 3 stages.  The first stage is skipped (as it had to do with the 
creation of ds, which was already cached).  The second stage appears to do the 
filtering (requires 4 tasks) but, interestingly, it shuffles output.  The third 
stage (requires only 1 task) appears to count the results of the shuffle.

When I look at the cached dataset (on 4 partitions) it is 82.6MB.

I then decided to convert the ds dataset to an RDD as follows, repartition(4) 
and cache:

val aRDD = ds.rdd.repartition(4).cache
aRDD.count

So, I now have an RDD[AnnotationText].

When I execute the following:

aRDD.filter(textAnnotation => textAnnotation.text == "mutational".toLowerCase).count

It consistently finishes in just under half a second.  One of the things I 
notice is that it only has 2 stages.  The first stage is skipped (as it had 
to do with the creation of aRDD, which was already cached).  The second stage 
appears to do both the filtering and the count (requires 4 tasks).  
Interestingly, there is no shuffle (and consequently no 3rd stage).

When I look at the cached RDD (on 4 partitions) it is 2.9GB.


I was surprised how significant the cached storage difference was between the 
Dataset (82.6MB) and the RDD (2.9GB) version of the same content.  Is this kind 
of difference to be expected?

While I like the smaller size of the Dataset version, I was confused as to why 
the performance of the Dataset version was so much slower (~3s vs ~0.5s).  I 
suspect it might be attributable to the shuffle and third stage required by the 
Dataset version, but I'm not sure.  I was under the impression that Datasets 
should (would) be faster in many use cases (such as the one above).  
Am I doing something wrong, or is this to be expected?
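
For comparison, I believe the same filter could also be written as an untyped 
Column expression rather than a typed lambda, which (if I understand Catalyst 
correctly) should let it operate on the cached binary rows without 
deserializing each record into an AnnotationText object.  I haven't verified 
whether this changes the timing:

// Untyped filter on the column; avoids the lambda's per-object deserialization
ds.filter($"text" === "mutational").count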

Thanks.

Darin.
