The performance difference comes from the need to serialize and
deserialize the data to and from AnnotationText: a typed filter takes a
Scala closure, so each encoded row has to be deserialized into an
AnnotationText object before the predicate can run. The extra stage is
probably very quick and shouldn't have much impact.
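
For comparison, here is a sketch of the same count written against the
column directly (assuming the case class has a field named "text", as
the lambda suggests); a Column predicate lets Catalyst evaluate the
filter on the encoded rows without deserializing each one into an
AnnotationText:

import org.apache.spark.sql.functions.col

// Untyped filter: the predicate runs on the serialized (Tungsten)
// rows, so no per-row deserialization into AnnotationText is needed.
ds.filter(col("text") === "mutational").count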

If you try caching the RDD in serialized mode, it would slow down a lot
too.
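
A quick way to try that, using the standard StorageLevel API:

import org.apache.spark.storage.StorageLevel

// Cache each partition as serialized bytes; every subsequent action
// then pays the deserialization cost again, which should bring the
// RDD timings much closer to the Dataset's.
val serRDD = ds.rdd.repartition(4).persist(StorageLevel.MEMORY_ONLY_SER)
serRDD.count  // force the cache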


On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath <ddmcbe...@yahoo.com.invalid>
wrote:

> I started playing around with Datasets on Spark 2.0 this morning and I'm
> surprised by the significant performance difference I'm seeing between an
> RDD and a Dataset for a very basic example.
>
>
> I've defined a simple case class called AnnotationText that has a handful
> of fields.
>
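> For illustration, something along these lines (the field names here
> are hypothetical; the exact fields aren't important for the question):
>
> case class AnnotationText(docId: String, annotType: String,
>                           startOffset: Long, endOffset: Long,
>                           text: String)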
>
> I create a Dataset[AnnotationText] with my data and repartition(4) this on
> one of the columns and cache the resulting dataset as ds (force the cache
> by executing a count action).  Everything looks good and I have more than
> 10M records in my dataset ds.
>
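> Roughly like this (the repartition column "docId" is hypothetical;
> I repartition on one of the case class columns):
>
> import spark.implicits._
>
> // annotations: an RDD or Seq of AnnotationText
> val ds = spark.createDataset(annotations)
>   .repartition(4, $"docId")
>   .cache()
> ds.count  // force the cache
>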
> When I execute the following:
>
> ds.filter(textAnnotation => textAnnotation.text ==
> "mutational".toLowerCase).count
>
> It consistently finishes in just under 3 seconds.  One of the things I
> notice is that it has 3 stages.  The first stage is skipped (as this had to
> do with the creation of ds, which was already cached).  The second stage appears
> to do the filtering (requires 4 tasks) but interestingly it shuffles
> output.  The third stage (requires only 1 task) appears to count the
> results of the shuffle.
>
> When I look at the cached dataset (on 4 partitions) it is 82.6MB.
>
> I then decided to convert the ds dataset to an RDD as follows,
> repartition(4) and cache.
>
> val aRDD = ds.rdd.repartition(4).cache
> aRDD.count
>
> So, I now have an RDD[AnnotationText].
>
> When I execute the following:
>
> aRDD.filter(textAnnotation => textAnnotation.text ==
> "mutational".toLowerCase).count
>
> It consistently finishes in just under half a second.  One of the things I
> notice is that it only has 2 stages.  The first stage is skipped (as this
> had to do with creation of aRDD and it was already cached).  The second
> stage appears to do the filtering and count (requires 4 tasks).
> Interestingly, there is no shuffle (and no subsequent third stage).
>
> When I look at the cached RDD (on 4 partitions) it is 2.9GB.
>
>
> I was surprised how significant the cached storage difference was between
> the Dataset (82.6MB) and the RDD (2.9GB) version of the same content.  Is
> this kind of difference to be expected?
>
> While I like the smaller size for the Dataset version, I was confused as
> to why the performance for the Dataset version was so much slower (2.5s vs
> 0.5s).  I suspect it might be attributed to the shuffle and third stage
> required by the Dataset version, but I'm not sure.  I was under the
> impression that Datasets should be faster in many use cases (such
> as the one I'm using above).  Am I doing something wrong or is this to be
> expected?
>
> Thanks.
>
> Darin.
>
