Hi Amit,

This is very interesting indeed, because I have got similar results. I tried
doing a filter + groupBy using a Dataset with a function, and then the same
using the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a
DataFrame because apparently there is no straightforward way to create an
RDD of Parquet data without creating a sqlContext. If anybody has some code
to share with me, please share (:
I used 1GB of Parquet data, and doing the operations with the RDD was much
faster. After looking at the execution plans, it is clear why Datasets do
worse: to use them, an extra map operation is performed to turn Row objects
into the defined case class. The Dataset then goes through the whole query
optimization platform (Catalyst, plus moving objects in and out of
Tungsten). So I think that for operations that are too "simple", it is more
expensive to use the entire DS/DF infrastructure than the inner RDD.
IMHO, if you have complex SQL queries it makes sense to use DS/DF, but if
you don't, then using RDDs directly is probably still faster.
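For reference, here is a minimal sketch of the two routes I mean, assuming
Spark 1.6-style APIs; the file path and the Record case class are made up
for illustration:

```scala
// Sketch only: "/path/to/data.parquet" and Record are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: String, value: Long) // assumed Parquet schema

object InnerRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("inner-rdd-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parquet can only be read through the SQL layer, hence the sqlContext.
    val df = sqlContext.read.parquet("/path/to/data.parquet")

    // Dataset route: Catalyst plans the query, and each Row is first
    // mapped into the Record case class (the extra map mentioned above).
    val dsCount = df.as[Record]
      .filter(_.value > 0)
      .count()

    // RDD route: operate directly on the DataFrame's inner RDD[Row],
    // bypassing Catalyst and Tungsten entirely.
    val rddCount = df.rdd
      .filter(row => row.getLong(1) > 0)
      .count()

    println(s"ds=$dsCount rdd=$rddCount")
    sc.stop()
  }
}
```

Both produce the same result; the difference shows up in the execution plan
and in runtime for small/simple jobs.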


Renato M.

2016-05-11 20:17 GMT+02:00 Amit Sela <amitsel...@gmail.com>:

> Somehow missed that ;)
> Anything about the Datasets slowness?
>
> On Wed, May 11, 2016, 21:02 Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Which release are you using ?
>>
>> You can use the following to disable UI:
>> --conf spark.ui.enabled=false
>>
>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>
>>> I've run a simple WordCount example with a very small List<String> as
>>> input lines in standalone mode (local[*]), and Datasets are very slow..
>>> We're talking ~700 msec for RDDs while Datasets take ~3.5 sec.
>>> Is this just start-up overhead? Please note that I'm not timing the
>>> context creation...
>>>
>>> And in general, is there a way to run local[*] in a "lightweight" mode
>>> for testing? Something like running without the WebUI server, for
>>> example (and anything else that's not needed for testing purposes)?
>>>
>>> Thanks,
>>> Amit
>>>
>>
>>
