Taking it to a more basic level, I compared a simple transformation with
RDDs and with Datasets. This is far simpler than Renato's use case, and it
brings up two good questions:
1. Is the time it takes to "spin up" a standalone instance of Spark(SQL)
just an additional one-time overhead? That would be reasonable, especially
for the first version of Datasets.
2. Are Datasets, in some cases, slower than RDDs? If so, in which cases,
and why?
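
For reference, the msec figures below are wall-clock time around the
transformation + action only (context creation excluded, as noted in my
original mail). Roughly this, as a sketch of the measurement:

long start = System.currentTimeMillis();
// transformation + action go here (the map + show/collect from the snippets below)
System.out.println("took " + (System.currentTimeMillis() - start) + " msec");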

*Datasets code*: ~2000 msec
SQLContext sqc = createSQLContext(createContext());
sqc.createDataset(WORDS, Encoders.STRING())
    .map(new MapFunction<String, String>() {
      @Override
      public String call(String value) throws Exception {
        return value.toUpperCase();
      }
    }, Encoders.STRING())
    .show();

*RDDs code*: < 500 msec
JavaSparkContext jsc = createContext();
List<String> res = jsc.parallelize(WORDS)
    .map(new Function<String, String>() {
      @Override
      public String call(String v1) throws Exception {
        return v1.toUpperCase();
      }
    })
    .collect();

*Those are the context creation functions:*
static SQLContext createSQLContext(JavaSparkContext jsc) {
  return new SQLContext(jsc);
}
static JavaSparkContext createContext() {
  return new JavaSparkContext(new SparkConf().setMaster("local[*]").setAppName("WordCount")
      .set("spark.ui.enabled", "false"));
}
*And the input:*
List<String> WORDS = Arrays.asList("hi there", "hi", "hi sue bob", "hi sue", "bob hi");

On Thu, May 12, 2016 at 12:04 PM Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Hi Amit,
>
> This is very interesting indeed, because I have gotten similar results. I
> tried doing a filter + groupBy using a Dataset with a function, and using
> the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a
> DataFrame because apparently there is no straightforward way to create an
> RDD of Parquet data without creating a SQLContext. If anybody has some
> code to share with me, please share (:
> I used 1GB of Parquet data, and the operations were much faster with the
> RDD. After looking at the execution plans, it is clear why Datasets do
> worse: to use them, an extra map operation is done to turn Row objects
> into the defined case class, and then the Dataset goes through the whole
> query optimization platform (Catalyst, plus moving objects in and out of
> Tungsten). Thus, I think for operations that are too "simple", it is more
> expensive to use the entire DS/DF infrastructure than the inner RDD.
> IMHO, if you have complex SQL queries it makes sense to use DS/DF, but if
> you don't, then using RDDs directly is probably still faster.
>
>
> Renato M.
>
> 2016-05-11 20:17 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>
>> Somehow missed that ;)
>> Any thoughts on the Datasets slowness?
>>
>> On Wed, May 11, 2016, 21:02 Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Which release are you using?
>>>
>>> You can use the following to disable UI:
>>> --conf spark.ui.enabled=false
>>>
>>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela <amitsel...@gmail.com>
>>> wrote:
>>>
>>>> I've run a simple WordCount example with a very small List<String> as
>>>> the input lines, in standalone mode (local[*]), and Datasets are very
>>>> slow: we're talking ~700 msec for RDDs while Datasets take ~3.5 sec.
>>>> Is this just start-up overhead? Please note that I'm not timing the
>>>> context creation...
>>>>
>>>> And in general, is there a way to run local[*] in a "lightweight" mode
>>>> for testing? Something like running without the WebUI server, for
>>>> example (and anything else that's not needed for testing purposes)?
>>>>
>>>> Thanks,
>>>> Amit
>>>>
>>>
>>>
>
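
Renato, regarding the extra map you saw in the execution plan: I believe
you can print it straight from the Dataset with explain(true) (extended
mode). A sketch, reusing my snippet from above (`upper` here just names the
same MapFunction):

MapFunction<String, String> upper = new MapFunction<String, String>() {
  @Override
  public String call(String value) throws Exception {
    return value.toUpperCase();
  }
};
sqc.createDataset(WORDS, Encoders.STRING())
    .map(upper, Encoders.STRING())
    .explain(true);  // should show the extra (de)serialization steps around the map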
