Taking this to a more basic level, I compared a simple transformation implemented with RDDs and with Datasets. This is far simpler than Renato's use case, and it brings up two good questions:
1. Is the time it takes to "spin up" a standalone instance of Spark(SQL) just an additional one-time overhead? That would be reasonable, especially for the first version of Datasets.
2. Are Datasets, in some cases, slower than RDDs? If so, in which cases, and why?
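(A note on the numbers below: they are plain wall-clock timings around the transformation and action only, not around the context creation. A tiny helper along these lines can take that kind of measurement; the helper itself is hypothetical, shown only for illustration, and is not the harness that produced the figures.)

// Hypothetical timing helper: measures only the body passed in, so context
// creation can be kept outside the measurement.
static long timeMillis(Runnable body) {
  long start = System.nanoTime();
  body.run();
  return (System.nanoTime() - start) / 1000000L;
}

// Usage with either snippet below, e.g.:
//   System.out.println(timeMillis(new Runnable() {
//     @Override public void run() { /* Datasets or RDDs snippet */ }
//   }) + " msec");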
*Datasets code*: ~2000 msec

SQLContext sqc = createSQLContext(createContext());
sqc.createDataset(WORDS, Encoders.STRING())
    .map(new MapFunction<String, String>() {
      @Override
      public String call(String value) throws Exception {
        return value.toUpperCase();
      }
    }, Encoders.STRING())
    .show();

*RDDs code*: < 500 msec

JavaSparkContext jsc = createContext();
List<String> res = jsc.parallelize(WORDS)
    .map(new Function<String, String>() {
      @Override
      public String call(String v1) throws Exception {
        return v1.toUpperCase();
      }
    })
    .collect();

*These are the context creation functions:*

static SQLContext createSQLContext(JavaSparkContext jsc) {
  return new SQLContext(jsc);
}

static JavaSparkContext createContext() {
  return new JavaSparkContext(new SparkConf().setMaster("local[*]").setAppName("WordCount")
      .set("spark.ui.enabled", "false"));
}

*And the input:*

List<String> WORDS = Arrays.asList("hi there", "hi", "hi sue bob", "hi sue", "bob hi");

On Thu, May 12, 2016 at 12:04 PM Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Hi Amit,
>
> This is very interesting indeed, because I have got similar results. I
> tried doing a filter + groupBy using a Dataset with a function, and using
> the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a
> DataFrame because apparently there is no straightforward way to create an
> RDD of Parquet data without creating a SQLContext. If anybody has some
> code to share with me, please share (:
> I used 1GB of Parquet data, and when doing the operations with the RDD it
> was much faster. After looking at the execution plans, it is clear why
> Datasets do worse: to use them, an extra map operation is done to turn Row
> objects into the defined case class, and then the Dataset goes through the
> whole query optimization platform (Catalyst, plus moving objects in and
> out of Tungsten).
> Thus, I think that for operations that are too "simple", it is more
> expensive to use the entire DS/DF infrastructure than the inner RDD.
> IMHO if you have complex SQL queries, it makes sense to use DS/DF, but if
> you don't, then probably using RDDs directly is still faster.
>
>
> Renato M.
>
> 2016-05-11 20:17 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>
>> Somehow missed that ;)
>> Anything about the Datasets slowness?
>>
>> On Wed, May 11, 2016, 21:02 Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Which release are you using?
>>>
>>> You can use the following to disable the UI:
>>> --conf spark.ui.enabled=false
>>>
>>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela <amitsel...@gmail.com>
>>> wrote:
>>>
>>>> I've run a simple WordCount example with a very small List<String> as
>>>> the input lines, in standalone mode (local[*]), and Datasets are very
>>>> slow..
>>>> We're talking ~700 msec for RDDs while Datasets take ~3.5 sec.
>>>> Is this just start-up overhead? Please note that I'm not timing the
>>>> context creation...
>>>>
>>>> And in general, is there a way to run local[*] in a "lightweight" mode
>>>> for testing? Something like running without the WebUI server, for
>>>> example (and anything else that's not needed for testing purposes).
>>>>
>>>> Thanks,
>>>> Amit
>>>>
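On the question quoted above about creating an RDD of Parquet data: as Renato notes, reading Parquet apparently still has to go through a SQLContext, but once the DataFrame exists its underlying RDD[Row] can be used directly. A minimal sketch of that pattern follows (the path and the column access are placeholders, not from this thread; the Spark 1.6-era Java API is assumed):

// Sketch of the pattern described above: read Parquet through a SQLContext,
// then drop down to the underlying RDD of Rows and stay on the RDD API.
// The path and the column index are placeholders, not from this thread.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ParquetAsRdd {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[*]").setAppName("ParquetAsRdd")
            .set("spark.ui.enabled", "false"));
    SQLContext sqc = new SQLContext(jsc);

    // Reading Parquet goes through the SQLContext...
    DataFrame df = sqc.read().parquet("/path/to/data.parquet");

    // ...but the actual work can then be done on the plain RDD[Row],
    // bypassing the Dataset/DataFrame (Catalyst + Tungsten) machinery.
    JavaRDD<String> firstColumnUpper = df.javaRDD()
        .map(new Function<Row, String>() {
          @Override
          public String call(Row row) throws Exception {
            return row.getString(0).toUpperCase();
          }
        });

    System.out.println(firstColumnUpper.count());
    jsc.stop();
  }
}

This matches the observation in the quoted mail: the SQLContext is only used to materialize the DataFrame, and the per-record work stays on the RDD side.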