Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-13 Thread Amit Sela
Taking it to a more basic level, I compared a simple transformation with RDDs and with Datasets. This is far simpler than Renato's use case, and it brings up two good questions: 1. Is the time it takes to "spin up" a standalone instance of Spark(SQL) just an additional one-time overhead

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-12 Thread Renato Marroquín Mogrovejo
Hi Amit, This is very interesting indeed, because I have got similar results. I tried doing a filter + groupBy using a Dataset with a function, and using the inner RDD of the DF (RDD[Row]). I used the inner RDD of a DataFrame because apparently there is no straightforward way to create an RDD of

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
Somehow missed that ;) Anything about Datasets slowness?

On Wed, May 11, 2016, 21:02 Ted Yu wrote:
> Which release are you using?
>
> You can use the following to disable the UI:
> --conf spark.ui.enabled=false
>
> On Wed, May 11, 2016 at 10:59 AM, Amit Sela

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Ted Yu
Which release are you using?

You can use the following to disable the UI:
--conf spark.ui.enabled=false

On Wed, May 11, 2016 at 10:59 AM, Amit Sela wrote:
> I've run a simple WordCount example with a very small List as
> input lines and ran it in standalone (local[*]), and

Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
I've run a simple WordCount example with a very small List as input lines and ran it in standalone (local[*]), and Datasets is very slow. We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec. Is this just start-up overhead? Please note that I'm not timing the context creation... And
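
For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two variants could be timed side by side. It is not the code used in this thread. It assumes Spark 2.x or later with the Scala API (SparkSession, Dataset.groupByKey); the object name WordCountTiming, the helper timed, and the sample input are hypothetical. The context is created outside the timed region, as in the original post, and the job runs several times so that a one-time start-up cost would show up as a slow first run followed by faster ones.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical harness: names and input are illustrative, not from the thread.
object WordCountTiming {

  // Runs `body` and returns its result with the elapsed wall-clock time in ms.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // WordCount on the RDD API.
  def rddWordCount(spark: SparkSession, lines: Seq[String]): Map[String, Long] =
    spark.sparkContext.parallelize(lines)
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // The same WordCount on the typed Dataset API.
  def datasetWordCount(spark: SparkSession, lines: Seq[String]): Map[String, Long] = {
    import spark.implicits._ // encoders for String and (String, Long)
    lines.toDS()
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .groupByKey(identity)
      .count()
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Context creation is deliberately outside the timed region.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wordcount-timing")
      .config("spark.ui.enabled", "false") // same setting as the --conf flag above
      .getOrCreate()

    val lines = Seq("to be or not to be", "that is the question")
    try {
      // Repeat the measurement: if the gap is start-up overhead,
      // later runs should be much closer than the first one.
      for (run <- 1 to 3) {
        val (rddCounts, rddMs) = timed(rddWordCount(spark, lines))
        val (dsCounts, dsMs) = timed(datasetWordCount(spark, lines))
        assert(rddCounts == dsCounts, "both variants should agree on the counts")
        println(s"run $run: RDD $rddMs ms, Dataset $dsMs ms")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Comparing the first run against the later ones is one way to separate a one-time cost from a steady per-job cost; it does not by itself explain where the time goes.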