I agree. My personal experience with Spark core is that it performs really well once you tune it properly.
As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe.

A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results.

Thanks
-Soumya

On Fri, Oct 31, 2014 at 7:35 PM, Du Li <l...@yahoo-inc.com> wrote:

> We have seen all kinds of results published that often contradict each
> other. My take is that the authors often know more tricks about how to tune
> their own/familiar products than the others. So the product on focus is
> tuned for ideal performance while the competitors are not. The authors are
> not necessarily biased but as a consequence the results are.
>
> Ideally it’s critical for the user community to be informed of all the
> in-depth tuning tricks of all products. However, realistically, there is a
> big gap in terms of documentation. Hope the Spark folks will make a
> difference. :-)
>
> Du
>
>
> From: Soumya Simanta <soumya.sima...@gmail.com>
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: SparkSQL performance
>
> I was really surprised to see the results here, esp. SparkSQL "not
> completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
> I was under the impression that SparkSQL performs really well because it
> can optimize the RDD operations and load only the columns that are
> required. This essentially means in most cases SparkSQL should be as fast
> as Spark is.
>
> I would be very interested to hear what others in the group have to say
> about this.
>
> Thanks
> -Soumya