Just to point out: the benchmark you link to has Redshift running on HDD 
machines instead of SSD, and it is still faster than Shark in all but one case.

Like Gary, I'm also interested in replacing something we have on Redshift with 
Spark SQL, as it will give me much greater capability to process things. I'm 
willing to sacrifice some performance for the greater capability. But it would 
be nice to see the benchmark updated with Spark SQL, and with a more 
competitive configuration of Redshift.

Best regards, and keep up the great work!


1) We get tooling out of the box from RedShift (specifically, stable JDBC 
access) - with Spark we are often waiting for devops to get the right combo of 
tools working or for libraries to support sequence files.

The arguments about JDBC access and simpler setup definitely make sense. My 
first non-trivial Spark application was actually an ETL process that sliced and 
diced JSON + tabular data and then loaded it into Redshift. From there on you 
got all the benefits of your average C-store database, plus the added benefit 
of Amazon managing many annoying setup and admin details for your Redshift 

One area I'm looking forward to seeing Spark SQL excel at is offering fast JDBC 
access to "raw" data--i.e. directly against S3 / HDFS; no ETL required. For 
easy and flexible data exploration, I don't think you can beat that with a 
C-store that you have to ETL stuff into.

2) There is a belief that for many of our queries (assumed to often be joins) a 
columnar database will perform orders of magnitude better.

This is definitely an "it depends" statement, but there is a detailed benchmark 
here <https://amplab.cs.berkeley.edu/benchmark/> comparing Shark, Redshift, and 
other systems. Have you seen it? Redshift does very well, but Shark is on par 
with or better than it in most of the tests. Of course, going forward we'll 
want to see Spark SQL match this kind of performance, and that remains to be 
seen.


On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf 
<malouf.g...@gmail.com> wrote:
My company is leaning towards moving much of its analytics work from our own 
Spark/Mesos/HDFS/Cassandra setup to RedShift.  To date, I have been the 
internal advocate for using Spark for analytics, but a number of good points 
have been brought up to me.  The reasons being pushed are:

- RedShift exposes a JDBC interface out of the box (no devops work there) and 
data looks and feels like it is in a normal SQL database.  They want this out 
of the box from Spark, without trying to figure out which version matches this 
version of Hive/Shark/SparkSQL etc.  Yes, the next release theoretically 
supports this, but there have been release issues our team has battled to date 
that erode the trust.

- Complaints around challenges we have faced running a Spark shell locally 
against a cluster in EC2.  It is partly a devops issue of deploying the correct 
configurations to local machines, being able to kick a user off hogging RAM, 

- "I want to be able to run queries from my Python shell against your sequence 
file data, roll it up and in the same shell leverage Python graph tools."  - 
I'm not very familiar with the Python setup, but I believe by being able to run 
locally AND somehow add custom libraries to be accessed from PySpark this could 
be done.

- "Joins will perform much better (in RedShift) because it says it sorts its 
keys.  We cannot pre-compute all joins away."
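
As a rough illustration of the sorted-keys argument: when both inputs are 
already sorted on the join key, a join can be done in a single linear pass 
over each side, with no hashing or re-sorting. The sketch below is a toy 
merge join in plain Python. It is not how RedShift or Spark implements joins; 
the function name `merge_join` and the `users`/`orders` sample rows are made 
up for the example.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs.

    Assumption: both lists are already sorted by key (this is the
    property a sort key is meant to provide).
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        key_l, key_r = left[i][0], right[j][0]
        if key_l < key_r:
            i += 1          # left key has no match yet, advance left
        elif key_l > key_r:
            j += 1          # right key has no match yet, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key_l:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key_l:
                j_end += 1
            # Emit every pairing of rows within the two runs.
            for _, left_val in left[i:i_end]:
                for _, right_val in right[j:j_end]:
                    out.append((key_l, left_val, right_val))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, both sorted by the integer key.
users = [(1, "ann"), (2, "bob"), (4, "dee")]
orders = [(1, "book"), (1, "pen"), (3, "cup"), (4, "mug")]
joined = merge_join(users, orders)
```

Whether this advantage shows up in practice depends on whether the data is 
actually sorted on the columns being joined, which is part of why the 
performance test matters.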

Basically, their argument is two-fold:

1) We get tooling out of the box from RedShift (specifically, stable JDBC 
access) - with Spark we are often waiting for devops to get the right combo of 
tools working or for libraries to support sequence files.

2) There is a belief that for many of our queries (assumed to often be joins) a 
columnar database will perform orders of magnitude better.

Anyway, a test is being set up to compare the two on the performance side, but 
from a tools perspective it's hard to counter the issues that are brought up.

