My company is leaning towards moving much of its analytics work from our
own Spark/Mesos/HDFS/Cassandra setup to Redshift.  To date, I have been
the internal advocate for using Spark for analytics, but a number of good
points have been raised.  The reasons being pushed are:

- Redshift exposes a JDBC interface out of the box (no devops work
there), and the data looks and feels like it lives in a normal SQL
database.  They want this out of the box from Spark, without having to
figure out which version of Hive/Shark/SparkSQL matches which version of
Spark.  Yes, the next release theoretically supports this, but the
release issues our team has battled to date erode that trust.  (A rough
sketch of what this access looks like follows this list.)

- Complaints about the challenges we have faced running a Spark shell
locally against a cluster in EC2.  It is partly a devops issue:
deploying the correct configurations to local machines, being able to
kick off a user who is hogging RAM, and so on.  (See the config sketch
after the list.)

- "I want to be able to run queries from my python shell against your
sequence file data, roll it up and in the same shell leverage python graph
tools."  - I'm not very familiar with the Python setup, but I believe by
being able to run locally AND somehow add custom libraries to be accessed
from PySpark this could be done.

- "Joins will perform much better (in RedShift) because it says it sorts
it's keys.  We cannot pre-compute all joins away."
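
Not to concede the first point, but for concreteness: Redshift speaks
the Postgres wire protocol, so the "out of the box" access they mean is
roughly this (endpoint, credentials, and table are all made up):

    import psycopg2

    # Any Postgres driver works against Redshift; 5439 is its default port.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="analyst", password="...")
    cur = conn.cursor()
    cur.execute("SELECT event_type, count(*) FROM events GROUP BY 1")
    print(cur.fetchall())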

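On the second point, my thinking is that most of the RAM-hogging can be
capped in configuration rather than by devops intervention; a minimal
sketch, assuming a standalone master at a hypothetical URL:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://ec2-master.example.com:7077")  # hypothetical
            .setAppName("local-shell")
            .set("spark.executor.memory", "2g")   # per-executor RAM cap
            .set("spark.cores.max", "4"))         # cores one user may take
    sc = SparkContext(conf=conf)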

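And on the Python-shell point, here is the kind of thing I believe is
possible, assuming a PySpark recent enough to expose sequenceFile and a
local matplotlib install (path and fields are invented):

    import matplotlib.pyplot as plt
    from pyspark import SparkContext

    sc = SparkContext()  # picks up the cluster config discussed above

    # Roll up the sequence file data by key on the cluster, then pull
    # the small result back into the local shell.
    counts = (sc.sequenceFile("hdfs:///data/events")
                .map(lambda kv: (kv[0], 1))
                .reduceByKey(lambda a, b: a + b)
                .collect())

    labels, values = zip(*counts)
    plt.bar(range(len(values)), values)
    plt.xticks(range(len(values)), labels, rotation=45)
    plt.show()
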
Basically, their argument is two-fold:

1) We get tooling out of the box from Redshift (specifically, stable
JDBC access); with Spark we are often waiting for devops to get the
right combination of tools working, or for libraries to support sequence
files.

2) There is a belief that for many of our queries (assumed to often be
joins) a columnar database will perform orders of magnitude better.
(See the DDL sketch after this list.)
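
For what it's worth, the join claim rests on declaring distribution and
sort keys up front.  A sketch of the DDL they have in mind, issued over
the same psycopg2 connection as above (names invented):

    # DISTKEY co-locates rows with the same user_id on one node, and
    # SORTKEY keeps them sorted, which is what can enable a merge join.
    cur.execute("""
        CREATE TABLE events (
            user_id    BIGINT,
            event_ts   TIMESTAMP,
            event_type VARCHAR(32))
        DISTKEY (user_id)
        SORTKEY (user_id, event_ts)""")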



Anyway, a test is being set up to compare the two on the performance
side, but from a tools perspective it is hard to counter the issues that
have been raised.
