My company is leaning toward moving much of its analytics work from our own Spark/Mesos/HDFS/Cassandra setup to Redshift. To date, I have been the internal advocate for using Spark for analytics, but a number of good points have been brought up to me. The reasons being pushed are:
- Redshift exposes a JDBC interface out of the box (no devops work there), and the data looks and feels like it is in a normal SQL database. They want that same out-of-the-box experience from Spark, with no trying to figure out which version matches which version of Hive/Shark/Spark SQL, etc. Yes, the next release theoretically supports this, but the release issues our team has battled to date have eroded that trust. (A connection sketch follows this list.)
- Complaints about the challenges we have faced running a Spark shell locally against a cluster in EC2. It is partly a devops issue: deploying the correct configurations to local machines, being able to kick off a user who is hogging RAM, and so on.
- "I want to be able to run queries from my Python shell against your sequence file data, roll it up, and in the same shell leverage Python graph tools." I'm not very familiar with the Python setup, but I believe this could be done by running PySpark locally and somehow adding custom libraries to be accessed from PySpark (see the second sketch below).
- "Joins will perform much better (in Redshift) because the docs say it sorts its keys. We cannot pre-compute all joins away." (See the sort-key sketch below.)

Basically, their argument is two-fold:

1) We get tooling out of the box from Redshift (specifically, stable JDBC access), whereas with Spark we are often waiting for devops to get the right combination of tools working, or for libraries to support sequence files.
2) There is a belief that for many of our queries (often assumed to be joins), a columnar database will perform orders of magnitude better.

Anyway, a test is being set up to compare the two on the performance side, but from a tools perspective it is hard to counter the issues that have been brought up.
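To make the first point concrete, here is a minimal sketch of what "out of the box" access looks like from Python. Redshift speaks the Postgres wire protocol, so any ordinary Postgres driver (JDBC, ODBC, or psycopg2 here) works; the hostname, credentials, and table/column names below are all made up for illustration.

```python
import psycopg2

# Hypothetical cluster endpoint and credentials; Redshift listens on
# port 5439 by default and accepts standard Postgres-protocol clients.
conn = psycopg2.connect(
    host="analytics.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="...",  # placeholder
)

with conn.cursor() as cur:
    # An ad-hoc join over hypothetical tables, with no cluster-side
    # devops work beyond loading the data and handing out credentials.
    cur.execute("""
        SELECT u.region, COUNT(*)
        FROM events e
        JOIN users u ON u.user_id = e.user_id
        GROUP BY u.region
    """)
    for region, n in cur.fetchall():
        print(region, n)

conn.close()
```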
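For the third point, the workflow in the quote looks plausible with PySpark alone. Below is a hedged sketch, not something we have running: the master URL, the HDFS path, the `mylib.py` module, and the assumption that the sequence file's keys and values convert cleanly to Python strings are all mine, and it assumes a PySpark version with `sequenceFile` support.

```python
from pyspark import SparkContext

# Point a *local* Python shell at the EC2 cluster (hypothetical master URL).
sc = SparkContext("spark://ec2-master.example:7077", "adhoc-rollup")

# Custom code can be shipped to the executors from the same shell.
sc.addPyFile("mylib.py")  # hypothetical local module

# sequenceFile() yields (key, value) pairs; assume values are numeric strings.
pairs = sc.sequenceFile("hdfs:///data/events/part-*")
rollup = (pairs
          .map(lambda kv: (kv[0], float(kv[1])))
          .reduceByKey(lambda a, b: a + b)
          .collect())  # small aggregate, so bringing it local is safe

# Still in the same shell, hand the result to ordinary Python graph tools.
import matplotlib.pyplot as plt
keys, totals = zip(*sorted(rollup))
plt.bar(range(len(keys)), totals)
plt.xticks(range(len(keys)), keys, rotation=90)
plt.show()
```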
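On the sort-key claim: what Redshift actually promises is that a table is stored sorted on its SORTKEY, and that a join between two tables distributed and sorted on the join column can use a merge join instead of a hash join. A sketch of the table definitions that would make that possible (names invented, `conn` reused from the first sketch):

```python
with conn.cursor() as cur:
    # Co-locate and co-sort both tables on the join column so the
    # planner can consider a merge join.
    cur.execute("""
        CREATE TABLE events (
            user_id BIGINT,
            ts      TIMESTAMP,
            payload VARCHAR(256)
        ) DISTKEY (user_id) SORTKEY (user_id);

        CREATE TABLE users (
            user_id BIGINT,
            region  VARCHAR(32)
        ) DISTKEY (user_id) SORTKEY (user_id);
    """)
    conn.commit()

    # Inspect the chosen join strategy; a merge join is only available
    # when distribution and sort order line up as above and the tables
    # are actually sorted (i.e. vacuumed).
    cur.execute("""
        EXPLAIN
        SELECT u.region, COUNT(*)
        FROM events e
        JOIN users u ON u.user_id = e.user_id
        GROUP BY u.region
    """)
    for (line,) in cur.fetchall():
        print(line)
```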