Sounds like context would help; I just didn't want to subject people to a wall of text if it wasn't necessary :)
Currently we use neither Spark SQL (nor anything else in the Hadoop stack) nor Redshift. We serve templated queries from the appserver: the user fills out forms and dropdowns, and we translate those into a query. The data is essentially one table containing thousands of independent time series, with one or two tables of reference data to join against. For example: the median value of Field1 from Table1, where Field2 from Table2 matches filter X, with Table1 and Table2 joining on a surrogate key, grouped by a different Field3.

The data structure is somewhat dynamic: a user can upload any CSV, as long as they tell us the name and programmatic type of each column. The target data size is about a billion records with roughly 20 fields, distributed throughout a year (about 50 GB on disk as uncompressed CSV). So we're currently doing "historical" analytics (analytic results over yesterday's data or older, but we want to see the result quickly). We eventually intend to do "realtime" (or "streaming") analytics, i.e. seeing the impact of new data on analytics quickly. Machine learning is also on the roadmap.

One proposition is Spark SQL as a complete replacement for Redshift. It would simplify the architecture, since our long-term strategy is to handle data intake and ETL on HDFS (regardless of Redshift or Spark SQL). Which other parts of the Hadoop family would come into play for ETL is undetermined right now. Spark SQL appears to have relational ability, and if we're going to use the Hadoop stack for ML and streaming analytics anyway, and it has the ability, why not do it all on one stack and avoid shoveling data around? Also, lots of people are talking about it.

The other proposition is Redshift as the historical-analytics solution, with something else (could be Spark, doesn't matter) for streaming analytics and ML. If we need to relate the two, we'll have an API or process to stitch them together. I've read about the "lambda architecture", which more or less describes this approach.
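To make the templated-query flow concrete, here's a minimal sketch of translating form selections into the join/aggregate described above. All names (Table1/Table2, field1..field3, the surrogate key `sk`) are hypothetical placeholders, sqlite3 stands in for whichever warehouse is chosen, and since SQLite has no built-in MEDIAN the per-group median is computed in Python:

```python
import sqlite3
from collections import defaultdict
from statistics import median

# Hypothetical schema standing in for the user-uploaded data:
# Table1(sk, field1, field3) -- the time-series "fact" table
# Table2(sk, field2)         -- reference data, joined on surrogate key sk
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (sk INTEGER, field1 REAL, field3 TEXT);
    CREATE TABLE Table2 (sk INTEGER, field2 TEXT);
    INSERT INTO Table1 VALUES (1, 10.0, 'a'), (2, 20.0, 'a'),
                              (3, 30.0, 'b'), (4, 40.0, 'b');
    INSERT INTO Table2 VALUES (1, 'X'), (2, 'X'), (3, 'X'), (4, 'Y');
""")

# Column names come from user dropdowns, so only whitelisted
# identifiers may ever be interpolated into the SQL text.
ALLOWED_COLUMNS = {"field1", "field2", "field3"}

def median_by_group(conn, value_col, filter_col, filter_val, group_col):
    """Translate form selections into the templated join query,
    then compute the median per group (SQLite lacks MEDIAN)."""
    if not {value_col, filter_col, group_col} <= ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    sql = (
        f"SELECT t1.{group_col}, t1.{value_col} "
        f"FROM Table1 t1 JOIN Table2 t2 ON t1.sk = t2.sk "
        f"WHERE t2.{filter_col} = ?"            # value is parameterized
    )
    groups = defaultdict(list)
    for grp, val in conn.execute(sql, (filter_val,)):
        groups[grp].append(val)
    return {g: median(vs) for g, vs in groups.items()}

print(median_by_group(conn, "field1", "field2", "X", "field3"))
# -> {'a': 15.0, 'b': 30.0}
```

The same shape (identifier whitelist plus parameterized filter values) applies whether the backing engine is Redshift or Spark SQL; only the dialect of the generated query changes.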
The motivation is that Redshift has the AWS reliability/scalability/operational concerns worked out and a richer query language (SQL and pgsql functions are designed for slice-and-dice analytics), so we can spend our coding time elsewhere, plus a measure of safety against design issues and bugs: Spark only came out of incubator status this year, and it's much easier to find people on the web raving about Redshift in real-world usage (i.e. as part of a live, client-facing system) than about Spark.

category_theory's observation that most of the speed comes from fitting in memory is helpful. It's what I would have surmised from the AMPLab Big Data benchmark, but confirmation from the hands-on community is invaluable, thank you. I understand a lot of it simply comes down to what-do-you-value-more weightings, and we'll do prototypes/benchmarks if we have to; I just wasn't sure if there were any other key assumptions, requirements, or gotchas to consider.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112p18127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.