Sounds like context would help; I just didn't want to subject people to a
wall of text if it wasn't necessary :)

Currently we use neither Spark SQL (nor anything else in the Hadoop stack)
nor Redshift.  We serve templated queries from the appserver, i.e. the user
fills out some forms and dropdowns, and we translate that into a query.

Data is "basically" one table containing thousands of independent time
series, with one or two tables of reference data to join to.  e.g. median
value of Field1 from Table1 where Field2 from Table 2 matches X filter, T1
and T2 joining on a surrogate key, group by a different Field3.  The data
structure is a little bit dynamic.  User can upload any CSV, as long as they
tell us the name of each column and the programmatic type.  The target data
size is about a billion records, 20'ish fields, distributed throughout a
year (about 50GB on disk as CSV, uncompressed).
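
To make that concrete, here's a rough sketch (in Scala, against Spark SQL)
of what one of those templated queries could look like.  Everything here is
a made-up stand-in: the paths, view names, and columns (Field1, Field2,
Field3, surrogate_key) just mirror the example above, and percentile_approx
gives an approximate median, since an exact distributed median is expensive.

    import org.apache.spark.sql.SparkSession

    // Sketch only: paths and column names are hypothetical stand-ins for
    // whatever the user's CSV upload declares.
    object TemplatedQuerySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("templated-query").getOrCreate()

        // User-uploaded CSVs; in practice the schema would come from the
        // user's declared column names/types rather than inference.
        spark.read.option("header", "true").option("inferSchema", "true")
          .csv("hdfs:///uploads/table1.csv").createOrReplaceTempView("t1")
        spark.read.option("header", "true").option("inferSchema", "true")
          .csv("hdfs:///uploads/table2.csv").createOrReplaceTempView("t2")

        // Median of Field1 where T2.Field2 matches filter X, joined on a
        // surrogate key, grouped by Field3.
        spark.sql("""
          SELECT t1.Field3,
                 percentile_approx(t1.Field1, 0.5) AS median_field1
          FROM t1
          JOIN t2 ON t1.surrogate_key = t2.surrogate_key
          WHERE t2.Field2 = 'X'
          GROUP BY t1.Field3
        """).show()

        spark.stop()
      }
    }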

So we're currently doing "historical" analytics (e.g. seeing analytic
results for only yesterday's data or older, but seeing those results
"quickly").  We eventually intend to do "realtime" (or "streaming")
analytics, i.e. seeing the impact of new data on analytics "quickly".
Machine learning is also on the roadmap.

One proposition is Spark SQL as a complete replacement for Redshift.  It
would simplify the architecture, since our long-term strategy is to handle
data intake and ETL on HDFS (regardless of Redshift vs. Spark SQL).  Which
other parts of the Hadoop family would come into play for ETL is
undetermined right now.  Spark SQL appears to have the relational ability we
need, and if we're going to use the Hadoop stack for ML and streaming
analytics anyway, why not do it all on one stack and avoid shoveling data
around?  Also, lots of people are talking about it.
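
For what it's worth, here's the kind of thing I mean by "one stack",
sketched in Scala with Spark's Structured Streaming file source.  The paths
and schema are assumed; the point is that the same relational query runs
over the historical table as a batch job and over newly arriving files as a
stream, without shoveling data between systems.

    import org.apache.spark.sql.SparkSession

    // Sketch of the "one stack" idea; paths are hypothetical.
    object OneStackSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("one-stack").getOrCreate()

        // Historical analytics: batch query over the full dataset.
        val historical = spark.read.parquet("hdfs:///warehouse/events")
        historical.createOrReplaceTempView("events")
        spark.sql("SELECT Field3, avg(Field1) FROM events GROUP BY Field3").show()

        // "Realtime" analytics: the same logical query over files as they land.
        spark.readStream.schema(historical.schema)
          .parquet("hdfs:///incoming/events")
          .createOrReplaceTempView("events_stream")
        spark.sql("SELECT Field3, avg(Field1) FROM events_stream GROUP BY Field3")
          .writeStream.outputMode("complete").format("console")
          .start().awaitTermination()
      }
    }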

The other proposition is Redshift as the historical analytics solution, and
something else (could be Spark, doesn't matter) for streaming analytics and
ML.  If we need to relate the two, we'll have an API or process to stitch
them together.  I've read about the "lambda architecture", which more or
less describes this approach.  The motivation is that Redshift has the AWS
reliability/scalability/operational concerns worked out and a richer query
language (SQL and pgsql-style functions designed for slice-n-dice
analytics), so we can spend our coding time elsewhere, plus a measure of
safety against design issues and bugs: Spark just came out of incubator
status this year, and it's much easier to find people on the web raving
about Redshift in real-world usage (i.e. as part of a live, client-facing
system) than about Spark.
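
To illustrate the "richer query language" point: window functions are the
sort of slice-n-dice Redshift expresses natively in SQL.  A hedged sketch in
Scala over JDBC; the endpoint, credentials, and table are placeholders, and
it assumes a Redshift-compatible JDBC driver (the stock PostgreSQL driver
also works, since Redshift speaks the Postgres wire protocol).

    import java.sql.DriverManager

    // Illustration only: connection details and table are made up.
    object RedshiftWindowSketch {
      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:postgresql://example.redshift.amazonaws.com:5439/analytics",
          "user", "password")
        try {
          // A 7-row rolling average per time series, computed in-database.
          val rs = conn.createStatement().executeQuery(
            """SELECT series_id, ts, value,
              |       avg(value) OVER (PARTITION BY series_id ORDER BY ts
              |                        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
              |         AS rolling_avg
              |FROM observations
              |ORDER BY series_id, ts""".stripMargin)
          while (rs.next())
            println(s"${rs.getString("series_id")}  ${rs.getDouble("rolling_avg")}")
        } finally conn.close()
      }
    }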

category_theory's observation that most of the speed comes from fitting in
memory is helpful.  It's what I would have surmised from the AMPLab Big Data
Benchmark, but confirmation from the hands-on community is invaluable, thank
you.

I understand a lot of it simply comes down to what-do-you-value-more
weightings, and we'll do prototypes/benchmarks if we have to; I just wasn't
sure whether there were any other "key assumptions/requirements/gotchas" to
consider.



