Re: Spark v Redshift
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use Spark to do the ETL that puts data into a database such as Redshift, or you might pull data out of Redshift into Spark for machine learning.

On the other hand, if *all* you want to do is SQL and you are okay with the set of data formats and features in Redshift (i.e. you can express everything using its UDFs and you have a way to get data in), then Redshift is a complete service that handles more of the management out of the box.

Matei

On Nov 4, 2014, at 3:11 PM, agfung agf...@gmail.com wrote:

> I'm in the midst of a heated debate about the use of Redshift v Spark with a colleague. We keep trading anecdotes and links back and forth (e.g. the Airbnb post from 2013 or the AMPLab benchmarks), and we don't seem to be getting anywhere.
>
> So before we start down the prototype / benchmark road, and in the hope of finding *some* kind of objective third-party perspective, I was wondering if anyone who has used both in 2014 would care to comment on the sweet-spot use cases / gotchas for non-trivial use (e.g. a simple filter scan isn't really interesting). Soft issues like operational maintenance and time spent developing v out of the box are interesting too...
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark v Redshift
BTW while I haven't actually used Redshift, I've seen many companies that use both, usually using Spark for ETL and advanced analytics and Redshift for SQL on the cleaned / summarized data. Xiangrui Meng also wrote https://github.com/mengxr/redshift-input-format to make it easy to read data exported from Redshift into Spark or Hadoop.

Matei

On Nov 4, 2014, at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

> Is this about Spark SQL vs Redshift, or Spark in general? [...]
Re: Spark v Redshift
This is pretty spot on, though I would also add that the speed Spark touts depends largely on caching the data in memory. Reading off disk still takes time, i.e. pulling the data into an RDD in the first place. This is the reason Spark is great for ML: the data is used over and over again to fit models, so it's pulled into memory once and then analyzed repeatedly by the algorithms. Other systems, such as Mahout on MapReduce, read and write to disk repeatedly and are thus slower (though Mahout is being ported over to Spark as well, to compete with MLlib).

J

Jimmy McErlain
Data Scientist (Nerd)
E: ji...@sellpoints.com

On Tue, Nov 4, 2014 at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

> Is this about Spark SQL vs Redshift, or Spark in general? [...]
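The caching point above can be illustrated with a toy sketch in plain Python. This is not Spark code: the data file, the `load_points` helper and the read counter are all hypothetical, made up only to show the shape of the argument. An iterative fit that re-reads its input on every pass pays the disk cost each time, while one that loads the data once and reuses the in-memory copy pays it once, which is roughly the effect caching an RDD is meant to give.

```python
import csv
import os
import tempfile

disk_reads = 0  # counts how many times the input file is parsed from disk


def load_points(path):
    """Parse (x, y) pairs from a CSV file; every call is one full disk read."""
    global disk_reads
    disk_reads += 1
    with open(path, newline="") as f:
        return [(float(x), float(y)) for x, y in csv.reader(f)]


def gradient_step(points, w, lr=0.01):
    """One gradient-descent step for the model y = w * x (squared error)."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit_rereading(path, iterations):
    """Re-read the input on every iteration (the disk-bound pattern)."""
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(load_points(path), w)
    return w


def fit_cached(path, iterations):
    """Read the input once, then iterate over the in-memory copy."""
    points = load_points(path)
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(points, w)
    return w


# Hypothetical data set: y = 3x for x in 1..10.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    csv.writer(f).writerows((x, 3 * x) for x in range(1, 11))
    data_path = f.name

disk_reads = 0
w_reread = fit_rereading(data_path, iterations=50)
reads_without_cache = disk_reads

disk_reads = 0
w_cached = fit_cached(data_path, iterations=50)
reads_with_cache = disk_reads

os.remove(data_path)
print(reads_without_cache, reads_with_cache)  # prints "50 1"
```

Both runs arrive at the same fitted weight; only the number of passes over the file differs. The first read is still paid in full either way, which is the "reading off disk still takes time" caveat.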
Re: Spark v Redshift
There is no one-size-fits-all solution available in the market today. If somebody tells you they have one, they are simply lying :) The two solutions cater to different sets of problems. My recommendation is to focus on getting a better understanding of the problems you are trying to solve with Spark and Redshift, and to pick the tool based on how effectively it handles those problems. Like Matei said, both might be relevant in some cases.

Thanks
Akshar

On Tue, Nov 4, 2014 at 4:00 PM, Jimmy McErlain ji...@sellpoints.com wrote:

> This is pretty spot on.. [...]

--
Akshar Dave
Principal, Big Data
SoftNet Solutions
www.softnets.com/bigdata
Re: Spark v Redshift
Sounds like context would help; I just didn't want to subject people to a wall of text if it wasn't necessary :)

Currently we use neither Spark SQL (nor anything else in the Hadoop stack) nor Redshift. We service templated queries from the appserver, i.e. the user fills out some forms and dropdowns, and we translate that into a query. The data is basically one table containing thousands of independent time series, with one or two tables of reference data to join to. For example: the median value of Field1 from Table1 where Field2 from Table2 matches filter X, with Table1 and Table2 joined on a surrogate key, grouped by a different field, Field3.

The data structure is a little bit dynamic: a user can upload any CSV, as long as they tell us the name and programmatic type of each column. The target data size is about a billion records with 20-ish fields, distributed throughout a year (about 50GB on disk as uncompressed CSV).

So we're currently doing historical analytics (e.g. we only see analytic results for yesterday's data or older, but want to see them quickly). We eventually intend to do realtime (or streaming) analytics (i.e. see the impact of new data on analytics quickly). Machine learning is also on the roadmap.

One proposition is Spark SQL as a complete replacement for Redshift. It would simplify the architecture, since our long-term strategy is to handle data intake and ETL on HDFS (regardless of Redshift or Spark SQL). Which other parts of the Hadoop family would come into play for ETL is undetermined right now. Spark SQL appears to have relational ability, and if we're going to use the Hadoop stack for ML and streaming analytics anyway, why not do it all on one stack and not shovel data around? Also, lots of people are talking about it.

The other proposition is Redshift as the historical analytics solution, and something else (could be Spark, doesn't matter) for streaming analytics and ML. If we need to relate the two, we'll have an API or process to stitch them together. I've read about the lambda architecture, which more or less describes this approach. The motivation is that Redshift has the AWS reliability / scalability / operational concerns worked out, has a richer query language (SQL and pgsql functions are designed for slice-n-dice analytics) so we can spend our coding time elsewhere, and offers a measure of safety against design issues and bugs: Spark just came out of incubator status this year, and it's much easier to find people on the web raving positively about Redshift in real-world usage (i.e. as part of a live, client-facing system) than about Spark.

category_theory's observation that most of the speed comes from fitting in memory is helpful. It's what I would have surmised from the AMPLab Big Data benchmark, but confirmation from the hands-on community is invaluable, thank you.

I understand a lot of it simply comes down to what-do-you-value-more weightings, and we'll do prototypes / benchmarks if we have to; I just wasn't sure if there were any other key assumptions / requirements / gotchas to consider.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112p18127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
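As a rough illustration of the workload described above, here is a minimal sketch in plain Python using the standard-library sqlite3 module as a stand-in for whichever engine ends up being chosen. Everything named here is an assumption for illustration: the table and column names (table1, table2, field1, field2, field3, skey), the sample rows, the set of supported types, and the `load_csv` and `median_by_group` helpers. It loads CSV text against a user-declared schema, then answers the templated query: median of field1, joined to reference data on a surrogate key, filtered on field2 and grouped by field3.

```python
import csv
import io
import sqlite3
from collections import defaultdict
from statistics import median

# Converters for the "programmatic types" a user may declare (assumed set).
CONVERTERS = {"int": int, "float": float, "str": str}
SQL_TYPES = {"int": "INTEGER", "float": "REAL", "str": "TEXT"}


def load_csv(conn, table, schema, text):
    """Create `table` from a user-declared schema and load CSV rows into it.

    `schema` is a list of (column_name, type_name) pairs supplied by the user.
    Names are trusted here; a real service would have to validate them.
    """
    cols = ", ".join(f"{name} {SQL_TYPES[t]}" for name, t in schema)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    rows = [
        tuple(CONVERTERS[t](value) for (_, t), value in zip(schema, row))
        for row in csv.reader(io.StringIO(text))
    ]
    marks = ", ".join("?" for _ in schema)
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def median_by_group(conn, field2_filter):
    """Median of field1 per field3, for rows whose reference field2 matches."""
    cursor = conn.execute(
        "SELECT t1.field3, t1.field1 FROM table1 AS t1 "
        "JOIN table2 AS t2 ON t1.skey = t2.skey WHERE t2.field2 = ?",
        (field2_filter,),
    )
    groups = defaultdict(list)
    for field3, field1 in cursor:
        groups[field3].append(field1)
    # The median is computed in Python rather than relying on a SQL function.
    return {group: median(values) for group, values in groups.items()}


# Hypothetical sample data: three series, two of which match filter "X".
conn = sqlite3.connect(":memory:")
load_csv(
    conn,
    "table1",
    [("skey", "int"), ("field1", "float"), ("field3", "str")],
    "1,10.0,a\n1,30.0,a\n1,20.0,a\n2,5.0,b\n2,7.0,b\n3,99.0,a\n",
)
load_csv(
    conn,
    "table2",
    [("skey", "int"), ("field2", "str")],
    "1,X\n2,X\n3,Y\n",
)
result = median_by_group(conn, "X")
print(result)
```

Median is the awkward part of this template: unlike a sum or a count, it cannot be combined from per-partition partial results without further work, so how each candidate engine computes it at a billion rows is probably worth including in any prototype or benchmark.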