Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general 
provides a broader set of capabilities than Redshift because it has APIs in 
general-purpose languages (Java, Scala, Python) and libraries for things like 
machine learning and graph processing. For example, you might use Spark to do 
the ETL that will put data into a database such as Redshift, or you might pull 
data out of Redshift into Spark for machine learning. On the other hand, if 
*all* you want to do is SQL and you are okay with the set of data formats and 
features in Redshift (i.e. you can express everything using its UDFs and you 
have a way to get data in), then Redshift is a complete service which will do 
more management out of the box.

Matei

 On Nov 4, 2014, at 3:11 PM, agfung agf...@gmail.com wrote:
 
 I'm in the midst of a heated debate with a colleague about the use of
 Redshift v Spark.  We keep trading anecdotes and links back and forth (e.g.
 the Airbnb post from 2013 or the AMPLab benchmarks), and we don't seem to be
 getting anywhere.
 
 So before we start down the prototype/benchmark road, and in desperation
 to find *some* kind of objective third-party perspective, I was wondering
 if anyone who has used both in 2014 would care to comment on the sweet-spot
 use cases and gotchas for non-trivial use (e.g. a simple filter scan isn't
 really interesting).  Soft issues like operational maintenance and time
 spent developing v out of the box are interesting too...
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 





Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
BTW while I haven't actually used Redshift, I've seen many companies that use 
both, usually using Spark for ETL and advanced analytics and Redshift for SQL 
on the cleaned / summarized data. Xiangrui Meng also wrote 
https://github.com/mengxr/redshift-input-format to make it easy to read data 
exported from Redshift into Spark or Hadoop.

Matei




Re: Spark v Redshift

2014-11-04 Thread Jimmy McErlain
This is pretty spot on, though I would also add that the speed features Spark
touts all depend on caching the data in memory. Reading off the disk still
takes time, i.e. pulling the data into an RDD. This is the reason that Spark
is great for ML: the data is used over and over again to fit models, so it's
pulled into memory once and then analyzed repeatedly by the algorithms.
Other systems, such as Mahout, read from and write to disk repeatedly and are
thus slower (though Mahout is getting ported over to Spark as well to compete
with MLlib).
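
To make the caching point concrete, here is a minimal plain-Python sketch. It
is not Spark code: the sample file, the Source class and the toy
gradient-descent loop are all made up for illustration. It fits a
one-parameter model over ten passes and counts how often the data is read
from disk with and without keeping it in memory, which is roughly the role
that cache() plays for an RDD.

```python
import csv
import os
import tempfile


def write_sample(path, n=1000):
    # Hypothetical data set: y = 3 * x, one "x,y" pair per line.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(n):
            x = i / n
            writer.writerow([x, 3.0 * x])


class Source:
    """Serves the data set and counts how many times the file is read."""

    def __init__(self, path, cached):
        self.path = path
        self.cached = cached
        self.disk_reads = 0
        self._data = None

    def load_points(self):
        if self.cached and self._data is not None:
            return self._data  # served from memory, no disk access
        self.disk_reads += 1
        with open(self.path, newline="") as f:
            data = [(float(x), float(y)) for x, y in csv.reader(f)]
        if self.cached:
            self._data = data
        return data


def fit_slope(source, passes=10, lr=0.5):
    # Toy gradient descent for y = w * x; every pass scans the full data set.
    w = 0.0
    for _ in range(passes):
        points = source.load_points()
        grad = sum((w * x - y) * x for x, y in points) / len(points)
        w -= lr * grad
    return w


def compare():
    # Returns (disk reads uncached, disk reads cached, slope, slope).
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_sample(path)
        uncached = Source(path, cached=False)
        cached = Source(path, cached=True)
        w_uncached = fit_slope(uncached)
        w_cached = fit_slope(cached)
        return uncached.disk_reads, cached.disk_reads, w_uncached, w_cached
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(compare())
```

Both runs compute the same slope; the only difference is that the uncached
source goes back to disk on every pass, while the cached one reads the file
once.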

J





Re: Spark v Redshift

2014-11-04 Thread Akshar Dave
There is no one-size-fits-all solution available in the market today. If
somebody tells you there is, they are simply lying :)

The two solutions cater to different sets of problems. My recommendation is
to focus on getting a better understanding of the problems you are trying to
solve with Spark and Redshift, and to pick the tool based on how effectively
it handles those problems. Like Matei said, both might be relevant in some
cases.

Thanks
Akshar







-- 
Akshar Dave
Principal – Big Data
SoftNet Solutions
Office: 408.542.0888 | Mobile: 408.896.1486
940 Hamlin Court, Sunnyvale, CA 94089
www.softnets.com/bigdata


Re: Spark v Redshift

2014-11-04 Thread agfung
Sounds like context would help; I just didn't want to subject people to a
wall of text if it wasn't necessary :)

Currently we use neither Spark SQL (nor anything else in the Hadoop stack) nor
Redshift.  We service templated queries from the appserver, i.e. the user fills
out some forms and dropdowns, and we translate the selections into a query.
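
A rough sketch of what such a form-to-query translation might look like
follows. The table and column names, the whitelists and build_query are
placeholders, not taken from any real system, and sqlite3 is used only so the
example runs on its own. The idea shown is that dropdown choices select from
known identifiers, while user-typed values travel as bound parameters.

```python
import sqlite3

# Hypothetical whitelists: dropdown choices may only map to these names.
AGGREGATES = {"avg": "AVG", "min": "MIN", "max": "MAX", "count": "COUNT"}
TABLE1_COLUMNS = {"field1", "field3"}
TABLE2_COLUMNS = {"field2"}


def pick(value, allowed, label):
    if value not in allowed:
        raise ValueError("unsupported %s: %r" % (label, value))
    return value


def build_query(form):
    """Translate form selections into (sql, params)."""
    agg = AGGREGATES[pick(form["aggregate"], AGGREGATES, "aggregate")]
    value_col = pick(form["value_column"], TABLE1_COLUMNS, "value column")
    group_col = pick(form["group_by"], TABLE1_COLUMNS, "group-by column")
    filter_col = pick(form["filter_column"], TABLE2_COLUMNS, "filter column")
    # "?" is sqlite3's placeholder style; other drivers use other markers.
    sql = (
        "SELECT t1.{group}, {agg}(t1.{value}) "
        "FROM table1 t1 JOIN table2 t2 ON t1.ref_id = t2.id "
        "WHERE t2.{flt} = ? "
        "GROUP BY t1.{group} ORDER BY t1.{group}"
    ).format(group=group_col, agg=agg, value=value_col, flt=filter_col)
    return sql, (form["filter_value"],)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table2 (id INTEGER PRIMARY KEY, field2 TEXT);
        CREATE TABLE table1 (ref_id INTEGER, field1 REAL, field3 TEXT);
        INSERT INTO table2 VALUES (1, 'x'), (2, 'y');
        INSERT INTO table1 VALUES (1, 10.0, 'a'), (1, 20.0, 'a'),
                                  (1, 5.0, 'b'), (2, 99.0, 'a');
    """)
    form = {
        "aggregate": "avg",
        "value_column": "field1",
        "group_by": "field3",
        "filter_column": "field2",
        "filter_value": "x",
    }
    sql, params = build_query(form)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

AVG stands in for the aggregate here because SQLite has no built-in median.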

Data is basically one table containing thousands of independent time
series, with one or two tables of reference data to join to.  A typical
query: the median value of Field1 from Table1 where Field2 from Table2
matches filter X, with Table1 and Table2 joined on a surrogate key, grouped
by a different Field3.  The data structure is a little bit dynamic.  Users
can upload any CSV, as long as they tell us the name and the programmatic
type of each column.  The target data size is about a billion records with
20-ish fields, distributed throughout a year (about 50GB on disk as
uncompressed CSV).
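
As a small-scale illustration of the data shape described above, the sketch
below parses header-less CSV text using user-declared column names and types,
joins it to a reference table on a surrogate key, filters on the reference
field, and takes the median per group. All names (load_csv, median_by_group,
the sample columns) are hypothetical, and plain Python stands in for
whichever engine ends up running the query.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Type names a user may declare for a column (an assumed, minimal set).
CASTS = {"int": int, "float": float, "str": str}


def load_csv(text, schema):
    """Parse header-less CSV text.

    schema is a list of (column_name, type_name) pairs in file order,
    as supplied by the user at upload time.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append({name: CASTS[type_name](value)
                     for (name, type_name), value in zip(schema, record)})
    return rows


def median_by_group(facts, refs, filter_value):
    # Index the reference table by its surrogate key.
    ref_by_id = {ref["id"]: ref for ref in refs}
    groups = defaultdict(list)
    for row in facts:
        ref = ref_by_id.get(row["ref_id"])
        # Keep only rows whose reference record matches the filter.
        if ref is not None and ref["field2"] == filter_value:
            groups[row["field3"]].append(row["field1"])
    return {key: median(values) for key, values in groups.items()}


def demo():
    table1 = load_csv(
        "1,10.0,a\n1,30.0,a\n1,20.0,a\n2,7.0,a\n1,4.0,b\n1,6.0,b\n",
        [("ref_id", "int"), ("field1", "float"), ("field3", "str")],
    )
    table2 = load_csv("1,x\n2,y\n", [("id", "int"), ("field2", "str")])
    return median_by_group(table1, table2, "x")


if __name__ == "__main__":
    print(demo())
```

Note that a median needs every value in a group before it can be computed, so
unlike a sum or count it cannot simply be combined from per-partition
partial results.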

So we're currently doing historical analytics (i.e. results cover only
yesterday's data or older, but we want to see them quickly).
We eventually intend to do realtime (or streaming) analytics (i.e. see
the impact of new data on analytics quickly).  Machine learning is also on
the roadmap.

One proposition is Spark SQL as a complete replacement for Redshift.  It
would simplify the architecture, since our long-term strategy is to handle
data intake and ETL on HDFS (regardless of Redshift or Spark SQL).  Which
other parts of the Hadoop family would come into play for ETL is
undetermined right now.  Spark SQL appears to have relational ability, and
if we're going to use the Hadoop stack for ML and streaming analytics anyway,
why not do it all on one stack and not shovel data around?  Also, lots of
people are talking about it.

The other proposition is Redshift as the historical analytics solution, and
something else (could be Spark, doesn't matter) for streaming analytics and
ML.  If we need to relate the two, we'll have an API or process to stitch
them together.  I've read about the lambda architecture, which more or less
describes this approach.  The motivation is threefold: Redshift has the AWS
reliability/scalability/operational concerns worked out; it has a richer
query language (SQL and pgsql functions are designed for slice-n-dice
analytics), so we can spend our coding time elsewhere; and it offers a
measure of safety against design issues and bugs.  Spark just came out of
incubator status this year, and it's much easier to find people on the web
raving positively about Redshift in real-world usage (i.e. as part of a
live, client-facing system) than about Spark.
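
The stitching step in this second proposition can be sketched roughly as
follows: a batch view precomputed from historical data and a realtime view
covering records that arrived after the last batch run are merged at query
time. The names build_batch_view, RealtimeView and merged_count are
illustrative only, and per-key counts are used because they merge by simple
addition; a median would need a different merge strategy.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Precompute per-key counts over everything up to the batch cutoff.

    Each event is a (key, timestamp) pair.
    """
    return Counter(key for key, _ts in historical_events)


class RealtimeView:
    """Incrementally updated counts for events newer than the cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def ingest(self, key, ts):
        # Events at or before the cutoff are already in the batch view.
        if ts > self.cutoff:
            self.counts[key] += 1


def merged_count(batch_view, realtime_view, key):
    # Query-time merge: historical result plus the recent delta.
    return batch_view[key] + realtime_view.counts[key]


def demo():
    batch = build_batch_view([("s1", 1), ("s1", 2), ("s2", 2)])
    recent = RealtimeView(cutoff=2)
    recent.ingest("s1", 3)
    recent.ingest("s2", 2)  # not newer than the cutoff, so ignored
    recent.ingest("s3", 4)
    return {key: merged_count(batch, recent, key)
            for key in ("s1", "s2", "s3")}


if __name__ == "__main__":
    print(demo())
```

The cutoff check is what keeps an event from being counted in both views when
the two pipelines overlap.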

category_theory's observation that most of the speed comes from fitting in
memory is helpful.  It's what I would have surmised from the AMPLab Big Data
benchmark, but confirmation from the hands-on community is invaluable, thank
you.

I understand a lot of it simply comes down to what-do-you-value-more
weightings, and we'll do prototypes/benchmarks if we have to; I just wasn't
sure if there were any other key assumptions/requirements/gotchas to
consider.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112p18127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
