Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas

 1) We get tooling out of the box from RedShift (specifically, stable JDBC
 access) - Spark we often are waiting for devops to get the right combo of
 tools working or for libraries to support sequence files.


The arguments about JDBC access and simpler setup definitely make sense. My
first non-trivial Spark application was actually an ETL process that sliced
and diced JSON + tabular data and then loaded it into Redshift. From there
on you got all the benefits of your average C-store database, plus the
added benefit of Amazon managing many annoying setup and admin details for
your Redshift cluster.

One area I'm looking forward to seeing Spark SQL excel at is offering fast
JDBC access to raw data--i.e. directly against S3 / HDFS; no ETL
required. For easy and flexible data exploration, I don't think you can
beat that with a C-store that you have to ETL stuff into.

2) There is a belief that for many of our queries (assumed to often be
 joins) a columnar database will perform orders of magnitude better.


This is definitely an "it depends" statement, but there is a detailed
benchmark at https://amplab.cs.berkeley.edu/benchmark/ comparing Shark,
Redshift, and other systems. Have you seen it? Redshift does very well, but
Shark is on par or better than it in most of the tests. Of course, going
forward we'll want to see Spark SQL match this kind of performance, and
that remains to be seen.

Nick



On Wed, Aug 6, 2014 at 12:06 PM, Gary Malouf malouf.g...@gmail.com wrote:

 My company is leaning towards moving much of their analytics work from our
 own Spark/Mesos/HDFS/Cassandra setup to RedShift.  To date, I have been
 the internal advocate for using Spark for analytics, but a number of good
 points have been brought up to me.  The reasons being pushed are:

 - RedShift exposes a JDBC interface out of the box (no devops work there)
 and data looks and feels like it is in a normal SQL database.  They want
 this out of the box from Spark, without trying to figure out which version
 matches which version of Hive/Shark/SparkSQL etc.  Yes, the next release
 theoretically supports this, but there have been release issues our team has
 battled to date that erode the trust.

 - Complaints around challenges we have faced running a spark shell locally
 against a cluster in EC2.  It is partly a devops issue of deploying the
 correct configurations to local machines, being able to kick off a user who
 is hogging RAM, etc.

 - "I want to be able to run queries from my python shell against your
 sequence file data, roll it up and in the same shell leverage python graph
 tools."  I'm not very familiar with the Python setup, but I believe this
 could be done by running locally AND somehow adding custom libraries to be
 accessed from PySpark.

 - Joins will perform much better (in RedShift) because it says it sorts
 its keys.  We cannot pre-compute all joins away.
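
To make the sorted-keys argument in the last bullet concrete: when both
inputs are already sorted on the join key, an inner join can be done in one
merge pass, with no hash table and no re-sort at query time. The sketch
below is a generic illustration in plain Python, not Redshift's or Spark's
actual implementation; the function name `merge_join` and the sample rows
are hypothetical.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[int, str, str]]:
    """Inner-join two lists of (key, value) rows that are ALREADY sorted by key.

    Assumption: both inputs are sorted ascending on the key. If they are not,
    the result is silently incomplete; this sketch does not check for that.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    yield (lkey, left[i][1], right[k][1])
                i += 1
            j = j_end


# Hypothetical sample data, both sorted by key.
users = [(1, "ann"), (2, "bob"), (4, "dee")]
events = [(1, "login"), (1, "click"), (3, "login"), (4, "click")]
joined = list(merge_join(users, events))
```

When the data is not stored sorted, an engine has to sort or hash it first,
which is where the "it depends" in the performance comparison comes from.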


 Basically, their argument is two-fold:

 1) We get tooling out of the box from RedShift (specifically, stable JDBC
 access) - Spark we often are waiting for devops to get the right combo of
 tools working or for libraries to support sequence files.

 2) There is a belief that for many of our queries (assumed to often be
 joins) a columnar database will perform orders of magnitude better.



 Anyway, a test is being set up to compare the two on the performance side,
 but from a tools perspective it's hard to counter the issues that are
 brought up.



RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD 
machines instead of SSD, and it is still faster than Shark in all but one case.

Like Gary, I'm also interested in replacing something we have on Redshift with 
Spark SQL, as it will give me much greater capability to process things. I'm 
willing to sacrifice some performance for the greater capability. But it would 
be nice to see the benchmark updated with Spark SQL, and with a more 
competitive configuration of Redshift.

Best regards, and keep up the great work!

Ron




Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Forgot to cc the mailing list :)


On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

 Agreed. Being able to use SQL to make a table, pass it to a graph
 algorithm, pass that output to a machine learning algorithm, being able to
 invoke user defined python functions, … are capabilities that far exceed
 what we can do with Redshift. The total performance will be much better,
 and the programmer productivity will be much better, even if the SQL
 portion is not quite as fast.  Mostly I was just objecting to "Redshift
 does very well, but Shark is on par or better than it in most of the tests"
 when that was not how I read the results, and Redshift was on HDDs.



 BTW – What are you doing w/ Spark? We have a lot of text and other content
 that we want to mine, and are shifting onto Spark so we have the greater
 capabilities mentioned above.

 Best regards,

 Ron Daniel, Jr.
 Director, Elsevier Labs
 r.dan...@elsevier.com
 mobile: +1 619 208 3064

 *From:* Gary Malouf [mailto:malouf.g...@gmail.com]
 *Sent:* Wednesday, August 06, 2014 12:35 PM
 *To:* Daniel, Ronald (ELS-SDG)
 *Subject:* Re: Regarding tooling/performance vs RedShift

 Hi Ronald,

 In my opinion, the performance just has to be 'close' to make that piece
 irrelevant.  I think the real issue comes down to tooling and the ease of
 connecting their various python tools from the office to results coming out
 of Spark/other solution in 'the cloud'.



Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:

 Mostly I was just objecting to "Redshift does very well, but Shark is on
 par or better than it in most of the tests" when that was not how I read
 the results, and Redshift was on HDDs.


My bad. You are correct; the only test Shark (mem) does better on is test
#1 Scan Query.

And indeed, it would be good to see an updated benchmark with Redshift
running on SSDs.

Nick


Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Also, regarding something like Redshift not having MLlib built in, much of
that could be done on the derived results.



RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Well yes, MLlib-like routines or pretty much anything else could be run on the 
derived results, but you have to unload the results from Redshift and then load 
them into some other tool. So it's nicer to leave them in memory and operate on 
them there. Major architectural advantage to Spark.

Ron



Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 4:30 PM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

 Major architectural advantage to Spark.


Amen to that. For a really cool and succinct demonstration of this, check
out Aaron's demo http://youtu.be/sPhyePwo7FA?t=10m16s at the Hadoop
Summit earlier this year where he combines SQL, machine learning, and stream
processing using Spark. I don't think you can do this with any other
platform.

Nick