Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
The honest answer is that it is unclear to me at this point. I guess what
I am really wondering is whether there are cases where one would find it
beneficial to use Spark against one or more RDBs?

On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Gary,

 On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

 I'm considering whether or not it is worth introducing Spark at my new
 company.  The data is nowhere near Hadoop size at this point (it sits in
 an RDS Postgres cluster).


 Will it ever become Hadoop size? Looking at the overhead of running even
 a simple Hadoop setup (securely and with good performance, given about 1e6
 configuration parameters), I think it makes sense to stay in non-Hadoop
 mode as long as possible. People may disagree ;-)

 Tobias

 PS. You may also want to have a look at
 http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html




Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size
of the data does not automatically make that decision in your mind?

Thanks,

Gary

On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi

 On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

 The honest answer is that it is unclear to me at this point. I guess
 what I am really wondering is whether there are cases where one would
 find it beneficial to use Spark against one or more RDBs?


 Well, RDBs are all about *storage*, while Spark is about *computation*. If
 you have a very expensive computation (one that can be parallelized in
 some way), then you might want to use Spark even though your data lives in
 an ordinary RDB. Think of raytracing: you do something for every pixel in
 the output image, so you could read your scene description from a
 database and write the results back to one, but use Spark to run the
 (say) two minutes of calculation per pixel in parallel.

 Tobias



Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi

On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:

 The honest answer is that it is unclear to me at this point. I guess what
 I am really wondering is whether there are cases where one would find it
 beneficial to use Spark against one or more RDBs?


Well, RDBs are all about *storage*, while Spark is about *computation*. If
you have a very expensive computation (one that can be parallelized in some
way), then you might want to use Spark even though your data lives in an
ordinary RDB. Think of raytracing: you do something for every pixel in the
output image, so you could read your scene description from a database and
write the results back to one, but use Spark to run the (say) two minutes
of calculation per pixel in parallel.

Tobias

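[Editor's note: the pattern described above — rows read from a database, an
expensive function mapped over them in parallel, results written back — can be
sketched with nothing but the Python standard library. In actual Spark the read
would go through `spark.read.jdbc` and the map through an RDD or DataFrame
transformation; everything below is a hypothetical toy, not code from this
thread.]

```python
# Toy sketch of "expensive computation over rows from an RDB", with the
# standard library's process pool standing in for a Spark cluster. In a
# real setup the (x, y) rows would come from Postgres (e.g. via
# spark.read.jdbc) and the results would be written back; here both ends
# of the pipeline are faked.
from concurrent.futures import ProcessPoolExecutor


def expensive_render(pixel):
    # Stand-in for a costly per-pixel computation such as raytracing.
    x, y = pixel
    return (x, y, (x * x + y * y) % 256)  # fake "color" value


def render_scene(width, height):
    # Pretend these (x, y) rows were SELECTed from a scene table.
    pixels = [(x, y) for x in range(width) for y in range(height)]
    with ProcessPoolExecutor() as pool:
        # pool.map plays the role of rdd.map: same function, many workers.
        return list(pool.map(expensive_render, pixels))
```

The point is only the shape of the pipeline: the database stays the system of
record, while the parallel engine — a process pool here, a Spark cluster in the
scenario above — does the CPU-heavy middle step.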

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary,

On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote:

 I'm considering whether or not it is worth introducing Spark at my new
 company.  The data is nowhere near Hadoop size at this point (it sits in
 an RDS Postgres cluster).


Will it ever become Hadoop size? Looking at the overhead of running even
a simple Hadoop setup (securely and with good performance, given about 1e6
configuration parameters), I think it makes sense to stay in non-Hadoop
mode as long as possible. People may disagree ;-)

Tobias

PS. You may also want to have a look at
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
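
[Editor's note: the linked article's thesis, roughly, is that for data fitting
on one machine a plain Unix pipeline avoids all cluster overhead. A made-up
miniature of that idea — the file name and format below are invented, not taken
from the article:]

```shell
# Made-up miniature of the article's point: count results per player with
# plain Unix tools instead of a Hadoop job. The data file is invented.
printf 'alice,win\nbob,loss\nalice,win\ncarol,draw\n' > /tmp/games.csv

# Winners, most wins first: filter, project, aggregate -- a whole
# "MapReduce job" in one pipeline.
grep ',win$' /tmp/games.csv | cut -d, -f1 | sort | uniq -c | sort -rn
```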


Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote:

 So when deciding whether to take on installing/configuring Spark, the size
 of the data does not automatically make that decision in your mind?


You got me there ;-)

Tobias