Re: Dealing with 'smaller' data
The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs?

On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer t...@preferred.jp wrote:
> Gary,
>
> On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote:
>> I'm considering whether or not it is worth introducing Spark at my new
>> company. The data is nowhere near Hadoop size at this point (it sits in
>> an RDS Postgres cluster).
>
> Will it ever become Hadoop size? Looking at the overhead of running even a
> simple Hadoop setup (securely and with good performance, given about 1e6
> configuration parameters), I think it makes sense to stay in non-Hadoop
> mode as long as possible. People may disagree ;-)
>
> Tobias
>
> PS. You may also want to have a look at
> http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Re: Dealing with 'smaller' data
So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind.

Thanks,
Gary

On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer t...@preferred.jp wrote:
> Hi,
>
> On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
>> The honest answer is that it is unclear to me at this point. I guess what
>> I am really wondering is if there are cases where one would find it
>> beneficial to use Spark against one or more RDBs?
>
> Well, RDBs are all about *storage*, while Spark is about *computation*.
> If you have a very expensive computation (one that can be parallelized in
> some way), then you might want to use Spark, even though your data lives
> in an ordinary RDB. Think raytracing: you do something for every pixel in
> the output image; you could get your scene description from a database and
> write the result to a database, but use Spark to do two minutes of
> calculation for every pixel in parallel (or so).
>
> Tobias
Re: Dealing with 'smaller' data
Hi,

On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
> The honest answer is that it is unclear to me at this point. I guess what
> I am really wondering is if there are cases where one would find it
> beneficial to use Spark against one or more RDBs?

Well, RDBs are all about *storage*, while Spark is about *computation*. If you have a very expensive computation (one that can be parallelized in some way), then you might want to use Spark, even though your data lives in an ordinary RDB. Think raytracing: you do something for every pixel in the output image; you could get your scene description from a database and write the result to a database, but use Spark to do two minutes of calculation for every pixel in parallel (or so).

Tobias
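The round trip Tobias describes — pull inputs from a database, run an expensive function on each record in parallel, write results back — can be sketched with PySpark. This is only a sketch under stated assumptions: the JDBC URL, table, and column names below are hypothetical, `run_spark_job()` assumes a machine with Spark and the Postgres JDBC driver available, and `trace_pixel` is toy math standing in for a real raytracer.

```python
import math

def trace_pixel(x, y):
    """Stand-in for an expensive per-pixel computation (toy trigonometric
    series, not a real raytracer). Plain Python, runs anywhere."""
    acc = 0.0
    for i in range(1, 1000):
        acc += math.sin(x * i) * math.cos(y * i) / i
    return acc

def run_spark_job():
    """Sketch of the DB -> Spark -> DB pattern. Assumes a Spark
    installation and a reachable Postgres; URL, tables, and columns
    are made up for illustration."""
    # Imported lazily so the per-pixel function above stays usable
    # on machines without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raytrace-sketch").getOrCreate()

    # Read the scene description from an ordinary RDB over JDBC.
    scene = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host/scenes")  # hypothetical
             .option("dbtable", "pixels")                        # hypothetical
             .load())

    # Run the expensive computation for every pixel, in parallel
    # across the cluster.
    results = scene.rdd.map(
        lambda row: (row["x"], row["y"], trace_pixel(row["x"], row["y"])))

    # Write the rendered values back to the database.
    (spark.createDataFrame(results, ["x", "y", "value"])
          .write.format("jdbc")
          .option("url", "jdbc:postgresql://db-host/scenes")
          .option("dbtable", "rendered_pixels")                  # hypothetical
          .mode("append")
          .save())

# trace_pixel() works standalone; call run_spark_job() only where
# Spark and the JDBC driver are available.
print(trace_pixel(0.5, 0.25))
```

The point of the split is that the computation itself is ordinary code; Spark only supplies the parallelism and the plumbing around it, which is exactly why it can be worth using even when the data fits comfortably in Postgres.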
Re: Dealing with 'smaller' data
Gary,

On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote:
> I'm considering whether or not it is worth introducing Spark at my new
> company. The data is nowhere near Hadoop size at this point (it sits in
> an RDS Postgres cluster).

Will it ever become Hadoop size? Looking at the overhead of running even a simple Hadoop setup (securely and with good performance, given about 1e6 configuration parameters), I think it makes sense to stay in non-Hadoop mode as long as possible. People may disagree ;-)

Tobias

PS. You may also want to have a look at
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
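The linked article's point — that for data in the gigabytes, a plain Unix pipeline on one machine can beat a Hadoop cluster — can be illustrated with a toy version of its aggregation pattern. The sample records here are made up (the article counts chess game outcomes the same way):

```shell
# Aggregate a flat file with nothing but standard Unix tools.
# Five made-up records stand in for the article's chess results.
printf 'win\nloss\nwin\ndraw\nwin\n' > /tmp/results.txt

# Count each outcome, most frequent first.
sort /tmp/results.txt | uniq -c | sort -rn
```

Because `sort | uniq -c` streams through the data with no cluster startup, scheduling, or serialization overhead, a pipeline like this is often faster end to end than a Hadoop job until the data genuinely stops fitting on one machine.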
Re: Dealing with 'smaller' data
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote:
> So when deciding whether to take on installing/configuring Spark, the
> size of the data does not automatically make that decision in your mind.

You got me there ;-)

Tobias