I don't have those numbers off-hand, though the shuffle spill to disk was coming to several gigabytes per node, if I recall correctly.
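For what it's worth, the two-part restructuring we ended up with (persist the raw profiles to disk, then read them back in for scoring) has a simple generic shape. This is just a toy sketch in plain Python to illustrate the staging idea - the function names, the file format, and the "scoring" step are all made up for illustration, not our actual job:

```python
import json
import os
import tempfile

# Toy sketch of a two-stage flow: stage 1 aggregates raw events into
# profiles and persists them; stage 2 reloads the saved profiles and
# scores them. A plain JSON file stands in for whatever cluster storage
# a real job would use. All names here are hypothetical.

def build_profiles(events):
    """Stage 1: aggregate raw (user, value) events into per-user totals."""
    profiles = {}
    for user, value in events:
        profiles[user] = profiles.get(user, 0) + value
    return profiles

def score(profiles):
    """Stage 2: toy 'scoring' of each saved profile (here: double it)."""
    return {user: total * 2 for user, total in profiles.items()}

events = [("alice", 1), ("alice", 2), ("bob", 5)]
path = os.path.join(tempfile.mkdtemp(), "profiles.json")

# Stage 1: save the raw profiles out to disk.
with open(path, "w") as f:
    json.dump(build_profiles(events), f)

# Stage 2: read them back in for scoring.
with open(path) as f:
    print(score(json.load(f)))  # {'alice': 6, 'bob': 10}
```

The point of the split is just that each stage starts from materialized data, so a failure in scoring doesn't force recomputing the profiles.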
The MapReduce pipeline takes about 2-3 hours, I think, for the full 60-day data set. Spark chugs along fine for a while and then hangs. We restructured the flow a few times, but in the last iteration it was hanging when trying to save the feature profiles, with just a couple of tasks remaining (those tasks ran for 10+ hours before we killed them). In a previous iteration we did get it to run through. We broke our flow into two parts though - first saving the raw profiles out to disk, then reading them back in for scoring. That was on just 10 days of data, by the way - one sixth of what the MapReduce flow normally runs through on the same cluster. I haven't tracked down the cause. YMMV.

On Mon, Jul 7, 2014 at 8:14 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:

> Daniel,
>
> Do you mind sharing the size of your cluster and the production data
> volumes?
>
> Thanks
> Soumya
>
> On Jul 7, 2014, at 3:39 PM, Daniel Siegmann <daniel.siegm...@velos.io> wrote:
>
> From a development perspective, I vastly prefer Spark to MapReduce. The
> MapReduce API is very constrained; Spark's API feels much more natural to
> me. Testing and local development are also very easy: creating a local
> Spark context is trivial, and it reads local files. For your unit tests you
> can just have them create a local context and execute your flow with some
> test data. Even better, you can do ad hoc work in the Spark shell, and if
> you want that in your production code it will look exactly the same.
>
> Unfortunately, the picture isn't so rosy when it gets to production. In my
> experience, Spark simply doesn't scale to the volumes that MapReduce will
> handle. Not with a standalone cluster, anyway - maybe Mesos or YARN would be
> better, but I haven't had the opportunity to try them. I find jobs tend to
> just hang forever for no apparent reason on large data sets (but smaller
> than what I push through MapReduce).
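Incidentally, on testing flows with local data: if you factor the transformation logic into a pure function, the unit tests don't need a cluster at all. A rough sketch - the names and data are made up for illustration; in a real job you could apply the same function via something like sc.parallelize(data).mapPartitions(...) against a local-mode context:

```python
# Sketch: keep the flow's logic in a pure function over an iterable, so
# unit tests can run it on plain lists with no cluster. Names here are
# hypothetical. In an actual Spark job the same generator could be
# applied with e.g. sc.parallelize(data).mapPartitions(double_scores)
# against a SparkContext("local"); that part is omitted so this sketch
# runs anywhere.

def double_scores(records):
    """Toy transformation: (user, score) pairs -> (user, score * 2)."""
    for user, score in records:
        yield user, score * 2

# "Unit test": execute the flow on some local test data.
test_data = [("alice", 1), ("bob", 3)]
assert list(double_scores(test_data)) == [("alice", 2), ("bob", 6)]
print("flow ok")
```

The same function then works unchanged whether it is driven by a test list, the Spark shell, or the production job.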
> I am hopeful the situation will improve - Spark is developing quickly -
> but if you have large amounts of data you should proceed with caution.
>
> Keep in mind there are some frameworks for Hadoop which can hide the ugly
> MapReduce behind something very similar in form to Spark's API, e.g.
> Apache Crunch. So you might consider those as well.
>
> (Note: the above is with Spark 1.0.0.)
>
> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com> wrote:
>
>> Hello Experts,
>>
>> I am doing some comparative study on the below:
>>
>> Spark vs Impala
>> Spark vs MapReduce. Is it worth migrating from existing MR
>> implementation to Spark?
>>
>> Please share your thoughts and expertise.
>>
>> Thanks,
>> Santosh

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io