I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time, until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs.
I don't know *why* that happened. Maybe it isn't the overall amount of data, but something I'm doing wrong with my flow. In any case, improvements to diagnostic info would probably be helpful. I look forward to the next release. :-)

On Tue, Jul 8, 2014 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote:

> Not sure exactly what is happening but perhaps there are ways to
> restructure your program for it to work better. Spark is definitely able to
> handle much, much larger workloads.
>
> I've personally run a workload that shuffled 300 TB of data. I've also run
> something that shuffled 5 TB/node and stuffed my disks so full that the
> file system was close to breaking.
>
> We can definitely do a better job in Spark to make it output more
> meaningful diagnostics and be more robust with partitions of data that
> don't fit in memory though. A lot of the work in the next few releases
> will be on that.
>
> On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>
>> I'll respond for Dan.
>>
>> Our test dataset was a total of 10 GB of input data (full production
>> dataset for this particular dataflow would be 60 GB roughly).
>>
>> I'm not sure what the size of the final output data was but I think it
>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>> say that when we were experimenting with persist(DISK_ONLY), the size of
>> all RDDs on disk was around 200 GB, which gives a sense of overall
>> transient memory usage with no persistence.
>>
>> In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
>> 2 workers each. Each executor got 14 GB of memory.
>>
>> -Suren
>>
>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>> When you say "large data sets", how large?
>>> Thanks
>>>
>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>
>>> From a development perspective, I vastly prefer Spark to MapReduce.
>>> The MapReduce API is very constrained; Spark's API feels much more natural
>>> to me. Testing and local development are also very easy - creating a local
>>> Spark context is trivial and it reads local files. For your unit tests you
>>> can just have them create a local context and execute your flow with some
>>> test data. Even better, you can do ad-hoc work in the Spark shell and if
>>> you want that in your production code it will look exactly the same.
>>>
>>> Unfortunately, the picture isn't so rosy when it gets to production.
>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce
>>> will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
>>> would be better, but I haven't had the opportunity to try them. I find jobs
>>> tend to just hang forever for no apparent reason on large data sets (but
>>> smaller than what I push through MapReduce).
>>>
>>> I am hopeful the situation will improve - Spark is developing quickly
>>> - but if you have large amounts of data you should proceed with caution.
>>>
>>> Keep in mind there are some frameworks for Hadoop which can hide the
>>> ugly MapReduce with something very similar in form to Spark's API; e.g.
>>> Apache Crunch. So you might consider those as well.
>>>
>>> (Note: the above is with Spark 1.0.0.)
>>>
>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>> wrote:
>>>
>>>> Hello Experts,
>>>>
>>>> I am doing some comparative study on the below:
>>>>
>>>> Spark vs Impala
>>>>
>>>> Spark vs MapReduce. Is it worth migrating from existing MR
>>>> implementation to Spark?
>>>>
>>>> Please share your thoughts and expertise.
>>>>
>>>> Thanks,
>>>> Santosh
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegm...@velos.io W: www.velos.io
>>
>> --
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io <suren.hira...@sociocast.com>
>> W: www.velos.io

--
Daniel Siegmann, Software Developer
Velos