How wide are the rows of data, either the raw input data or any generated intermediate data?
We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks.

-Suren

On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:

> Nothing particularly custom. We've tested with small (4 node) development
> clusters, single-node pseudoclusters, and bigger, using plain-vanilla
> Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local,
> Spark Yarn (client and cluster) modes, with total memory resources ranging
> from 4GB to 256GB+.
>
> K
>
> On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
>
> To clarify, we are not persisting to disk. That was just one of the
> experiments we did because of some issues we had along the way.
>
> At this time, we are NOT using persist but cannot get the flow to
> complete in Standalone Cluster mode. We do not have a YARN-capable
> cluster at this time.
>
> We agree with what you're saying. Your results are what we were hoping
> for and expecting. :-) Unfortunately, we still haven't gotten the flow to
> run end to end on this relatively small dataset.
>
> It must be something related to our cluster, standalone mode, or our
> flow, but as far as we can tell, we are not doing anything unusual.
>
> Did you do any custom configuration? Any advice would be appreciated.
>
> -Suren
>
> On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey <kevin.mar...@oracle.com>
> wrote:
>
>> It seems to me that you're not taking full advantage of the lazy
>> evaluation, especially persisting to disk only. While it might be true
>> that the cumulative size of the RDDs looks like it's 300GB, only a small
>> portion of that should be resident at any one time. We've evaluated data
>> sets much greater than 10GB in Spark using the Spark master and Spark
>> with Yarn (cluster -- formerly standalone -- mode). The nice thing about
>> using Yarn is that it reports the actual memory *demand*, not just the
>> memory requested for driver and workers.
>> Processing a 60GB data set through thousands of stages in a rather
>> complex set of analytics and transformations consumed a total cluster
>> resource (divided among all workers and the driver) of only 9GB. We were
>> somewhat startled at first by this result, thinking that it would be much
>> greater, but realized that it is a consequence of Spark's lazy evaluation
>> model. This is even with several intermediate computations being cached
>> as input to multiple evaluation paths.
>>
>> Good luck.
>>
>> Kevin
>>
>> On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
>>
>> I'll respond for Dan.
>>
>> Our test dataset was a total of 10 GB of input data (the full production
>> dataset for this particular dataflow would be roughly 60 GB).
>>
>> I'm not sure what the size of the final output data was, but I think it
>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>> say that when we were experimenting with persist(DISK_ONLY), the size of
>> all RDDs on disk was around 200 GB, which gives a sense of overall
>> transient memory usage with no persistence.
>>
>> In terms of our test cluster, we had 15 nodes. Each node had 24 cores
>> and ran 2 workers. Each executor got 14 GB of memory.
>>
>> -Suren
>>
>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>> When you say "large data sets", how large?
>>> Thanks
>>>
>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>
>>> From a development perspective, I vastly prefer Spark to MapReduce.
>>> The MapReduce API is very constrained; Spark's API feels much more
>>> natural to me. Testing and local development is also very easy --
>>> creating a local Spark context is trivial and it reads local files. For
>>> your unit tests you can just have them create a local context and
>>> execute your flow with some test data. Even better, you can do ad-hoc
>>> work in the Spark shell, and if you want that in your production code
>>> it will look exactly the same.
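[The local-context testing pattern Daniel describes can be sketched as below, using the Spark 1.0-era Scala API. The flow under test (`wordCount`) and the object names are hypothetical stand-ins for your own logic; the pattern is simply that the same function runs unchanged in a test, in the shell, and in production.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCountFlow {
  // Production logic: takes an RDD, returns an RDD; knows nothing about
  // where the SparkContext came from.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
}

object WordCountFlowTest {
  def main(args: Array[String]): Unit = {
    // "local" master runs everything in-process; no cluster required.
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
    try {
      val result = WordCountFlow.wordCount(sc.parallelize(Seq("a b a"))).collectAsMap()
      assert(result("a") == 2 && result("b") == 1)
    } finally {
      sc.stop()
    }
  }
}
```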
>>> Unfortunately, the picture isn't so rosy when it gets to production.
>>> In my experience, Spark simply doesn't scale to the volumes that
>>> MapReduce will handle. Not with a Standalone cluster, anyway -- maybe
>>> Mesos or YARN would be better, but I haven't had the opportunity to try
>>> them. I find jobs tend to just hang forever for no apparent reason on
>>> large data sets (but smaller than what I push through MapReduce).
>>>
>>> I am hopeful the situation will improve -- Spark is developing quickly
>>> -- but if you have large amounts of data, you should proceed with
>>> caution.
>>>
>>> Keep in mind there are some frameworks for Hadoop which can hide the
>>> ugly MapReduce behind something very similar in form to Spark's API,
>>> e.g. Apache Crunch. So you might consider those as well.
>>>
>>> (Note: the above is with Spark 1.0.0.)
>>>
>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>> wrote:
>>>
>>>> Hello Experts,
>>>>
>>>> I am doing some comparative study on the below:
>>>>
>>>> Spark vs Impala
>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>> implementation to Spark?
>>>>
>>>> Please share your thoughts and expertise.
>>>>
>>>> Thanks,
>>>> Santosh
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegm...@velos.io  W: www.velos.io

--

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
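[For readers following the persist discussion above: a minimal sketch of the lazy-evaluation behavior Kevin describes, and of the `persist(DISK_ONLY)` experiment Suren mentions, using the Spark 1.0-era Scala API. The input path and RDD names are illustrative only.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("persist-sketch"))

    // Transformations are lazy: no partition is computed, and nothing is
    // resident in memory, until an action (count, collect, saveAsTextFile,
    // ...) forces evaluation. This is why cumulative RDD size can far
    // exceed the memory actually demanded at any one moment.
    val raw      = sc.textFile("input.txt")      // illustrative path
    val parsed   = raw.map(_.split(","))
    val filtered = parsed.filter(_.length > 1)

    // Persisting is only worthwhile when an intermediate RDD feeds multiple
    // downstream paths. DISK_ONLY materializes every partition to disk --
    // which is what made the ~200 GB of on-disk RDDs visible in the
    // experiment described earlier in the thread.
    filtered.persist(StorageLevel.DISK_ONLY)

    val countA = filtered.count()                       // first action triggers the work
    val countB = filtered.filter(_(0).nonEmpty).count() // reuses the persisted partitions

    filtered.unpersist()
    sc.stop()
  }
}
```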