It seems to me that you're not taking full advantage of lazy
evaluation, especially if you persist to disk only. While it may be
true that the cumulative size of the RDDs looks like 300GB, only a
small portion of that should be resident at any one time. We've
evaluated data sets much greater than 10GB in Spark using the Spark
master and Spark with Yarn (cluster -- formerly standalone -- mode). A
nice thing about using Yarn is that it reports the actual memory
demand, not just the memory requested for the driver and workers.
Processing a 60GB data set through thousands of stages in a rather
complex set of analytics and transformations consumed a total cluster
resource (divided among all workers and the driver) of only 9GB. We
were somewhat startled by this result at first, thinking that it would
be much greater, but realized that it is a consequence of Spark's lazy
evaluation model. This holds even with several intermediate
computations being cached as input to multiple evaluation paths.

Good luck.

Kevin

On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
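The point about lazy evaluation keeping the resident set small can be sketched outside Spark as well. The following plain-Python generator pipeline is an analogy, not Spark code (the function names are made up for illustration): each stage yields one record at a time, so peak memory stays far below the cumulative size of the intermediate data, much as Spark streams partitions through narrow transformations instead of materializing every RDD at once.

```python
def records(n):
    # Lazily yield n synthetic records; nothing is materialized up front.
    for i in range(n):
        yield {"id": i, "value": i * 2}

def transform(rows):
    # An element-wise transformation, analogous to RDD.map:
    # it consumes one record and yields one record.
    for row in rows:
        yield row["value"] + 1

def pipeline_total(n):
    # Only one record is resident at a time; the large cumulative
    # "size" of the intermediate stages never exists all at once.
    return sum(transform(records(n)))

print(pipeline_total(1_000_000))
```

Caching (Spark's `persist`/`cache`) is the deliberate exception: it materializes one chosen intermediate result so multiple downstream paths can reuse it, which is why even a cached pipeline's footprint reflects only the persisted stages, not every transformation.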