I would argue that memory in clusters is still a limited resource, and it's still beneficial to use it as economically as possible. Say you are training a gradient-boosted model in Spark, which could conceivably take several hours to build hundreds to thousands of trees. You do not want to occupy so much of the cluster's memory that nobody else can run anything of significance.
We have a dataset that's only ~10 GB as CSV in the file system. Once we cached the whole thing in Spark, it ballooned to 64 GB or so in memory, so we had to use a lot more workers with memory just to cache it all. This was because, although all the features were byte-sized, MLlib defaults to Double.

On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> I don't think the YARN default of max 8GB container size is a good
> justification for limiting memory per worker. This is a sort of arbitrary
> number that came from an era where MapReduce was the main YARN application
> and machines generally had less memory. I expect this to get configured
> much higher in practice on most clusters running Spark.
>
> YARN integration is actually complete in CDH5.0. We support it as well as
> standalone mode.
>
> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > Debasish,
>> >
>> > Unfortunately, we are bound to YARN, at least for the time being,
>> > because that's what most of our customers would be using (unless all
>> > the Hadoop vendors start supporting standalone Spark - I think
>> > Cloudera might do that?).
>>
>> Yes, the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>> YARN integration is still being worked on.
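As a rough back-of-envelope sketch of that kind of blow-up: a feature that takes one or two characters in a CSV becomes an 8-byte Java double once it's held in an MLlib vector, before counting any JVM object or Spark block overhead. The dataset size and per-value on-disk width below are illustrative assumptions, not numbers from this thread:

```python
# Illustrative estimate of cached size when byte-valued CSV features
# are stored as 8-byte doubles. All inputs here are assumed, not measured.

csv_bytes = 10 * 1024**3           # ~10 GB of CSV on disk (assumed)
bytes_per_value_on_disk = 2        # e.g. one digit plus a delimiter (assumed)
n_values = csv_bytes // bytes_per_value_on_disk

double_size = 8                    # MLlib vectors hold Java doubles
cached_gb = n_values * double_size / 1024**3
print(cached_gb)                   # 40.0 GB before any JVM/Spark overhead
```

Even under these simple assumptions the in-memory footprint is several times the on-disk size, so landing at ~64 GB once JVM object headers and caching overhead are added is not surprising.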