I would argue that memory in clusters is still a limited resource, and it's still beneficial to use it as economically as possible. Say you are training a gradient-boosted model in Spark, which could conceivably take several hours to build hundreds to thousands of trees. You do not want to occupy so much of the cluster's memory that nobody else can run anything of significance.
We have a dataset that's only ~10 GB as CSV in the file system. Once we cached the whole thing in Spark, it ballooned to 64 GB or so in memory, so we had to use a lot more workers with memory just to cache it all. This was because, although all the features were byte-sized, MLlib defaults to Double.

On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> I don't think the YARN default of max 8GB container size is a good
> justification for limiting memory per worker. This is a sort of arbitrary
> number that came from an era where MapReduce was the main YARN application
> and machines generally had less memory. I expect this to get configured
> much higher in practice on most clusters running Spark.
>
> YARN integration is actually complete in CDH5.0. We support it as well as
> standalone mode.
>
> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>> <coded...@cs.stanford.edu> wrote:
>> > Debasish,
>> >
>> > Unfortunately, we are bound to YARN, at least for the time being,
>> > because that's what most of our customers would be using (unless all
>> > the Hadoop vendors start supporting standalone Spark - I think
>> > Cloudera might do that?).
>>
>> Yes, the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>> YARN integration is still being worked on.
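As a rough back-of-envelope sketch of that kind of blow-up: a feature that takes one or two characters in a CSV becomes an 8-byte Java double once it's held in an MLlib vector, before counting any JVM object or Spark block overhead. The dataset size and per-value on-disk width below are illustrative assumptions, not numbers from this thread:

```python
# Illustrative estimate of cached size when byte-valued CSV features
# are stored as 8-byte doubles. All inputs here are assumed, not measured.

csv_bytes = 10 * 1024**3           # ~10 GB of CSV on disk (assumed)
bytes_per_value_on_disk = 2        # e.g. one digit plus a delimiter (assumed)
n_values = csv_bytes // bytes_per_value_on_disk

double_size = 8                    # MLlib vectors hold Java doubles
cached_gb = n_values * double_size / 1024**3
print(cached_gb)                   # 40.0 GB before any JVM/Spark overhead
```

Even under these simple assumptions the in-memory footprint is several times the on-disk size, so landing at ~64 GB once JVM object headers and caching overhead are added is not surprising.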