Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto de Vicente
No. I am running Spark on YARN on a 3-node testing cluster. My guess is that, given the number of splits produced by a hundred trees of depth 30 (which should be more than 100 * 2^30), either the executors or the driver die from OOM while trying to store all the split metadata. I guess that the same
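[Editor's note: a rough worst-case arithmetic sketch of the node/split counts mentioned above, assuming fully grown binary trees; this is a back-of-envelope estimate, not Spark's actual internal bookkeeping.]

    // Worst-case node/split counts for 100 fully grown trees of depth 30 (hypothetical estimate).
    val numTrees = 100
    val maxDepth = 30
    val maxNodesPerTree  = math.pow(2, maxDepth + 1) - 1   // ~2.1e9 nodes per fully grown tree
    val maxSplitsPerTree = math.pow(2, maxDepth) - 1        // ~1.07e9 internal splits per tree
    val totalSplits      = numTrees * maxSplitsPerTree      // ~1.07e11 splits across the forest
    println(f"worst-case splits across the forest: $totalSplits%.2e")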

Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Marco Mistroni
Are you running locally? I found exactly the same issue. Two solutions: reduce data size, or run on EMR. HTH. On 10 Jan 2017 10:07 am, "Julio Antonio Soto" wrote: > Hi, > > I am running into OOM problems while training a Spark ML > RandomForestClassifier (maxDepth of 30, 32 maxBins, 100
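[Editor's note: a minimal sketch of the "reduce data size" workaround suggested above, assuming the training data lives in a DataFrame named trainDF; the name, fraction, and seed are placeholders, not from the original thread.]

    import org.apache.spark.sql.DataFrame

    // Train on a random sample of the data to shrink the working set (hypothetical helper).
    def downsample(trainDF: DataFrame, fraction: Double = 0.1): DataFrame =
      trainDF.sample(withReplacement = false, fraction = fraction, seed = 42L)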

Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Julio Antonio Soto
Hi, I am running into OOM problems while training a Spark ML RandomForestClassifier (maxDepth of 30, 32 maxBins, 100 trees). My dataset is arguably pretty big given the executor count and size (8 executors x 5 GB), with approximately 20M rows and 130 features. The "fun fact" is that a single
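[Editor's note: a minimal sketch of the configuration described in this message; the label/features column names and the training DataFrame are assumptions, and setMaxMemoryInMB is shown only as one knob that trades memory for extra passes during histogram aggregation.]

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setLabelCol("label")          // assumed column name
      .setFeaturesCol("features")    // assumed column name
      .setNumTrees(100)
      .setMaxDepth(30)               // deep trees are the main driver of node/split metadata
      .setMaxBins(32)
      .setMaxMemoryInMB(512)         // default is 256; per-executor budget for histogram aggregation

    // val model = rf.fit(trainDF)   // trainDF: ~20M rows, 130 features in the original report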