I had a similar experience last week, and I could not find any error trace
either.

Later on, I did the following to get rid of the problem:
i) Downgraded to Spark 2.0.0
ii) Decreased the values of maxBins and maxDepth

Additionally, make sure that you set featureSubsetStrategy to "auto" to
let the algorithm choose the feature subset strategy based on the number of
trees. Finally, set impurity to "gini" as the criterion used to compute the
information gain.
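
For reference, here is a rough sketch of how those settings could be put
together with the DataFrame-based PySpark API (pyspark.ml). The helper names
rf_settings and build_classifier, the "label"/"features" column names, and
the concrete numbers for maxDepth, maxBins and numTrees are my own
illustrative assumptions, not values I tuned for your data. The 0.05
subsampling rate is simply the one from your message.

```python
def rf_settings(max_depth=4, max_bins=16, num_trees=20):
    """Collect the settings discussed above as keyword arguments.

    The defaults are illustrative assumptions, not recommendations.
    Note that maxBins must be at least the number of categories of
    any categorical feature in the data.
    """
    return {
        "numTrees": num_trees,            # more than 1, so it is a real forest
        "maxDepth": max_depth,            # reduced to shrink each tree
        "maxBins": max_bins,              # reduced to cut per-node statistics
        "featureSubsetStrategy": "auto",  # let Spark pick the strategy
        "impurity": "gini",               # criterion for information gain
        "subsamplingRate": 0.05,          # sample rate from the original post
    }


def build_classifier(**overrides):
    """Create a RandomForestClassifier from the settings above.

    pyspark is imported lazily so rf_settings() can be inspected without
    Spark installed. An active SparkSession is needed to call this.
    """
    from pyspark.ml.classification import RandomForestClassifier

    params = rf_settings()
    params.update(overrides)
    # "label" and "features" are assumed column names.
    return RandomForestClassifier(
        labelCol="label", featuresCol="features", **params
    )
```

With a running session, something like build_classifier(maxDepth=3).fit(train_df)
would train the model, where train_df is a hypothetical DataFrame that has
the two columns named above.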

However, setting the number of trees to just 1 gives you neither the real
advantage of a forest nor better predictive performance.


On Dec 9, 2016 11:29 PM, "mhornbech" <mor...@datasolvr.com> wrote:

> Hi
> I have spent quite some time trying to debug an issue with the Random Forest
> algorithm on Spark 2.0.2. The input dataset is relatively large at around
> 600k rows and 200MB, but I use subsampling to make each tree manageable.
> However, even with only 1 tree and a low sample rate of 0.05, the job hangs at
> one of the final stages (see attached). I have checked the logs on all
> executors and the driver and find no traces of error. Could it be a memory
> issue even though no error appears? The error does seem sporadic to some
> extent, so I also wondered whether it could be a data issue that only occurs
> if the subsample includes the bad data rows.
> Please comment if you have a clue.
> Morten
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
