Ewan Higgs created SPARK-13434:
----------------------------------

             Summary: Reduce Spark RandomForest memory footprint
                 Key: SPARK-13434
                 URL: https://issues.apache.org/jira/browse/SPARK-13434
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Linux
            Reporter: Ewan Higgs
The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on GitHub (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.

Using Spark 1.6, a user of mine is running into problems running RandomForest training on largish datasets on machines with 64G of memory and the following in {{spark-defaults.conf}}:

{code}
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
{code}

I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G with 686 columns) in the Spark shell with {{spark-shell --driver-memory 30G --executor-memory 30G}} and collected a heap profile on a single machine by running {{jmap -histo:live <spark-pid>}}. I took a sample every 5 seconds, and at the peak it looks like this:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5428073     8458773496  [D
   2:      12293653     4124641992  [I
   3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
   5:      72853787     1165660592  scala.Some
   6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
   7:         72969      390492744  [B
   8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
  10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
  11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:       3764745       60235920  java.lang.Integer
  13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:        380804       45361144  [C
  15:        268887       34877128  <constMethodKlass>
  16:        268887       34431568  <methodKlass>
  17:        908377       34042760  [Lscala.collection.immutable.HashMap;
  18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
  19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
  20:         20206       25979864  <constantPoolKlass>
  21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
  22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
  24:         20206       20158864  <instanceKlassKlass>
  25:         17023       14380352  <constantPoolCacheKlass>
  26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:        445797       10699128  scala.Tuple2
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
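For anyone wanting to repeat the 5-second sampling described in the report, it can be scripted. This is a minimal sketch, assuming a POSIX shell and that {{jmap}} from the same JDK as the Spark process is on the PATH; the {{sample_heap}} helper name and the output file naming are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: dump a live-object histogram of a running JVM
# (e.g. the spark-shell driver) every INTERVAL seconds, COUNT times.
# Usage: sample_heap <pid> [count] [interval]
sample_heap() {
  pid=$1
  count=${2:-120}     # default: 120 samples
  interval=${3:-5}    # default: every 5 seconds, as in the report
  i=0
  while [ "$i" -lt "$count" ]; do
    # -histo:live forces a full GC first, so only live objects are counted
    jmap -histo:live "$pid" > "histo-${pid}-$(date +%s).txt"
    sleep "$interval"
    i=$((i + 1))
  done
}
```

The resulting per-timestamp files can then be compared to see which classes grow towards the peak.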