[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156921#comment-15156921 ]
Ewan Higgs commented on SPARK-13434:
------------------------------------

SPARK-3728 is titled with a similar intent to this one, but its description immediately sets out to discuss writing data to disk to handle out-of-memory data. This ticket is focused on reducing the memory used in the first place.

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate datasets. This was raised in a user's benchmarking game on github (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the RandomForest training on largish datasets on machines with 64G memory and the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an input CSV of 1.3G and 686 columns) in spark-shell with {{spark-shell --driver-memory 30G --executor-memory 30G}} and took a heap profile on a single machine by running {{jmap -histo:live <spark-pid>}} every 5 seconds. At the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}
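For context, here is a minimal spark-shell sketch of the kind of training run described above. This is an assumption-laden reconstruction, not the benchmark's actual code: the input path is a placeholder, the CSV is parsed densely for brevity (the SparseVector entries in the histogram suggest the real run used sparse features), and numTrees/maxDepth are illustrative values. The {{impurity = "entropy"}} choice matches the EntropyAggregator instances in the histogram.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Parse the benchmark CSV (label assumed in the first column) into
// LabeledPoints. "/path/to/train.csv" is a placeholder for the 1.3G,
// 686-column input.
val data = sc.textFile("/path/to/train.csv").map { line =>
  val cols = line.split(',')
  LabeledPoint(cols.head.toDouble, Vectors.dense(cols.tail.map(_.toDouble)))
}.cache()

// Train the forest. numTrees and maxDepth are illustrative placeholders,
// not the benchmark's actual settings.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "entropy",
  maxDepth = 20,
  maxBins = 32,
  seed = 42)
{code}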
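Back-of-envelope arithmetic over the peak histogram (byte counts copied from above) shows that the per-node model objects, rather than the cached training points, dominate the live heap:

{code}
// Byte totals copied from the jmap histogram above.
val nodeBytes    = 1820501984L  // org.apache.spark.mllib.tree.model.Node
val predictBytes = 1698189632L  // org.apache.spark.mllib.tree.model.Predict
val gainBytes    =  910750848L  // ...InformationGainStats
val someBytes    = 1165660592L  // scala.Some wrappers held by the above
val modelBytes   = nodeBytes + predictBytes + gainBytes + someBytes

val pointBytes = 26400000L + 26400000L  // LabeledPoint + SparseVector
println(f"tree-model objects: ${modelBytes / 1e9}%.2f GB") // ~5.60 GB
println(f"training points:    ${pointBytes / 1e9}%.2f GB") // ~0.05 GB
{code}

The large {{[D}} and {{[I}} entries at the top (~12.6 GB of primitive arrays) are plausibly dominated by DTStatsAggregator stats buffers and node-index arrays, which would make per-node statistics aggregation the other big consumer; that is an inference from the class names, not something the histogram proves.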