[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156921#comment-15156921 ]

Ewan Higgs commented on SPARK-13434:
------------------------------------

SPARK-3728 is titled with a similar intent, but its description immediately 
sets out to discuss writing data to disk to handle data that does not fit in 
memory.

This ticket is focused instead on reducing the memory used.

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on github 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running 
> RandomForest training on largish datasets on machines with 64G of memory and 
> the following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
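> For context, here is a minimal sketch of the kind of {{RandomForest}} 
> training call that hits this. The CSV parsing, path, and parameter values 
> are illustrative assumptions, not the user's exact job; entropy impurity 
> matches the {{EntropyAggregator}} entries in the profile below:
> {code}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.tree.RandomForest
>
> // Parse a "label,f1,...,fN" CSV into LabeledPoints (illustrative).
> val data = sc.textFile("hdfs:///path/to/train.csv").map { line =>
>   val cols = line.split(',').map(_.toDouble)
>   LabeledPoint(cols.head, Vectors.dense(cols.tail))
> }.cache()
>
> // Classification with entropy impurity; numTrees and maxDepth are the
> // knobs that multiply the Node/Predict/InformationGainStats counts.
> val model = RandomForest.trainClassifier(data,
>   numClasses = 2,
>   categoricalFeaturesInfo = Map[Int, Int](),
>   numTrees = 100,
>   featureSubsetStrategy = "sqrt",
>   impurity = "entropy",
>   maxDepth = 20,
>   maxBins = 32,
>   seed = 42)
> {code}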
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark-shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and captured a heap profile on a 
> single machine by running {{jmap -histo:live <spark-pid>}}. I took a sample 
> every 5 seconds, and at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}
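> The top model entries track one another. As a rough sketch of why (field 
> layout abridged from memory of the Spark 1.6 {{mllib}} sources; treat it as 
> an assumption to check against {{Node.scala}}, not a quote of it):
> {code}
> // Sketch abridged from org.apache.spark.mllib.tree.model in Spark 1.6
> // (not the exact source; see Node.scala there for the real classes).
> // Each of the ~32.5M tree nodes carries its own Predict object plus four
> // Option-wrapped fields, which is where the scala.Some and
> // InformationGainStats instance counts above come from.
> class Predict(val predict: Double, val prob: Double)
> class Split                 // stands in for the real Split(feature, ...)
> class InformationGainStats  // stands in for the gain/impurity stats
>
> class Node(
>     val id: Int,
>     var predict: Predict,                     // one Predict per node
>     var impurity: Double,
>     var isLeaf: Boolean,
>     var split: Option[Split],                 // Some(...) on internal nodes
>     var leftNode: Option[Node],
>     var rightNode: Option[Node],
>     var stats: Option[InformationGainStats])  // another Some per node
> {code}
> Per the histogram, {{Node}}, {{Predict}}, {{scala.Some}}, and 
> {{InformationGainStats}} together hold roughly 1.8G + 1.7G + 1.2G + 0.9G 
> ≈ 5.6G, on top of the ~12.6G of primitive double ({{[D}}) and int 
> ({{[I}}) arrays.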


