Ewan Higgs created SPARK-13434:
----------------------------------
Summary: Reduce Spark RandomForest memory footprint
Key: SPARK-13434
URL: https://issues.apache.org/jira/browse/SPARK-13434
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.6.0
Environment: Linux
Reporter: Ewan Higgs
The RandomForest implementation can easily run out of memory on moderate
datasets. This was raised in a user's benchmarking project on GitHub
(https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was
a tracking issue, but I couldn't find one.
Using Spark 1.6, a user of mine is running out of memory when training a
RandomForest on largish datasets, on machines with 64G of memory and the
following in {{spark-defaults.conf}}:
{code}
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
{code}
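For context, the training call is essentially the classification setup from the
benchm-ml benchmark. A minimal sketch is below; the hyperparameter values are
placeholders taken from the benchmark's RandomForest configuration, not
necessarily the user's exact settings:
{code}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint] parsed from the input CSV (parsing not shown).
def train(data: RDD[LabeledPoint]) =
  RandomForest.trainClassifier(
    input = data,
    numClasses = 2,                        // assumption: binary classification
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 500,                        // placeholder; larger forests hit the limit sooner
    featureSubsetStrategy = "sqrt",
    impurity = "entropy",                  // consistent with the EntropyAggregator rows below
    maxDepth = 20,                         // placeholder; deep trees dominate the Node/Predict counts
    maxBins = 32)
{code}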
I reproduced the excessive memory use from the benchmark example (using an
input CSV of 1.3G with 686 columns) in the Spark shell with {{spark-shell
--driver-memory 30G --executor-memory 30G}}, and took a heap profile on a
single machine by running {{jmap -histo:live <spark-pid>}}. I sampled every 5
seconds (a sketch of the sampling loop follows the histogram); at the peak it
looks like this:
{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5428073     8458773496  [D
   2:      12293653     4124641992  [I
   3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
   5:      72853787     1165660592  scala.Some
   6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
   7:         72969      390492744  [B
   8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
  10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
  11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:       3764745       60235920  java.lang.Integer
  13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:        380804       45361144  [C
  15:        268887       34877128  <constMethodKlass>
  16:        268887       34431568  <methodKlass>
  17:        908377       34042760  [Lscala.collection.immutable.HashMap;
  18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
  19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
  20:         20206       25979864  <constantPoolKlass>
  21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
  22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
  24:         20206       20158864  <instanceKlassKlass>
  25:         17023       14380352  <constantPoolCacheKlass>
  26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:        445797       10699128  scala.Tuple2
{code}
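The 5-second sampling described above can be approximated with a small loop;
this is only a sketch, assuming {{jmap}} is on the PATH, and the pid and sample
count are placeholders:
{code}
import scala.sys.process._
import java.io.File

val sparkPid = "12345" // placeholder: pid of the Spark process being profiled

// Dump a live-heap histogram every 5 seconds, one file per sample.
for (i <- 0 until 60) {
  (Seq("jmap", "-histo:live", sparkPid) #> new File(f"histo-$i%03d.txt")).!
  Thread.sleep(5000)
}
{code}
An equivalent shell loop works just as well; the point is that the histogram
above is one sample of a series, with the peak shown.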