Hi,
I am testing RandomForestClassification with 50 GB of data, which is cached
in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
original dataset.
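For reference, my setup looks roughly like this (a simplified sketch; the
path is a placeholder for my actual input):

```scala
import org.apache.spark.storage.StorageLevel

// Original dataset pinned in memory up front so that interactive
// queries stay fast (placeholder path; `sc` is the SparkContext).
val data = sc.textFile("hdfs:///path/to/data")
data.persist(StorageLevel.MEMORY_ONLY)
data.count() // materialize the cache (~28 GB of storage memory)
```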
When I run random forest, it caches around 300 GB of intermediate data,
which evicts the original dataset from the cache. This caching is triggered
by the following code in RandomForest.scala:
```scala
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```
Since I have no control over this storage level, I cannot ensure that the
original dataset stays in memory for other interactive tasks while random
forest is running.
Would it be a good idea to make this storage level a user-configurable
parameter? If so, I can open a JIRA issue and submit a PR for it.
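To make the proposal concrete, here is a rough sketch of the change I have
in mind. The parameter name `intermediateStorageLevel` and its placement on
Strategy are my own assumptions, not existing API; the default keeps
today's behaviour:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical: Strategy gains a user-settable level for the bagged
// intermediate RDD, e.g. in Strategy.scala:
//   var intermediateStorageLevel: StorageLevel =
//     StorageLevel.MEMORY_AND_DISK

// RandomForest.scala would then persist with the user-supplied level
// instead of the hard-coded one:
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    numTrees, withReplacement, seed)
  .persist(strategy.intermediateStorageLevel)
```

With something like this I could pass, say, StorageLevel.DISK_ONLY (or
parse a level from a config string with StorageLevel.fromString) so the
intermediate data never competes with my cached dataset for storage memory.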
--
Regards,
Madhukara Phatak
http://datamantra.io/