Hi,

I am testing random forest classification on 50 GB of data that is cached in memory. The machine has 64 GB of RAM, of which 28 GB is used to cache the original dataset.
When I run random forest, it caches around 300 GB of intermediate data, which evicts the original dataset from memory. The caching is triggered by the following code in RandomForest.scala:

```
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
```

Since I have no control over this storage level, I cannot ensure the original dataset stays in memory for other interactive tasks while random forest is running.

Would it be a good idea to make this storage level a user-configurable parameter? If so, I can open a JIRA issue and submit a PR for it.

--
Regards,
Madhukara Phatak
http://datamantra.io/
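P.S. A rough sketch of what I have in mind (the parameter and setter names below are hypothetical, not an existing Spark API):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical user-facing API -- name is illustrative only:
//   rf.setIntermediateStorageLevel(StorageLevel.DISK_ONLY)
//
// Inside RandomForest.scala, the persist call would then use the
// user-supplied level instead of the hard-coded MEMORY_AND_DISK:
//
//   val baggedInput = BaggedPoint
//     .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
//       withReplacement, seed)
//     .persist(intermediateStorageLevel)
```

With something like this, a user running interactive queries alongside training could push the intermediate data to disk and keep the original dataset resident in memory.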