Hi, I opened a JIRA: https://issues.apache.org/jira/browse/SPARK-20723
Can someone have a look?

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak <phatak....@gmail.com> wrote:
> Hi,
>
> I am testing RandomForestClassification with 50 GB of data cached in
> memory. I have 64 GB of RAM, of which 28 GB is used to cache the original
> dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which uncaches the original dataset. This caching is triggered by the
> following code in RandomForest.scala:
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
>     numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> As I don't have control over the storage level, I cannot make sure the
> original dataset stays in memory for other interactive tasks while random
> forest is running.
>
> Would it be a good idea to make this storage level a user parameter? If
> so, I can open a JIRA issue and submit a PR for it.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/

--
Regards,
Madhukara Phatak
http://datamantra.io/
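P.S. For concreteness, here is a minimal sketch of what a user-facing parameter could look like. The name `intermediateStorageLevel` is hypothetical (it is not an existing Spark parameter); the sketch only assumes `StorageLevel.fromString`, which Spark does provide for parsing level names.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical user parameter: the storage level for intermediate data,
    // given as a string such as "MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY".
    val intermediateStorageLevel: String = "MEMORY_AND_DISK_SER"

    // StorageLevel.fromString converts the name into a StorageLevel instance
    // and throws IllegalArgumentException for unknown names.
    val level: StorageLevel = StorageLevel.fromString(intermediateStorageLevel)

    // The persist call in RandomForest.scala would then use the parsed level
    // instead of the hard-coded StorageLevel.MEMORY_AND_DISK:
    //
    // val baggedInput = BaggedPoint
    //   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
    //     numTrees, withReplacement, seed)
    //   .persist(level)

    println(level)
  }
}
```

This keeps the current MEMORY_AND_DISK behaviour as the default while letting users pick a serialized or disk-only level when the intermediate data would otherwise evict a cached dataset.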