Hi, I have a script that at some point generates a list of DataFrame objects which are rather large matrices. I then feed this list to BiocParallel::bplapply() and process them.

Previously, I noticed that on our SGE-managed cluster using MulticoreParam() led to 5 to 8 times higher memory usage, as I posted in https://support.bioconductor.org/p/62551/#62877. Martin replied in https://support.bioconductor.org/p/62551/#62880 that "Probably the tools used to assess memory usage are misleading you." This could be true, but they are the tools that determine memory usage for all jobs in the cluster. That means that if my memory usage blows up according to these tools, my jobs get killed.

That was with R 3.1.x, in particular running https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh with

    $ sh step1-fullCoverage.sh brainspan

which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores. I recently tried to reproduce this (to check changes in run time given rtracklayer's improvements with BigWig files) using R 3.2.x, and the memory went up to 450 GB before the job got killed for reaching the maximum memory I specified for the job. The same is true using R 3.2.0.

Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (the only difference is one bug fix, in code not used by this script). I know that BiocParallel changed quite a bit between those versions, and in particular SnowParam(). That's why my prime suspect is BiocParallel.

I made a smaller reproducible example which you can view at http://lcolladotor.github.io/SnowParam-memory/. This example uses a list of data frames with random data, and also uses 10 cores. You can see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam() does use more memory than SnowParam(), as reported by SGE. Beyond the actual session info differences due to changes in BiocParallel's implementation, I noticed that the cluster type changed from PSOCK to SOCK. I don't know whether this could explain the memory increase.
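
For readers who don't want to open the link, here is a minimal sketch of the kind of comparison I mean. It is not the code from the page above: the object sizes, the summarize_df() helper, the worker count, and the choice of back-end from a command-line argument are all made up for illustration. The idea is to submit the same script once per back-end as separate SGE jobs and compare the memory SGE reports for each.

```r
## Sketch only: sizes, helper names and worker count are illustrative
## assumptions, not the values used in the linked example.
library(BiocParallel)

set.seed(20150701)
n_workers <- 2      # the real runs used 10 cores
n_col <- 10

## Hypothetical stand-in for the real input: a list of data frames
## filled with random numbers.
make_df <- function(n_row = 1e4) {
    as.data.frame(matrix(rnorm(n_row * n_col), nrow = n_row, ncol = n_col))
}
df_list <- replicate(10, make_df(), simplify = FALSE)

## Hypothetical per-element work: one mean per column.
summarize_df <- function(df) colMeans(df)

## Pick the back-end from the first command-line argument so that each
## back-end can be run, and measured by SGE, as its own job.
backend <- commandArgs(trailingOnly = TRUE)[1]
if (is.na(backend)) backend <- "serial"
param <- switch(backend,
    serial = SerialParam(),
    snow = SnowParam(workers = n_workers),
    multicore = MulticoreParam(workers = n_workers),
    stop("unknown back-end: ", backend)
)

res <- bplapply(df_list, summarize_df, BPPARAM = param)

## R's own view of the master process memory, to set next to what SGE
## reports for the whole job (master plus workers).
print(gc())
print(sessionInfo())
```

Assuming the script is saved as, say, compare-backends.R (a hypothetical name), each job would run something like "Rscript compare-backends.R snow". Note that gc() only sees the master R process, so any disagreement with the SGE numbers is itself informative.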

The example doesn't generate the huge fold change between R 3.1.x and the other two versions (still 1.27x > 1x) that I see with my analysis script, so in that sense it's not the best example for the problem I'm observing. My tests with https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh were between June 23rd and 28th, so maybe some recent changes in BiocParallel addressed this issue.

I'm not sure how to proceed now. One idea is to make another example with the same type of objects and operations I use in my analysis script. A second one is to run my analysis script with SerialParam() on the different R versions to check whether they use different amounts of memory, which would suggest that the memory issue is not caused by SnowParam(). For example, maybe changes in rtracklayer are the ones driving the huge memory changes I'm seeing in my analysis scripts. However, I don't really suspect rtracklayer, given the memory load that I checked manually a couple of times with "qmem".

I believe that the memory blows up at https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124 which in turn uses derfinder::filterData(). This function imports:

* '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
* Rle, DataFrame from S4Vectors
* Reduce method from S4Vectors

https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51

Best,
Leo

The history of the analysis scripts doesn't reveal any other leads:
https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel