I have had (potentially transient and environment-related) problems with bplapply in gQTLstats. I substituted the foreach abstractions and the code worked. I still have difficulty seeing how to diagnose the trouble I ran into.
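A minimal sketch of that kind of substitution, assuming the work is an lapply-style map over a list. run_tasks() and its arguments are hypothetical names made up for illustration and are not part of gQTLstats or BiocParallel. The foreach branch assumes the doParallel package is installed, and a BatchJobs branch is left out.

```r
# Hypothetical helper: run FUN over X with a backend chosen by name,
# so the parallel machinery can be swapped in one place.
run_tasks <- function(X, FUN,
                      backend = c("serial", "parallel", "foreach", "BiocParallel"),
                      workers = 2L) {
  backend <- match.arg(backend)
  switch(backend,
    # Baseline with no parallel machinery at all.
    serial = lapply(X, FUN),
    # Socket cluster from the base 'parallel' package.
    parallel = {
      cl <- parallel::makeCluster(workers)
      on.exit(parallel::stopCluster(cl), add = TRUE)
      parallel::parLapply(cl, X, FUN)
    },
    # foreach with a doParallel backend (assumes doParallel is installed).
    foreach = {
      cl <- parallel::makeCluster(workers)
      on.exit(parallel::stopCluster(cl), add = TRUE)
      doParallel::registerDoParallel(cl)
      `%dopar%` <- foreach::`%dopar%`
      foreach::foreach(x = X) %dopar% FUN(x)
    },
    # BiocParallel with a SnowParam; another BPPARAM could be used instead.
    BiocParallel = BiocParallel::bplapply(
      X, FUN, BPPARAM = BiocParallel::SnowParam(workers = workers)
    )
  )
}

# Same task, backend selected by a single argument.
res <- run_tasks(1:4, function(i) i^2, backend = "serial")
```

Running the same job first with backend = "serial" and then with each parallel backend is a crude way to see whether the trouble follows one backend or shows up everywhere.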
I'd suggest that you code so that you can easily substitute parallel-, foreach-, or BatchJobs-based cluster control. This can help crudely isolate the source of trouble.

It would be very nice to have a way of measuring resource usage in cluster settings, both for diagnosis and strategy selection. For jobs that succeed, BatchJobs records memory used in its registry database, based on gc(). I would hope that there are tools that could be used to help one figure out how to factor a task so that it is feasible given some view of environment constraints.

It might be useful for you to build an AMI and then a cluster that allows replication on EC2 of the condition you are seeing. This could help with diagnosis and might be a basis for defining better instrumentation tools for both diagnosis and planning.

On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcoll...@jhu.edu> wrote:

> Hi,
>
> I have a script that at some point generates a list of DataFrame
> objects which are rather large matrices. I then feed this list to
> BiocParallel::bplapply() and process them.
>
> Previously, I noticed that in our SGE-managed cluster using
> MulticoreParam() led to 5 to 8 times higher memory usage as I posted
> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
> https://support.bioconductor.org/p/62551/#62880 that "Probably the
> tools used to assess memory usage are misleading you." This could be
> true, but they are the tools that determine memory usage for all jobs
> in the cluster. Meaning that if my memory usage blows up according to
> these tools, my jobs get killed.
>
> That was with R 3.1.x and in particular running
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
> with
>
> $ sh step1-fullCoverage.sh brainspan
>
> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
> I recently tried to reproduce this (to check changes in run time given
> rtracklayer's improvements with BigWig files) using R 3.2.x and the
> memory went up to 450 GB before the job got killed given the maximum
> memory I specified for the job. The same is true using R 3.2.0.
>
> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
> bug fix is different, for other code not used in this script). I know
> that BiocParallel changed quite a bit between those versions, and in
> particular SnowParam(). So that's why my prime suspect is
> BiocParallel.
>
> I made a smaller reproducible example which you can view at
> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
> list of data frames with random data, and also uses 10 cores. You can
> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
> does use more memory than SnowParam(), as reported by SGE. Beyond the
> actual session info differences due to changes in BiocParallel's
> implementation, I noticed that the cluster type changed from PSOCK to
> SOCK. I don't know whether this could explain the memory increase.
>
> The example doesn't generate the huge fold change between R 3.1.x and
> the other two versions (still 1.27x > 1x) that I see with my analysis
> script, so in that sense it's not the best example for the problem I'm
> observing. My tests with
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
> were between June 23rd and 28th, so maybe some recent changes in
> BiocParallel addressed this issue.
>
>
> I'm not sure how to proceed now. One idea is to make another example
> with the same type of objects and operations I use in my analysis
> script.
>
> A second one is to run my analysis script with SerialParam() on the
> different R versions to check if they use different amounts of memory
> which would suggest that the memory issue is not caused by
> SnowParam().
> For example, maybe changes in rtracklayer are the ones
> driving the huge memory changes I'm seeing in my analysis scripts.
>
> However, I don't really suspect rtracklayer given the memory load
> reported that I checked manually a couple of times with "qmem". I
> believe that the memory blows up at
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
> which in turn uses derfinder::filterData(). This function imports:
>
> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
> Rle, DataFrame from S4Vectors
> Reduce method from S4Vectors
>
> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>
>
> Best,
> Leo
>
>
> History of analysis scripts doesn't reveal any other leads
>
> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>
> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel