I have had (potentially transient and environment-related) problems with
bplapply in gQTLstats.  I substituted the foreach abstractions and the
code worked.  I still have difficulty seeing how to diagnose the trouble
I ran into.

I'd suggest that you code so that you can easily substitute parallel-,
foreach-, or BatchJobs-based cluster control.  This can help crudely
isolate the source of trouble.
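
A minimal sketch of what I mean (the function and object names here are
just illustrative, not from gQTLstats):

    library(BiocParallel)
    library(parallel)

    ## toy inputs standing in for real analysis chunks
    chunkList <- split(rnorm(1e4), rep(1:10, each = 1e3))
    analyzeChunk <- function(x) mean(x)

    ## let the caller supply the apply engine, so lapply, mclapply,
    ## bplapply (or a foreach wrapper) can be swapped without touching
    ## the analysis code
    runChunks <- function(chunks, FUN, applyFUN = lapply)
        applyFUN(chunks, FUN)

    res_serial <- runChunks(chunkList, analyzeChunk)
    res_bp <- runChunks(chunkList, analyzeChunk,
        applyFUN = function(X, FUN) bplapply(X, FUN, BPPARAM = SnowParam(2)))
    res_mc <- runChunks(chunkList, analyzeChunk,
        applyFUN = function(X, FUN) mclapply(X, FUN, mc.cores = 2))

Driving everything off a single applyFUN (or BPPARAM) argument makes it
cheap to rerun the same job serially, with multicore, or over a snow or
BatchJobs cluster and compare.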

It would be very nice to have a way of measuring resource usage in
cluster settings, both for diagnosis and strategy selection.  For jobs
that succeed, BatchJobs records memory used in its registry database,
based on gc().  I would hope that there are tools that could help one
figure out how to factor a task so that it is feasible given some view
of environment constraints.
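
In the meantime, a crude per-task measure can be obtained by wrapping
the worker function so that each task reports its own gc() statistics
alongside its result.  A sketch only, reusing the toy chunkList and
analyzeChunk from above:

    library(BiocParallel)

    ## toy inputs, as above
    chunkList <- split(rnorm(1e4), rep(1:10, each = 1e3))
    analyzeChunk <- function(x) mean(x)

    ## wrap the real worker so each task also returns the memory
    ## high-water mark that gc() saw while it ran
    withMemUse <- function(FUN) {
        function(x, ...) {
            gc(reset = TRUE)               # reset the "max used" counters
            res <- FUN(x, ...)
            list(result = res, gc = gc())  # gc() matrix includes "max used"
        }
    }

    out <- bplapply(chunkList, withMemUse(analyzeChunk),
        BPPARAM = SnowParam(2))

    ## peak Mb per task: the "(Mb)" column next to "max used",
    ## summed over Ncells and Vcells
    sapply(out, function(o)
        sum(o$gc[, grep("max used", colnames(o$gc)) + 1]))

Of course this only sees R's own allocations (the gc() heap), so it
won't account for memory that the scheduler attributes to the process
but that R never reports.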

It might be useful for you to build an AMI and then a cluster that allows
replication of
the condition you are seeing on EC2.  This could help with diagnosis and
might be
a basis for defining better instrumentation tools for both diagnosis and
planning.

On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcoll...@jhu.edu>
wrote:

> Hi,
>
> I have a script that at some point generates a list of DataFrame
> objects which are rather large matrices. I then feed this list to
> BiocParallel::bplapply() and process them.
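>
> Roughly, the pattern looks like this toy version (with random data
> standing in for the real coverage DataFrames; this is just an
> illustration, not the actual script):
>
>     library(BiocParallel)
>     library(S4Vectors)
>
>     ## ten largish DataFrames of random data
>     dfList <- replicate(10,
>         DataFrame(as.data.frame(matrix(runif(1e5), ncol = 10))),
>         simplify = FALSE)
>
>     process <- function(df) colSums(as.data.frame(df))
>
>     res_mc   <- bplapply(dfList, process, BPPARAM = MulticoreParam(10))
>     res_snow <- bplapply(dfList, process, BPPARAM = SnowParam(10))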
>
> Previously, I noticed that in our SGE-managed cluster, using
> MulticoreParam() led to 5 to 8 times higher memory usage, as I posted
> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
> https://support.bioconductor.org/p/62551/#62880 that "Probably the
> tools used to assess memory usage are misleading you." This could be
> true, but they are the tools that determine memory usage for all jobs
> in the cluster, meaning that if my memory usage blows up according to
> these tools, my jobs get killed.
>
> That was with R 3.1.x and in particular running
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
> with
>
> $ sh step1-fullCoverage.sh brainspan
>
> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
> I recently tried to reproduce this (to check changes in run time given
> rtracklayer's improvements with BigWig files) using R 3.2.x, and the
> memory went up to 450 GB before the job was killed for exceeding the
> maximum memory I had specified for it. The same is true using R 3.2.0.
>
> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (it differs
> by just one bug fix, in code not used by this script). I know that
> BiocParallel changed quite a bit between those versions, and in
> particular SnowParam(). That's why my prime suspect is BiocParallel.
>
> I made a smaller reproducible example which you can view at
> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
> list of data frames with random data, and also uses 10 cores. You can
> see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
> does use more memory than SnowParam(), as reported by SGE. Beyond the
> actual session info differences due to changes in BiocParallel's
> implementation, I noticed that the cluster type changed from PSOCK to
> SOCK. I don't know whether this could explain the memory increase.
>
> The example doesn't reproduce the huge fold change between R 3.1.x and
> the other two versions that I see with my analysis script (though it
> still shows a 1.27x increase), so in that sense it's not the best
> example for the problem I'm observing. My tests with
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
> were between June 23rd and 28th, so maybe some recent changes in
> BiocParallel addressed this issue.
>
>
> I'm not sure how to proceed now. One idea is to make another example
> with the same type of objects and operations I use in my analysis
> script.
>
> A second one is to run my analysis script with SerialParam() on the
> different R versions to check whether they use different amounts of
> memory, which would suggest that the memory issue is not caused by
> SnowParam(). For example, maybe changes in rtracklayer are driving the
> huge memory changes I'm seeing in my analysis scripts.
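>
> That comparison would hold everything fixed except the back-end, e.g.
> (sketch only, with dfList and process standing in for the real
> objects):
>
>     library(BiocParallel)
>     res_serial <- bplapply(dfList, process, BPPARAM = SerialParam())
>     ## versus MulticoreParam(10) and SnowParam(10) on each R version,
>     ## comparing the peak memory SGE reports for each run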
>
> However, I don't really suspect rtracklayer, given the reported memory
> loads that I checked manually a couple of times with "qmem". I believe
> that the memory blows up at
>
> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
> which in turn uses derfinder::filterData(). This function imports:
>
> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
> Rle, DataFrame from S4Vectors
> Reduce method from S4Vectors
>
> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>
>
> Best,
> Leo
>
>
> The history of the analysis scripts doesn't reveal any other leads:
>
> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>
> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
