Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

Valerie Obenchain Wed, 15 Jul 2015 21:00:33 -0700

Correction below.

On 07/15/2015 04:14 PM, Valerie Obenchain wrote:

Hi,


BiocParallel in release and devel are quite similar so I'd like to
narrow the focus to "before the changes to SnowParam" and after. This
means BiocParallel 1.0.3 (R 3.1) vs BiocParallel 1.3.34 (R 3.2.1) which
is the current devel.

(1) master vs worker memory

I'm more concerned about memory use on the master than the workers. This
is where the code has changed the most and where the data are touched
the most, ie, split into tasks and divided among workers. So it's the
numbers in the mem_email files instead of the logs that I'm more focused
on right now.

(2) virtual vs actual memory

The mem_email files show the virtual memory requested (Max vmem) for the
job but not actual memory used. Can you output the actual used? That
would be helpful.

It does look like more memory is requested (not necessary used) in
BiocParallel 1.3.34 vs 1.0.3.

(3) SGE and mem_free

This is more of an fyi -

It was my understanding that once Grid engine offered the mem_free
option, h_vmem was no longer necessary and actively discouraged by many
cluster admins.

A Grid engine update can enable cgroups which allow mem_free to beenforced, ie, can kill jobs that exceed specified memory. It's in thiscase where mem_free serves the greatest purpose and h_vmem isn't souseful. Not sure if your cluster has cgroups enabled.


Valerie


mem_free is used to track the available physical memory on the node and
when a job is submitted to that node SGE subtracts the requested value
from the available memory. h_vmem sets a threshold on the virtural
memory requested - a job is (usually) killed when the requested memory
exceeds this amount. Because many apps request more memory than they use
scheduling by h_vmem causes the cluster to be very inefficient as you
are reserving chunks of memory that never get used.

You may want to follow up with your cluster admin about this. Maybe
you've already gone down this road and there are relatively small
default memory allocations per slot so it's necessary to specify h_vmem.


I like how you've made information related to this bug report available
in github. Having that space to record thoughts, store log/memory files
and provide code really works well.

Valerie



On 07/14/2015 01:33 PM, Leonardo Collado Torres wrote:

Hi Valerie,

I have re-run my two examples twice using "log = TRUE" and updated the
output at http://lcolladotor.github.io/SnowParam-memory/ As I was
writing this email (all morning...), I made a 4th run where I save the
gc() information to compare against R 3.1.x That fourth run kind of
debunked what I was taking away from replicate runs 2 and 3.



## Between runs


Between runs there's almost no variability with R 3.1.x in the memory
used, a little bit with 3.2.1 and more with R 3.2.0. The variability
shown could be due to using BiocParallel 1.2.9 in runs 2 to 4 vs 1.2.7
in run 1 for R 3.2.0 and 1.3.31 vs 1.3.34 in R 3.2.1.



## gc() output differences

Now, from the gc() output from using "log = TRUE", I don't see nearly
any differences in the first example between R:

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.o6463210

* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.x.o6463213


which makes sense given that the memory was the same

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.txt#L24

* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.x.txt#L24


However, I do notice a large difference versus R 3.1.x

* log
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.1.x.o6463536

* mem info
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.1.x.txt#L50




In the derfinder example, the memory goes from 10.9 GB in R 3.2.0 to
12.87 GB in R 3.2.1 (run 2) with SnowParam(). The 18% increase
reported there is kind of similar to the increase from comparing the
max used mb output from gc():

# R 3.2.1 vs 3.2.0
* gc() max mem used mb ratio: (303.6 + 548.9) / (251.3 + 547.6) =~ 1.07
* max mem used in GB from cluster email:12.871 / 10.904  =~ 1.18

 From this observation, maybe using "log = TRUE" and checking that the
memory used reported by gc() goes down will be useful for testing
changes in BiocParallel() to see if they increase/decrease memory use.
When I run the tests locally (using 2 cores), I get similar ratios;
data at
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/local_run.txt.



However, the same type of comparison but now between R 3.2.1 against R
3.1.x shows that the numbers can be off. Although the 2.8 ratio is
closer to what I saw in my analysis scenario ( > 2.5).

# R 3.2.1 vs 3.1.x
* gc() max mem used mb ratio: (303.6 + 548.9) / (236 + 66) =~ 2.83
* max mem used in GB from cluster email:12.871 / 7.175  =~ 1.79

Numbers from:
# gc() info
* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.x.o6463204#L15-L16

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.o6463201#L15-L16

* 3.1.x
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.1.x.o6463545#L10-L11

# cluster mem info
* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.x.txt#L24

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.txt#L24

* 3.1.x
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.1.x.txt#L50







## Max mem vs gc()


Comparing the memory used in GB from the cluster email to the max used
mb from gc() multiplied by 10 (number of cores used), I see that the
ratio is somewhat consistent.


# first example, R 3.2.0, SnowParam() run 2
7.285 * 1024 / (10 * (23.9 + 461.8)) =~ 1.54
# first example, R 3.2.1, SnowParam(), run 2
7.286 * 1024 / (10 * (24 + 461.8)) =~ 1.54

# derfinder example, R 3.2.0, SnowParam(), run 2
10.904 * 1024 / (10 * (251.3 + 547.6)) =~ 1.4
# derfinder example, R 3.2.1, SnowParam(), run 2
12.871 * 1024 / (10 * (303.6 + 548.9)) =~ 1.55

# derfinder example, R 3.2.0, MulticoreParam(), run 2
13.789 * 1024 / (10 * (230.8 + 770.2)) =~ 1.41
# derfinder example, R 3.2.1, MulticoreParam(), run 2
13.671 * 1024 / (10 * (245.9 + 757.5)) =~ 1.4

And from this other observation, maybe I can use the gc() output from
"log = TRUE" to get an idea if the memory use reported by my cluster
is in line with previous runs, or if there's a cluster issue. This
ratio could also be used to compare different cluster environments to
see which ones are reporting greater/lower memory use.


However, the above ratios are different with R 3.1.x

# first example, R 3.1.x, SnowParam(), run 4
5.036 * 1024 / (10 * (32.1 + 218.5)) =~ 2.06

# derfinder example, R 3.1.x, SnowParam(), run 4
7.175 * 1024 / (10 * (236 + 66)) =~ 2.43
# derfinder example, R 3.1.x, MulticoreParam(), run 4
8.473 * 1024 / (10 * (240.7 + 189.1)) =~ 2.02


I'm not sure if this is a hint, but the largest difference between R
3.1.x and the other two is in the max used mb from Vcells.




Summarizing, I was thinking that

(A) we could use the output from gc() to compare between versions and
check which changes lowered the memory required,
and (B) estimate the actual memory needed as measured by the cluster
as well as compare cluster environments.

However, the gc() numbers from R 3.1.x (only available on the 4th
replicate run) don't seem to support these thoughts.



Or do you interpret these numbers differently?



Best,
Leo

On Sun, Jul 12, 2015 at 11:00 AM, Valerie Obenchain
<voben...@fredhutch.org> wrote:

Hi Leo,

Thanks for the sample code I'll take a look.

You're right, SnowParam has changed quite at bit - logging, error
handling
etc. The memory use you're seeing is a concern - thanks for reporting
it.

As an fyi, the log output for SnowParam and MulticoreParam now includes
gc(), system.time() and other stats from the workers.

SnowParam(log = TRUE)


Valerie




On 07/10/2015 01:12 PM, Leonardo Collado Torres wrote:


Hi,

I ran my example code with SerialParam() which had a negligible 4%
memory increase between R 3.2.x and 3.1.x This 4% could very well
fluctuate a little bit and might be non significantly different from 0
if I run the test more times.

I also added a second example using code based on my analysis script.
With SerialParam(), the memory change is 13%, but with SnowParam()
it's 82% between the R versions mentioned already using 10 cores. It's
still far from the > 150% increase (2.5 fold change) I'm seeing with
the real data.

I initially thought that these observations ruled out everything else
except SnowParam(). However, maybe the initial 13% memory increase
multiplied by 10 (well, less then linear) is what I'm seeing with 10
cores (82% increase).

The updated information is available at
http://lcolladotor.github.io/SnowParam-memory/



As for what Vincent suggested of an AMI and EC2, I don't have
experience with them. I'm not sure I'll be able to look into them and
create a reproducible environment.


Cheers,
Leo

On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
<st...@channing.harvard.edu> wrote:


I have had (potentially transient and environment-related) problems
with
bplapply
in gQTLstats.   I substituted the foreach abstractions and the code
worked.
I still
have difficulty seeing how to diagnose the trouble I ran into.

I'd suggest that you code so that you can easily substitute
parallel- or
foreach- or
BatchJobs-based cluster control.  This can help crudely isolate the
source
of trouble.

It would be very nice to have a way of measuring resource usage in
cluster
settings,
both for diagnosis and strategy selection.  For jobs that succeed,
BatchJobs
records
memory used in its registry database, based on gc().  I would hope
that
there are
tools that could be used to help one figure out how to factor a
task so
that
it is feasible
given some view of environment constraints.

It might be useful for you to build an AMI and then a cluster that
allows
replication of
the condition you are seeing on EC2.  This could help with
diagnosis and
might be
a basis for defining better instrumentation tools for both
diagnosis and
planning.

On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres
<lcoll...@jhu.edu>
wrote:



Hi,

I have a script that at some point generates a list of DataFrame
objects which are rather large matrices. I then feed this list to
BiocParallel::bplapply() and process them.

Previously, I noticed that in our SGE managed cluster using
MulticoreParam() lead to 5 to 8 times higher memory usage as I posted
in https://support.bioconductor.org/p/62551/#62877. Martin posted in
https://support.bioconductor.org/p/62551/#62880 that "Probably the
tools used to assess memory usage are misleading you." This could be
true, but they are the tools that determine memory usage for all jobs
in the cluster. Meaning that if my memory usage blows up according to
these tools, my jobs get killed.

That was with R 3.1.x and in particular running


https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh

with

$ sh step1-fullCoverage.sh brainspan

which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
I recently tried to reproduce this (to check changes in run time
given
rtracklayer's improvements with BigWig files) using R 3.2.x and the
memory went up to 450 GB before the job got killed given the maximum
memory I specified for the job. The same is true using R 3.2.0.

Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
bug fix is different, for other code not used in this script). I know
that BiocParallel changed quite a bit between those versions, and in
particular SnowParam(). So that's why my prime suspect is
BiocParallel.

I made a smaller reproducible example which you can view at
http://lcolladotor.github.io/SnowParam-memory/. This example uses a
list of data frames with random data, and also uses 10 cores. You can
see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
does use more memory than SnowParam(), as reported by SGE. Beyond the
actual session info differences due to changes in BiocParalell's
implementation, I noticed that the cluster type changed from PSOCK to
SOCK. I ignore if this could explain the memory increase.

The example doesn't generate the huge fold change between R 3.1.x and
the other two versions (still 1.27x > 1x) that I see with my analysis
script, so in that sense it's not the best example for the problem
I'm
observing. My tests with


https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh

were between June 23rd and 28th, so maybe some recent changes in
BiocParallel addressed this issue.


I'm not sure how to proceed now. One idea is to make another example
with the same type of objects and operations I use in my analysis
script.

A second one is to run my analysis script with SerialParam() on the
different R versions to check if they use different amounts of memory
which would suggest that the memory issue is not caused by
SnowParam(). For example, maybe changes in rtracklayer are the ones
driving the huge memory changes I'm seeing in my analysis scripts.

However, I don't really suspect rtracklayer given the memory load
reported that I checked manually a couple of times with "qmem". I
believe that the memory blows up at


https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124

which in turn uses derfinder::filterData(). This function imports:

'[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
Rle, DataFrame from S4Vectors
Reduce method from S4Vectors



https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51



Best,
Leo


History of analysis scripts doesn't reveal any other leads


https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh



https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

Reply via email to