Dear Simon,

Thank you for your response! I was not able to provide the requested information earlier since I am not a full-time academic/researcher.

An example of a bam call that may result in an error is:

    bam(formula=Di ~ 1 + Gender + I(L_Dis==0) + s(DisPerc, by=as.numeric(L_Dis==2), bs='cr'),
        offset=log(Ei*Mi), family=poisson, data=dtPF,
        method="fREML", discrete=TRUE, gc.level=2)

Here, dtPF is a data.table object with 22m rows and 21 columns/variables, Gender is a factor variable, L_Dis is an integer variable which equals 0 if DisPerc is missing (manually set to 0.1), equals 1 if DisPerc==0, and equals 2 if DisPerc>0 (ranges from 0 to 0.25).

The sessionInfo() provides the following output:

    R version 3.4.3 (2017-11-30)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Debian GNU/Linux 9 (stretch)

    Matrix products: default
    BLAS/LAPACK: /sara/eb/Debian9/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_sandybridgep-r0.2.20.so

    locale:
     [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
     [4] LC_COLLATE=en_US     LC_MONETARY=en_US    LC_MESSAGES=en_US
     [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C
    [10] LC_TELEPHONE=C       LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

    attached base packages:
    [1] methods   stats     graphics  grDevices utils     datasets  base

    other attached packages:
    [1] mgcv_1.8-27       nlme_3.1-137      data.table_1.12.0

    loaded via a namespace (and not attached):
    [1] compiler_3.4.3  Matrix_1.2-16   tools_3.4.3     splines_3.4.3
    [5] grid_3.4.3      lattice_0.20-38

Thank you for your help!

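Since the data itself cannot be shared, a minimal sketch of the same call on simulated data may be useful. Everything in it is an assumption made for illustration: the object name simPF, the sample size of 1e5 rows, and all simulated distributions are hypothetical and chosen only so that the variables have the structure described above. It is not known whether it reproduces the memory error, and scaling n up would be needed to approach the real problem size.

```r
library(mgcv)

set.seed(1)
n <- 1e5  # assumption: the real data has about 22 million rows

# Hypothetical three-way split that plays the role of L_Dis:
# 0 = DisPerc missing, 1 = DisPerc equal to 0, 2 = DisPerc greater than 0
state <- sample(0:2, n, replace = TRUE)
DisPerc <- ifelse(state == 0, 0.1,               # missing, manually set to 0.1
           ifelse(state == 1, 0, runif(n, 0, 0.25)))

# Simulated stand-in for dtPF; Ei and Mi are arbitrary positive values
simPF <- data.frame(
  Gender  = factor(sample(c("F", "M"), n, replace = TRUE)),
  L_Dis   = as.integer(state),
  DisPerc = DisPerc,
  Ei      = runif(n, 0.5, 1),
  Mi      = runif(n, 0.01, 0.05)
)
simPF$Di <- rpois(n, simPF$Ei * simPF$Mi)

# Same model specification and options as in the call above
fit <- bam(Di ~ 1 + Gender + I(L_Dis == 0) +
             s(DisPerc, by = as.numeric(L_Dis == 2), bs = "cr"),
           offset = log(Ei * Mi), family = poisson, data = simPF,
           method = "fREML", discrete = TRUE, gc.level = 2)
summary(fit)
```
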
Frank

________________________________
From: R-help <r-help-boun...@r-project.org> on behalf of r-help-requ...@r-project.org <r-help-requ...@r-project.org>
Sent: Saturday, March 16, 2019 11:00 AM
To: r-help@r-project.org
Subject: R-help Digest, Vol 193, Issue 16

Date: Fri, 15 Mar 2019 12:31:31 +0000
From: Simon Wood <simon.w...@bath.edu>
To: r-help@r-project.org
Subject: Re: [R] [mgcv] Memory issues with bam() on computer cluster
Message-ID: <d8e2643a-d960-0d86-4296-f0c7fcf14...@bath.edu>

Can you supply the results of sessionInfo() please, and the full bam call that causes this.

best,
Simon (mgcv maintainer)

On 15/03/2019 09:09, Frank van Berkum wrote:
> Dear Community,
>
> In our current research we are trying to fit Generalized Additive Models to a
> large dataset. We are using the package mgcv in R.
>
> Our dataset contains about 22 million records with fewer than 20 risk factors
> for each observation, so in our case n>>p. The dataset covers the period 2006
> until 2011, and we analyse both the complete dataset and datasets in which we
> leave out a single year. The latter part is done to analyse robustness of the
> results. We understand k-fold cross validation may seem more appropriate, but
> our approach is closer to what is done in practice (how will one additional
> year of information affect your estimates?).
>
> We use the function bam as advocated in Wood et al. (2017), and we apply the
> following options: bam(..., discrete=TRUE, chunk.size=10000, gc.level=1).
> We run these analyses on a computer cluster (see
> https://userinfo.surfsara.nl/systems/lisa/description for details), and the
> job is allocated to a node within the computer cluster. A node has at least
> 16 cores and 64 GB memory.
>
> We had expected 64 GB of memory to be sufficient for these analyses,
> especially since the bam function is built specifically for large datasets.
> However, when applying this function to the different datasets described
> above with different regression specifications (different risk factors
> included in the linear predictor), we sometimes obtain errors of the
> following form.
>
> Error in XWyd(G$Xd, w, z, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop, ar.stop, :
>   'Calloc' could not allocate memory (22624897 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWyd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> Error in Xbd(G$Xd, coef, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop) :
>   'Calloc' could not allocate memory (18590685 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> Xbd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> Error: cannot allocate vector of size 1.7 Gb
> Timing stopped at: 2 0.556 4.831
>
> Error in system.time(oo <- .C(C_XWXd0, XWX = as.double(rep(0, (pt + nt)^2)), :
>   'Calloc' could not allocate memory (55315650 of 24 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWXd -> system.time -> .C
> Timing stopped at: 1.056 1.396 2.459
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> The errors seem to arise at different stages in the optimization process. We
> have analysed whether these errors disappear if different settings are used
> (different chunk.size, different gc.level), but this does not resolve our
> problem.
> Also, the errors occur on different datasets when using different
> settings, and even with the same settings an error that occurred on
> dataset X in one run does not necessarily occur on dataset X in a
> different run. When using the discrete=TRUE option, optimization can be
> parallelized, but we have chosen not to employ this feature to ensure
> memory does not have to be shared between parallel processes.
>
> Naturally I cannot share our dataset with you, which makes the problem
> difficult to analyse. However, based on your collective knowledge, could you
> point us to where the problem may occur? Is it something within the C code
> used within the package (as the last error seems to indicate), or is it
> related to the computer cluster?
>
> Any help or insights are much appreciated.
>
> Kind regards,
>
> Frank
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Simon Wood, School of Mathematics, University of Bristol, BS8 1TW UK
https://people.maths.bris.ac.uk/~sw15190/
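
As a rough check on the 'Calloc' errors quoted in the original message, the sketch below converts each failed request to MiB. The element counts and per-element sizes are copied from the error messages; the data frame name alloc and its column names are made up for this illustration, and the node is assumed to have 64 GiB. The individual requests come to roughly 140 MiB to 1.2 GiB, so this only shows that no single request approaches the node's memory; it says nothing about where the rest of the memory went.

```r
# Element counts and per-element sizes copied from the quoted error messages
alloc <- data.frame(
  call       = c("XWyd", "Xbd", "XWXd0"),
  elements   = c(22624897, 18590685, 55315650),
  bytes_each = c(8, 8, 24)
)

# Size of each failed request in MiB (1 MiB = 1024^2 bytes)
alloc$MiB <- alloc$elements * alloc$bytes_each / 1024^2

print(alloc)

# Largest single request as a fraction of an assumed 64 GiB node
max(alloc$MiB) / (64 * 1024)
```
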