Dear Simon,

Thank you for your response! I was not able to provide you with the requested 
information at an earlier stage since I am not a full time academic / 

An example of a bam call that may result in an error is:
bam(formula = Di ~ 1 + Gender + I(L_Dis == 0) +
      s(DisPerc, by = as.numeric(L_Dis == 2), bs = 'cr'),
    offset = log(Ei * Mi), family = poisson, data = dtPF,
    method = "fREML", discrete = TRUE, gc.level = 2)

Here, dtPF is a data.table object with 22 million rows and 21 columns/variables, 
Gender is a factor variable, and L_Dis is an integer variable which equals 0 if 
DisPerc is missing (manually set to 0.1), equals 1 if DisPerc==0, and equals 2 
if DisPerc>0 (DisPerc ranges from 0 to 0.25).
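
Since the real data cannot be shared, a small synthetic stand-in with the same structure may be useful for anyone who wants to try the call. The sketch below is only an illustration: the sample size, the category probabilities, and the way Di, Ei and Mi are generated are assumptions and not properties of dtPF, and dtSim is a hypothetical name. Whether data like this reproduce the memory error is not known.

```r
# Synthetic stand-in for dtPF. All distributions below are assumptions.
set.seed(1)
n <- 1e5  # the real data has about 22 million rows

L_Dis  <- sample(0:2, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))
Gender <- factor(sample(c("F", "M"), n, replace = TRUE))

# DisPerc: placeholder 0.1 when missing (L_Dis == 0), 0 when L_Dis == 1,
# and a positive value up to 0.25 when L_Dis == 2.
DisPerc <- ifelse(L_Dis == 0, 0.1,
           ifelse(L_Dis == 1, 0, runif(n, 0.001, 0.25)))

Ei <- runif(n, 0.1, 1)      # hypothetical exposure
Mi <- runif(n, 0.01, 0.05)  # hypothetical baseline rate
Di <- rpois(n, Ei * Mi * exp(0.2 * (Gender == "M") +
                             2 * DisPerc * (L_Dis == 2)))

dtSim <- data.frame(Di, Gender, L_Dis, DisPerc, Ei, Mi)

# Fit the same model specification as above, if mgcv is installed.
if (requireNamespace("mgcv", quietly = TRUE)) {
  library(mgcv)
  fit <- bam(Di ~ 1 + Gender + I(L_Dis == 0) +
               s(DisPerc, by = as.numeric(L_Dis == 2), bs = "cr"),
             offset = log(Ei * Mi), family = poisson, data = dtSim,
             method = "fREML", discrete = TRUE, gc.level = 2)
  print(summary(fit))
}
```

Increasing n towards the real sample size would show how memory use scales on a given machine, under the assumption that the synthetic covariates behave similarly to the real ones.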

The sessionInfo() provides the following output:
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default

 [1] LC_CTYPE=en_US       LC_NUMERIC=C         LC_TIME=en_US
 [7] LC_PAPER=en_US       LC_NAME=C            LC_ADDRESS=C

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base

other attached packages:
[1] mgcv_1.8-27       nlme_3.1-137      data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.4.3  Matrix_1.2-16   tools_3.4.3     splines_3.4.3
[5] grid_3.4.3      lattice_0.20-38

Thank you for your help!


On 15/03/2019 09:09, Frank van Berkum wrote:
> Dear Community,
> In our current research we are trying to fit Generalized Additive Models to a 
> large dataset. We are using the package mgcv in R.
> Our dataset contains about 22 million records with less than 20 risk factors 
> for each observation, so in our case n>>p. The dataset covers the period 2006 
> until 2011, and we analyse both the complete dataset and datasets in which we 
> leave out a single year. The latter part is done to analyse robustness of the 
> results. We understand k-fold cross validation may seem more appropriate, but 
> our approach is closer to what is done in practice (how will one additional 
> year of information affect your estimates?).
> We use the function bam as advocated in Wood et al. (2017), and we apply the 
> following options: bam(..., discrete=TRUE, chunk.size=10000, gc.level=1). We 
> run these analyses on a computer cluster (see 
> for details), and the 
> job is allocated to a node within the computer cluster. A node has at least 
> 16 cores and 64Gb memory.
> We had expected 64Gb of memory to be sufficient for these analyses, 
> especially since the bam function is built specifically for large datasets. 
> However, when applying this function to the different datasets described 
> above with different regression specifications (different risk factors 
> included in the linear predictor), we sometimes obtain errors of the 
> following form.
> Error in XWyd(G$Xd, w, z, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop, ar.stop, 
>  :
>    'Calloc' could not allocate memory (22624897 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWyd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
> Error in Xbd(G$Xd, coef, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop) :
>    'Calloc' could not allocate memory (18590685 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> Xbd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
> Error: cannot allocate vector of size 1.7 Gb
> Timing stopped at: 2 0.556 4.831
> Error in system.time(oo <- .C(C_XWXd0, XWX = as.double(rep(0, (pt + nt)^2)),  
> :
>    'Calloc' could not allocate memory (55315650 of 24 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWXd -> system.time -> .C
> Timing stopped at: 1.056 1.396 2.459
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
> The errors seem to arise at different stages in the optimization process. We 
> have analysed whether these errors disappear if different settings are used 
> (different chunk.size, different gc.level), but this does not resolve our 
> problem. Also, the errors occur on different datasets when using different 
> settings, and even with the same settings, an error that occurred on 
> dataset X in one run does not necessarily occur on dataset X in a different 
> run. When using the discrete=TRUE option, 
> optimization can be parallelized, but we have chosen to not employ this 
> feature to ensure memory does not have to be shared between parallel 
> processes.
> Naturally I cannot share our dataset with you which makes the problem 
> difficult to analyse. However, based on your collective knowledge, could you 
> point us to where the problem may occur? Is it something within the C-code 
> used within the package (as the last error seems to indicate), or is it 
> related to the computer cluster?
> Any help or insight is much appreciated.
> Kind regards,
> Frank
