I am trying to parallelize a task across columns of a data.table using foreach and doParallel. My data is large relative to my system memory (it occupies about 40% of RAM), so I want to avoid making any copies of the full table. The function I am parallelizing is simple: it takes a fixed set of columns plus a single column that is unique to each task. I'd like to send only the small subset of columns actually needed to each worker node, and ideally also have the option of sending only a subset of rows. My initial attempts at parallelizing did not work as expected and appear to copy the entire data.table to every worker node.
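For context, the per-task work is tiny: each task needs only two columns of DT (and, ideally, only a subset of its rows). Here is a minimal sketch of what I would like each worker to receive (the names oneTask and rowIdx are illustrative only, not code I actually ran):

library(data.table)

# Per-task work: correlate one fixed column with one task-specific column.
oneTask = function(smallDT, xName, yName) {
  cor(smallDT[[xName]], smallDT[[yName]])
}

# Ideally a worker would only ever see something like
#   DT[rowIdx, c("Y", "X1"), with = FALSE]   # two columns, optional row subset
# rather than a copy of the full DT.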
### start code ###
library(data.table)
library(foreach)
library(doParallel)

registerDoParallel()

anotherVar = "Y"
someVars = paste0("X", 1:20)

N = 100000000
# I've chosen N such that my R session consumes ~15 GB of memory
# (according to top) right after DT is created.
DT = as.data.table(matrix(rnorm(21 * N), ncol = 21))
setnames(DT, c(anotherVar, someVars))

MyFun = function(inDT, inX, inY) {
  cor(inDT[[inX]], inDT[[inY]])
}

# Warning: will throw an error in the Mac GUI (R.app).
corrWithY_1 = foreach(i = 1:length(someVars), .combine = c) %dopar%
  MyFun(DT[, c(anotherVar, someVars[i]), with = FALSE], someVars[i], anotherVar)
# Watching top, every worker process also appears to consume the full
# ~15 GB of system memory.

gc()

# So I tried creating an entirely separate subset of DT to send to the
# workers, and then removing it by hand. This, too, appears to take
# ~15 GB of memory per worker according to top.
MyFun2 = function(DT, anotherVar, uniqueVar) {
  tmpData = DT[, c(anotherVar, uniqueVar), with = FALSE]
  out = MyFun(tmpData, anotherVar, uniqueVar)
  rm(tmpData)
  return(out)
}

corrWithY_2 = foreach(i = 1:length(someVars), .combine = c) %dopar%
  MyFun2(DT, anotherVar, someVars[i])
### end code ###

Another thing I've tried is to send only the name of DT and its environment to the workers, but `get` doesn't seem to be able to fetch only a subset of rows from DT, which is something I would need to do frequently (a rough sketch of that attempt is at the end of this message).

Questions:

1. Is top accurately reflecting my R session's memory usage?
2. If so, is there a way to parallelize over the columns of a data.table without copying the entire table to every worker?
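P.S. The `get`-based idea I mentioned above was roughly the sketch below. It only makes sense with forked workers (which is what registerDoParallel() gives on Linux/Mac), since a socket worker would not see DT at all; MyFun3 and the 1:1000 row subset are purely illustrative. The problem is that `get` hands back the whole table, and any row subsetting happens only afterwards.

MyFun3 = function(dtName, inY, inX, rowIdx) {
  # get() returns the *entire* data.table from the worker's global
  # environment; the row/column subset is taken only afterwards.
  fullDT = get(dtName, envir = globalenv())
  tmpData = fullDT[rowIdx, c(inY, inX), with = FALSE]
  cor(tmpData[[inX]], tmpData[[inY]])
}

corrWithY_3 = foreach(i = 1:length(someVars), .combine = c) %dopar%
  MyFun3("DT", anotherVar, someVars[i], 1:1000)   # e.g. only the first 1000 rows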