Dear R gurus,

I have an embarrassingly parallel job that I am trying to speed up with snow
on our local cluster. Basically, I am running ~50,000 t-tests for a series of
microarray experiments, one gene at a time. Thus, I can easily spread the load
across multiple processors and nodes.

I have a master list object that tells me which rows to pick for each gene
when running the t-test on a series of microarray experiments, each containing
~500,000 rows and x columns.

While trying to optimize my function using parLapply(), I quickly realized that
I was not gaining any speed: every time a test was run on one of the items in
the list, the 500,000-row by x-column matrix had to be shipped along with that
item, and the transfer time was actually longer than the computing time.

However, things improve if I first export the 500,000-row matrix to the
spawned processes, as in this mock script:

cl <- makeCluster(nnodes,method)
mArrayData <- getData(experiments)
clusterExport(cl, 'mArrayData')

Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

The test function takes mArrayData as the default value of its array argument,
as in:

t.testFnc <- function(probeList, array=mArrayData){
    x <- array[probeList$A,]
    y <- array[probeList$B,]
    res <- doSomeTest(x,y)
    return(res)
}

Using this strategy, I was able to take full advantage of my cluster and reduce
the analysis time by a factor roughly equal to the number of nodes in our
cluster. The large data matrix was resident in each process and didn't have to
travel over the network every time an item from the list was passed to the
function t.testFnc().
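To make the export-once strategy concrete, here is a minimal, self-contained sketch of the mock script above. It rests on several assumptions that are not part of my real setup: it uses the parallel package shipped with R (whose cluster functions mirror snow's) rather than snow itself, a small random matrix stands in for getData(experiments), a made-up theMapList of 20 genes stands in for the real list, and t.test()$p.value stands in for doSomeTest().

```r
library(parallel)  # assumption: base R's 'parallel' used in place of snow

set.seed(1)
# Hypothetical stand-in for getData(experiments): 1,000 probes x 6 arrays
mArrayData <- matrix(rnorm(1000 * 6), nrow = 1000)

# Hypothetical stand-in for theMapList: 20 genes, 5 probe rows per group
theMapList <- lapply(1:20, function(i) {
  list(A = sample(nrow(mArrayData), 5), B = sample(nrow(mArrayData), 5))
})

# 'array' defaults to mArrayData, which is looked up on the worker
# at the moment the function runs there
t.testFnc <- function(probeList, array = mArrayData) {
  x <- array[probeList$A, ]
  y <- array[probeList$B, ]
  # Stand-in for doSomeTest(x, y)
  t.test(as.vector(x), as.vector(y))$p.value
}

cl <- makeCluster(2)             # two local worker processes
clusterExport(cl, "mArrayData")  # ship the matrix to each worker once
Results <- parLapply(cl, theMapList, t.testFnc)
stopCluster(cl)

length(Results)  # one result per gene in theMapList
```

Because t.testFnc is defined at the top level, only the function and the small list items travel with each task; the matrix itself is already sitting in each worker's global environment.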

However, I quickly realized that the call to clusterExport() works only when I
run the script one line at a time at the top level. When the same steps are
enclosed in a function, the object mArrayData is not exported, presumably
because it is then local to that function rather than a global object in the
master process.
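A minimal reproduction of the behaviour described above might look like the sketch below. It assumes the parallel package as a stand-in for snow, on the assumption that its clusterExport() also looks names up in the global environment by default; runInFunction and localData are hypothetical names used only for illustration.

```r
library(parallel)  # assumption: 'parallel' stands in for snow here

runInFunction <- function() {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # This object exists only inside the current function call
  localData <- matrix(1:6, nrow = 2)
  # With default arguments, clusterExport() searches from the global
  # environment, so a function-local object is not found
  tryCatch({
    clusterExport(cl, "localData")
    "exported"
  }, error = function(e) conditionMessage(e))
}

outcome <- runInFunction()
print(outcome)  # the error message if the export failed, "exported" otherwise
```

Run at the top level with localData defined globally, the same clusterExport() call goes through, which matches what I see when stepping through the script line by line.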

So, what is the alternative for pushing the content of an object to the slaves?
The documentation in the snow package is a bit light and I couldn't find good
examples out there. I don't want to have the function getData() evaluated on
each node because the arguments to that function are humongous and that would
cause way too much traffic on the network. I want the result of the function
getData(), the object mArrayData, propagated to the cluster only once and made
available to downstream functions.
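For concreteness, the behaviour I am after could be sketched roughly as follows. This is only one possible direction under stated assumptions, not a confirmed snow recipe: it uses the parallel package in place of snow, and assignOnWorker, pushToWorkers, sumRows and analyse are hypothetical helpers invented for this illustration. The idea is to send the value itself through clusterCall() and bind it in each worker's global environment.

```r
library(parallel)  # assumption: 'parallel' used in place of snow

# Hypothetical helper, run on each worker: bind 'value' to 'name' globally.
# Defined at the top level so no enclosing data is serialized with it.
assignOnWorker <- function(name, value) {
  assign(name, value, envir = .GlobalEnv)
  NULL  # return nothing, so the data is not sent back to the master
}

# Hypothetical helper, run on the master: one transfer per worker
pushToWorkers <- function(cl, name, value) {
  invisible(clusterCall(cl, assignOnWorker, name, value))
}

# Task function, also top level: mArrayData is resolved on the worker
sumRows <- function(rows) sum(mArrayData[rows, ])

analyse <- function() {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # Stand-in for getData(experiments); local to this function
  mArrayData <- matrix(as.numeric(1:20), nrow = 5)
  pushToWorkers(cl, "mArrayData", mArrayData)
  # Each task carries only row indices, not the matrix
  parLapply(cl, list(1:2, 3:5), sumRows)
}

sums <- analyse()
```

The worker-side functions are deliberately defined outside analyse(): a function created inside analyse() would carry that function's environment, including the large matrix, along with every task.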

Hope this is clear and that a solution will be possible.

Many thanks

Marco

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.

Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
