Re: [R] lost in the SNOW at 4 AM; parallelization confusion...

2008-08-23 Thread Martin Morgan
Hi Eric --

Eric Rupley <[EMAIL PROTECTED]> writes:

> Apologies for what must be a very basic question, but I have not found
> any clear examples on how to design the following:
>
> I would like to run iterative analysis over several processors.  A toy
> example of the analysis is attached; for a resampling function run 1k
> times, with two different sets of conditioning variables i,j on some
> data vec...
>
> What is the usual way to attack such a problem using snow?  My
> understanding up to this point is that one should:
>
> (1) set the random seed to uncorrelate the processors' actions in
> select()
>
> (2) make a function myfunc(vec,i,j) which returns the item of interest
>
> (3) set up a wrapper which iterates through i,j, and makes the call to
> the cluster
>
> (4) call the cluster using clusterApply(cl,vec, myfunc)

I think you're on the right track. You say:

for (i in c(2,4)) { # a series of nested iterations...
for (j in c(5:6)) {
clusterApply(cl, vec, analysis.func, i, j)

The clusterApply call says: for each element of vec, invoke
analysis.func. vec is of length 1000, so you invoke analysis.func 1000
times, and with the outer loops you're calling analysis.func 2 * 2 *
1000 = 4000 times.
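To make the count concrete, here is a rough sketch that uses lapply as a serial stand-in for clusterApply (both call the function once per element of their data argument, so no cluster is needed to see the effect). The names count.calls and n.calls are hypothetical and exist only for this illustration.

```r
# Serial stand-in: lapply(x, fun, ...) makes the same calls that
# clusterApply(cl, x, fun, ...) would hand out to the workers.
vec <- runif(1000, 1, 100)

n.calls <- 0
count.calls <- function(x, i, j) {
    n.calls <<- n.calls + 1   # record one invocation
    length(x)                 # each call receives a single element of vec
}

for (i in c(2, 4)) {
    for (j in 5:6) {
        res <- lapply(vec, count.calls, i, j)
    }
}

n.calls               # 2 * 2 * 1000 invocations in total
unique(unlist(res))   # length of the argument each call received
```

Note that each call sees one number rather than the whole vector, so the function is not doing the same work as in the single-processor loop.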

In your single processor code you have

for (i in c(2,4)) { # a series of nested iterations...
for (j in c(5:6)) {
res <- analysis.func(vec,i,j)

which invokes analysis.func 2 * 2 times. A strategy is to convert your
'for' loops into an appropriate *apply function, which I might do as
(approximately)

> its <- expand.grid(i=c(2, 4), j=c(5, 6))
> mapply(analysis.func, its$i, its$j,
+        MoreArgs=list(vec=vec))
[1] 120719.09  60403.20 144993.44  72468.66

(maybe you mean i=2:4, j=5:6 ?) and then to use the appropriate
cluster* function, e.g.,

> clusterMap(cl, analysis.func, its$i, its$j,
+            MoreArgs=list(vec=vec))
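Putting the pieces together, a self-contained sketch might look like the following. It makes several assumptions that are not in your post: it uses the 'parallel' package that ships with current R (which offers makeCluster, clusterMap and stopCluster under the same names as snow), two local workers, and clusterSetRNGStream with an arbitrary seed in place of clusterSetupRNG. analysis.func is rewritten with replicate, which should compute the same quantity as your loop without growing b one element at a time.

```r
library(parallel)   # assumption: base R's parallel package rather than snow

vec <- runif(1000, 1, 100)

# Same computation as the toy function, without growing b via append()
analysis.func <- function(vec, i, j) {
    b <- replicate(1000, mean(sample(vec, 1000, replace = TRUE)))
    (sum(b) * j) / i
}

# One row per (i, j) combination; i varies fastest
its <- expand.grid(i = c(2, 4), j = c(5, 6))

cl <- makeCluster(2)                 # assumption: two local workers
clusterSetRNGStream(cl, iseed = 42)  # separate random streams per worker

# One call to analysis.func per row of 'its': 4 tasks rather than 4000
res <- clusterMap(cl, analysis.func, its$i, its$j,
                  MoreArgs = list(vec = vec))
stopCluster(cl)

its$result <- unlist(res)
its
```

With only four tasks, no more than four workers can be kept busy, and each task here is small, so the communication overhead may still outweigh the gain for the toy example; the pattern should pay off when each call to analysis.func is expensive.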

Maybe it is now early enough (though not too early?) for that drink?

Martin

> I must be terribly confused based on the results attached below. Any
> advice will be appreciated...
>
>
> Many thanks,
> Best,
> Eric
>
> --
>   Eric Rupley
>   University of Michigan, Museum of Anthropology
>   1109 Geddes Ave, Rm. 4013
>   Ann Arbor, MI 48109-1079
>
>   [EMAIL PROTECTED]
>   +1.734.276.8572
>
>
>
> # set up
> #
> # cl <- makeCluster(7)
> # 8 slaves are spawned successfully. 0 failed.
> #clusterSetupRNG(cl)
> #[1] "RNGstream"
>
>
> vec <- runif(1000,1,100)
> d <- NULL; c.j <- NULL;c.i <- NULL
>
> # the toy function
>
> analysis.func <- function (vec,i,j) {
> b <- NULL
> for (k in c(1:1000)) {
>   a <- sample(vec,1000,replace=T) #requires randoms...
>   b <- append(b, mean(a))
>   }
> c <- (sum(b)*j)/i
> return(c)
> }
>
>
> # the "analysis"
>
> system.time(for (i in c(2,4)) { # a series of nested iterations...
>
>   for (j in c(5:6)) {
>
> d <-  
> append( mean( as.numeric( clusterApply(cl,vec,analysis.func,i,j) ) ) ,
> d)
> # this is ugly and contorted; there has to be a better way?
> c.j <- append(j, c.j)
> c.i <- append(i, c.i)
> }
> })
>
> #   user  system elapsed
> #  9.758   0.291  48.771
> #>
>
> # but the old way is faster...
>
> d <- NULL; c.j <- NULL; c.i <- NULL # set up again
>
> system.time(for (i in c(2,4)) { # a series of nested iterations...
>
>   for (j in c(5:6)) {
>
> d <-append( mean( as.numeric( analysis.func(vec,i,j) )) ,d)
> # keeping it ugly for timing comparison...
> c.j <- append(j, c.j)
> c.i <- append(i, c.i)
> }
> })
>
>
> #   user  system elapsed
> #  0.299   0.002   0.299
> #>  # arrgrgrgrgrg!!!
>
> stopCluster(cl)
> #[1] 1
> sessionInfo()
> #R version 2.7.1 (2008-06-23)
> #i386-apple-darwin8.10.1
> #
> #locale:
> #en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> #
> #attached base packages:
> #[1] stats graphics  grDevices utils datasets  methods   base
> #
> #other attached packages:
> #[1] rlecuyer_0.1 boot_1.2-33  snow_0.3-3   Rmpi_0.5-5
> #
> #loaded via a namespace (and not attached):
> #[1] tools_2.7.1
> date()
> #[1] "Sat Aug 23 04:25:50 2008"
> #>
> #Too late for a drink. Pity.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



[R] lost in the SNOW at 4 AM; parallelization confusion...

2008-08-23 Thread Eric Rupley



Apologies for what must be a very basic question, but I have not found
any clear examples on how to design the following:


I would like to run iterative analysis over several processors.  A toy  
example of the analysis is attached; for a resampling function run 1k  
times, with two different sets of conditioning variables i,j on some  
data vec...


What is the usual way to attack such a problem using snow?  My  
understanding up to this point is that one should:


(1) set the random seed to uncorrelate the processors' actions in  
select()


(2) make a function myfunc(vec,i,j) which returns the item of interest

(3) set up a wrapper which iterates through i,j, and makes the call to  
the cluster


(4) call the cluster using clusterApply(cl,vec, myfunc)

I must be terribly confused based on the results attached below. Any
advice will be appreciated...



Many thanks,
Best,
Eric

--
 Eric Rupley
 University of Michigan, Museum of Anthropology
 1109 Geddes Ave, Rm. 4013
 Ann Arbor, MI 48109-1079

 [EMAIL PROTECTED]
 +1.734.276.8572



# set up
#
# cl <- makeCluster(7)
#   8 slaves are spawned successfully. 0 failed.
#clusterSetupRNG(cl)
#[1] "RNGstream"


vec <- runif(1000,1,100)
d <- NULL; c.j <- NULL;c.i <- NULL

# the toy function

analysis.func <- function (vec,i,j) {
  b <- NULL
  for (k in c(1:1000)) {
    a <- sample(vec,1000,replace=T) #requires randoms...
    b <- append(b, mean(a))
  }
  c <- (sum(b)*j)/i
  return(c)
}


# the "analysis"

system.time(for (i in c(2,4)) { # a series of nested iterations...
  for (j in c(5:6)) {
    d <- append( mean( as.numeric( clusterApply(cl,vec,analysis.func,i,j) ) ), d)
    # this is ugly and contorted; there has to be a better way?
    c.j <- append(j, c.j)
    c.i <- append(i, c.i)
  }
})

#   user  system elapsed
#  9.758   0.291  48.771
#>

# but the old way is faster...

d <- NULL; c.j <- NULL; c.i <- NULL # set up again

system.time(for (i in c(2,4)) { # a series of nested iterations...
  for (j in c(5:6)) {
    d <- append( mean( as.numeric( analysis.func(vec,i,j) ) ), d)
    # keeping it ugly for timing comparison...
    c.j <- append(j, c.j)
    c.i <- append(i, c.i)
  }
})


#   user  system elapsed
#  0.299   0.002   0.299
#>  # arrgrgrgrgrg!!!

stopCluster(cl)
#[1] 1
sessionInfo()
#R version 2.7.1 (2008-06-23)
#i386-apple-darwin8.10.1
#
#locale:
#en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
#
#attached base packages:
#[1] stats graphics  grDevices utils datasets  methods   base
#
#other attached packages:
#[1] rlecuyer_0.1 boot_1.2-33  snow_0.3-3   Rmpi_0.5-5
#
#loaded via a namespace (and not attached):
#[1] tools_2.7.1
date()
#[1] "Sat Aug 23 04:25:50 2008"
#>
#Too late for a drink. Pity.
