In my absentmindedness I'd forgotten to CC this to the list... and BTW, using gc() in the loop increases the runtime...
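For what it's worth, the gc() slowdown is easy to reproduce with a toy loop (a minimal sketch: the matrix size and iteration count below are arbitrary, and the matrix multiply is just a stand-in for the real per-iteration work):

f <- function(n, do.gc) {
    m <- matrix(rnorm(100 * 100), 100, 100)
    for (i in 1:n) {
        x <- m %*% m        ## stand-in for the real work
        if (do.gc) gc()     ## force a full collection every iteration
    }
}
system.time(f(200, do.gc = FALSE))
system.time(f(200, do.gc = TRUE))   ## noticeably slower: each gc() call has a fixed cost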
>> My suggestion is that you try to vectorize the computation as much
>> as you can.
>>
>> From what you've shown, `new' and `ped' need to have the same
>> number of rows, right?
>>
>> Your `off' function seems to be randomly choosing between columns
>> 1 and 2 from its two input matrices (one row each?). You may want
>> to do the sampling all at once instead of looping over the rows.
>> E.g.,
>>
>> > (m <- matrix(1:10, ncol=2))
>>      [,1] [,2]
>> [1,]    1    6
>> [2,]    2    7
>> [3,]    3    8
>> [4,]    4    9
>> [5,]    5   10
>> > (colSample <- sample(1:2, nrow(m), replace=TRUE))
>> [1] 1 1 2 1 1
>> > (x <- m[cbind(1:nrow(m), colSample)])
>> [1] 1 2 8 4 5
>>
>> So you might want to do something like (obviously untested):
>>
>> todo <- ped[,3] * ped[,5] != 0 ## indicator of which rows to work on
>> n.todo <- sum(todo)            ## how many are there?
>> sire <- new[ped[todo, 3], ]
>> dam <- new[ped[todo, 5], ]
>> s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace=TRUE))]
>> d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace=TRUE))]
>> new[todo, 1:2] <- cbind(s.gam, d.gam)
>
> Improving the efficiency of the code is obviously a plus, but the
> real thing I am mesmerised by is the sheer increase in runtime...
> how come it is not a linear increase with dataset size?
>
> Cheers,
>
> Federico
>
> --
> Federico C. F. Calboli
> Department of Epidemiology and Public Health
> Imperial College, St. Mary's Campus
> Norfolk Place, London W2 1PG
>
> Tel +44 (0)20 75941602 Fax +44 (0)20 75943193
>
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
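As a quick, self-contained sanity check of the vectorized snippet quoted above, with made-up data (the pedigree layout is a guess from the thread: ped[, 3] and ped[, 5] hold the row indices of each individual's sire and dam, 0 meaning unknown, and `new' has one row per individual with two gamete columns; note the one-shot assignment only fills rows whose parents' rows are already filled, so it should be applied one generation at a time):

set.seed(1)
n <- 10
sire.id <- c(0, 0, 0, 0, 1, 1, 3, 3, 1, 3)
dam.id  <- c(0, 0, 0, 0, 2, 2, 4, 4, 4, 2)
ped <- cbind(1:n, 0, sire.id, 0, dam.id)  ## columns 3 and 5 as above
new <- matrix(0, n, 2)
new[1:4, ] <- matrix(1:8, 4, 2)           ## founders carry distinct alleles 1..8
todo <- ped[, 3] * ped[, 5] != 0          ## rows with both parents known
sire <- new[ped[todo, 3], , drop = FALSE]
dam  <- new[ped[todo, 5], , drop = FALSE]
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace = TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace = TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)
new  ## rows 5-10 now hold one allele drawn from each parent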
