Re: [R] Replacing for loop with tapply!?
year. Unfortunately, the code breaks down (when uncommenting mat <- NA). I have tried 'ifelse' statements in the functions, but it becomes even more of a mess. I could subset the matrix beforehand, but this would mean merging with a complete matrix afterwards to make it compatible with other years. That would slow things down.

How can I make the code robust for rows containing all missing values?

Thanks for your help,

Sander.

Dimitris Rizopoulos wrote:

for the maximum you could use something like:

    ind[, 1] <- apply(mat, 2, max)

I hope it helps.

Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/336899
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm

----- Original Message -----
From: Sander Oom [EMAIL PROTECTED]
To: Dimitris Rizopoulos [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Sent: Friday, June 10, 2005 12:10 PM
Subject: Re: [R] Replacing for loop with tapply!?

Thanks Dimitris,

Very impressive! Much faster than before.

Thanks to the newly found R.basic, I can simply rotate the result with rotate270{R.basic}:

    mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
    temps <- c(37, 39, 41)
    #
    #ind <- matrix(0, length(temps), ncol(mat))
    ind <- matrix(0, 4, ncol(mat))
    (startDate <- date())
    [1] "Fri Jun 10 12:08:01 2005"
    for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
    ind[4, ] <- colMeans(max(mat))
    Error in colMeans(max(mat)) : 'x' must be an array of at least two dimensions
    (endDate <- date())
    [1] "Fri Jun 10 12:08:02 2005"
    ind <- rotate270(ind)
    ind[1:10,]
       V4 V3 V2 V1
    1   0 56 75 80
    2   0 46 53 60
    3   0 50 58 67
    4   0 60 72 80
    5   0 59 68 76
    6   0 55 67 74
    7   0 62 77 93
    8   0 45 57 67
    9   0 57 68 75
    10  0 61 66 76

However, I have not managed to get the row maximum using your method. It should be 50 for most rows, but my first-guess code gives an error!

Any suggestions?

Sander

Dimitris Rizopoulos wrote:

maybe you are looking for something along these lines:

    mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
    temps <- c(37, 39, 41)
    #
    ind <- matrix(0, length(temps), ncol(mat))
    for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
    ind

I hope it helps.

Best,
Dimitris

----- Original Message -----
From: Sander Oom [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Friday, June 10, 2005 10:50 AM
Subject: [R] Replacing for loop with tapply!?

Dear all,

We have a large data set with temperature data for weather stations across the globe (15000 stations).

For each station, we need to calculate the number of days a certain temperature is exceeded. So far we used the following S code, where mat88 is a matrix containing rows of 365 daily temperatures for each of 15000 weather stations:

    m <- 37
    n <- 2
    outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
    for(i in 1:nrow(mat88)) {
      # i <- 3
      row1 <- as.data.frame(df88[i, ])
      temprow37 <- select.rows(row1, row1 > m)
      temprow39 <- select.rows(row1, row1 > m + n)
      temprow41 <- select.rows(row1, row1 > m + 2 * n)
      outmat88[i, 1] <- max(row1, na.rm = T)
      outmat88[i, 2] <- count.rows(temprow37)
      outmat88[i, 3] <- count.rows(temprow39)
      outmat88[i, 4] <- count.rows(temprow41)
    }
    outmat88

We have transferred the data to a more potent Linux box running R, but still hope to speed up the code.

I know a for loop should be avoided when looking for speed. I also know the answer is in something like tapply, but my understanding of these commands is still too limited to see the solution. Could someone show me the way!?

Thanks in advance,

Sander.
__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Replacing for loop with tapply!?
Hi

On 10 Jun 2005 at 20:05, Sander Oom wrote:

> Dear all,
>
> Dimitris and Andy, thanks for your great help. I have progressed to the
> following code, which runs very fast and effectively:
>
>     mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
>     mat[mat > 45] <- NA
>     # mat <- NA

By this you redefine mat as

    str(mat)
     logi NA

and your code gives an error saying that it has to have some dimensions:

    apply(mat, 1, max, na.rm=TRUE))
    Error in rowSums(mat > temp, na.rm = TRUE) : 'x' must be an array of at least two dimensions

If your matrix has one row full of NAs, it only complains but still computes a value.

    mat[3,] <- NA
    temps <- c(35, 37, 39)
    ind <- rbind(
      t(sapply(temps, function(temp)
        rowSums(mat > temp, na.rm=TRUE) )),
      rowSums(!is.na(mat), na.rm=FALSE),
      apply(mat, 1, max, na.rm=TRUE))
    Warning message: no finite arguments to max; returning -Inf
    ind <- t(ind)
    ind
         [,1] [,2] [,3] [,4] [,5]
    [1,]    5    5    3    9   48
    [2,]    1    1    1    9   42
    [3,]    0    0    0    0 -Inf

>     mat
>     temps <- c(35, 37, 39)
>     ind <- rbind(
>       t(sapply(temps, function(temp) rowSums(mat > temp, na.rm=TRUE) )),
>       rowSums(!is.na(mat), na.rm=FALSE),
>       apply(mat, 1, max, na.rm=TRUE))
>     ind <- t(ind)
>     ind
>
> However, some weather stations have missing values for the whole year.
> Unfortunately, the code breaks down (when uncommenting mat <- NA). I have
> tried 'ifelse' statements in the functions, but it becomes even more of a
> mess. I could subset the matrix beforehand, but this would mean merging
> with a complete matrix afterwards to make it compatible with other years.
> That would slow things down.
>
> How can I make the code robust for rows containing all missing values?

    which(rowSums(!is.na(mat))==0)

This gives you the indices of the rows of your matrix in which all values are NA, and you can use it to fine-tune your code. What you need to do depends on what results you want, that is, how the ind matrix should look after processing mat with one or more rows full of NAs.

HTH
Petr

> Thanks for your help,
>
> Sander.
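A minimal sketch of how the index suggested above might be used, assuming the wanted result for a station without any data is NA rather than -Inf. The names `all_na`, `has_data` and `row_max` are illustrative and do not appear in the thread:

```r
set.seed(42)
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat > 45] <- NA
mat[3, ] <- NA                       # one row with no data at all

# Rows in which every value is missing
all_na <- which(rowSums(!is.na(mat)) == 0)

# Start from NA and compute the maximum only for rows that have data,
# so max() is never called on an empty set and no warning is raised
row_max <- rep(NA_real_, nrow(mat))
has_data <- setdiff(seq_len(nrow(mat)), all_na)
row_max[has_data] <- apply(mat[has_data, , drop = FALSE], 1, max, na.rm = TRUE)
```

Because `row_max` keeps one entry per original row, the result stays aligned with the full matrix and no merging step is needed afterwards.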
Re: [R] Replacing for loop with tapply!?
maybe you are looking for something along these lines:

    mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
    temps <- c(37, 39, 41)
    #
    ind <- matrix(0, length(temps), ncol(mat))
    for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
    ind

I hope it helps.

Best,
Dimitris
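The counting trick above relies on `mat > temps[i]` being a logical matrix whose TRUE values `colSums()` adds up as 1. A small hand-checkable sketch, with made-up temperatures and illustrative names (`temps_mat`, `thresholds`, `counts`) that are not from the thread:

```r
# Three stations (columns) observed on four days (rows)
temps_mat <- matrix(c(36, 38, 40, 42,
                      30, 31, 32, 33,
                      41, 43, 45, 47), nrow = 4)
thresholds <- c(37, 39, 41)

# temps_mat > th is a logical matrix; colSums() counts the TRUEs per column
counts <- sapply(thresholds, function(th) colSums(temps_mat > th))
dimnames(counts) <- list(paste0("station", 1:3), paste0("above_", thresholds))
counts
```

Here `sapply()` returns one column per threshold, so the stations end up in rows without a separate transposition step.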
Re: [R] Replacing for loop with tapply!?
Thanks Dimitris,

Very impressive! Much faster than before.

Thanks to the newly found R.basic, I can simply rotate the result with rotate270{R.basic}:

    mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
    temps <- c(37, 39, 41)
    #
    #ind <- matrix(0, length(temps), ncol(mat))
    ind <- matrix(0, 4, ncol(mat))
    (startDate <- date())
    [1] "Fri Jun 10 12:08:01 2005"
    for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
    ind[4, ] <- colMeans(max(mat))
    Error in colMeans(max(mat)) : 'x' must be an array of at least two dimensions
    (endDate <- date())
    [1] "Fri Jun 10 12:08:02 2005"
    ind <- rotate270(ind)
    ind[1:10,]
       V4 V3 V2 V1
    1   0 56 75 80
    2   0 46 53 60
    3   0 50 58 67
    4   0 60 72 80
    5   0 59 68 76
    6   0 55 67 74
    7   0 62 77 93
    8   0 45 57 67
    9   0 57 68 75
    10  0 61 66 76

However, I have not managed to get the row maximum using your method. It should be 50 for most rows, but my first-guess code gives an error!

Any suggestions?

Sander

--
Dr Sander P. Oom
Animal, Plant and Environmental Sciences, University of the Witwatersrand
Private Bag 3, Wits 2050, South Africa
Tel (work) +27 (0)11 717 64 04
Tel (home) +27 (0)18 297 44 51
Fax +27 (0)18 299 24 64
Email [EMAIL PROTECTED]
Web www.oomvanlieshout.net/sander
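The error reported above arises because `max(mat)` collapses the whole matrix to a single number, which `colMeans()` cannot treat as an array. A small sketch of the difference, using a toy matrix and illustrative names (`overall_max`, `col_max`, `ind_by_station`) that are not from the thread:

```r
mat <- matrix(c(1, 5, 3,
                9, 2, 4), nrow = 3)   # 3 days (rows) x 2 stations (columns)

overall_max <- max(mat)               # a single number, not one value per column
col_max <- apply(mat, 2, max)         # one maximum per column (station)

ind <- rbind(colSums(mat > 2), col_max)
ind_by_station <- t(ind)              # stations in rows, statistics in columns
```

Base R's `t()` is used here as an assumption about what is wanted. It is not necessarily equivalent to `rotate270()`: the output shown in the message has its columns in reversed order (V4 to V1), which a plain transpose would not produce.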
Re: [R] Replacing for loop with tapply!?
for the maximum you could use something like:

    ind[, 1] <- apply(mat, 2, max)

I hope it helps.

Best,
Dimitris
Re: [R] Replacing for loop with tapply!?
Dear all,

Dimitris and Andy, thanks for your great help. I have progressed to the following code, which runs very fast and effectively:

    mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
    mat[mat > 45] <- NA
    # mat <- NA
    mat
    temps <- c(35, 37, 39)
    ind <- rbind(
      t(sapply(temps, function(temp) rowSums(mat > temp, na.rm=TRUE) )),
      rowSums(!is.na(mat), na.rm=FALSE),
      apply(mat, 1, max, na.rm=TRUE))
    ind <- t(ind)
    ind

However, some weather stations have missing values for the whole year. Unfortunately, the code breaks down (when uncommenting mat <- NA). I have tried 'ifelse' statements in the functions, but it becomes even more of a mess. I could subset the matrix beforehand, but this would mean merging with a complete matrix afterwards to make it compatible with other years. That would slow things down.

How can I make the code robust for rows containing all missing values?

Thanks for your help,

Sander.
Re: [R] Replacing for loop with tapply!?
Sander Oom wrote:

> Dear all,
>
> We have a large data set with temperature data for weather stations
> across the globe (15000 stations).
>
> For each station, we need to calculate the number of days a certain
> temperature is exceeded. So far we used the following S code, where mat88
> is a matrix containing rows of 365 daily temperatures for each of 15000
> weather stations:
>
>     m <- 37
>     n <- 2
>     outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
>     for(i in 1:nrow(mat88)) {
>       # i <- 3
>       row1 <- as.data.frame(df88[i, ])
>       temprow37 <- select.rows(row1, row1 > m)
>       temprow39 <- select.rows(row1, row1 > m + n)
>       temprow41 <- select.rows(row1, row1 > m + 2 * n)
>       outmat88[i, 1] <- max(row1, na.rm = T)
>       outmat88[i, 2] <- count.rows(temprow37)
>       outmat88[i, 3] <- count.rows(temprow39)
>       outmat88[i, 4] <- count.rows(temprow41)
>     }
>     outmat88

What you need is not tapply but apply. Something like

    apply(mat88, 1, function(x) sum(x > 30))

where your threshold should replace 30 and the `1' refers to rows. For multiple thresholds:

    apply(mat88, 1, function(x) c(sum(x > 20), sum(x > 25), sum(x > 30)))

Kjetil

> We have transferred the data to a more potent Linux box running R, but
> still hope to speed up the code.
>
> I know a for loop should be avoided when looking for speed. I also know
> the answer is in something like tapply, but my understanding of these
> commands is still too limited to see the solution. Could someone show me
> the way!?
>
> Thanks in advance,
>
> Sander.

--
Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
    -- Mahdi Elmandjra
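One detail worth checking when the function passed to `apply()` returns several values: the results come back as columns, so with stations in rows the output has one row per threshold and usually needs transposing. A small sketch with a made-up 2 x 4 stand-in for `mat88` (the values are invented; the thresholds 20, 25 and 30 are those used in the reply):

```r
# Toy stand-in for mat88: 2 stations (rows) x 4 days (columns)
mat88 <- matrix(c(18, 22, 26, 31,
                  10, 12, 14, 16), nrow = 2, byrow = TRUE)

counts <- apply(mat88, 1, function(x) c(sum(x > 20), sum(x > 25), sum(x > 30)))
dim(counts)           # thresholds in rows, stations in columns
counts <- t(counts)   # back to one row per station
```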
Re: [R] Replacing for loop with tapply!?
OK, so you want to find some summary statistics for each column, where some columns could be completely missing.

Writing a small wrapper should help. When you use apply(), you are actually applying a function to every column (or row). First, let us simulate a dataset with 15 days/rows and 10 stations/columns:

    ### simulate data
    set.seed(1)   # for reproducibility
    mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
    mat[ mat > 45 ] <- NA   # create some missing values
    mat[ , 9 ] <- NA        # station 9's data is completely missing

Here are two examples of such wrappers:

    find.stats1 <- function( data, threshold=c(37,39,41) ){

      n   <- length(threshold)
      out <- matrix( nrow=(n + 1), ncol=ncol(data) )   # initialise

      out[1, ] <- apply(data, 2, function(x)
                        ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))

      for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )

      rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
      colnames(out) <- colnames(data)   # name of the stations
      return( out )
    }

    find.stats2 <- function( data, threshold=c(37,39,41) ){

      n      <- length(threshold)
      excess <- numeric( n )
      out    <- matrix( nrow=(n + 1), ncol=ncol(data) )   # initialise

      # columns that are not completely missing
      good <- which( apply( data, 2, function(x) !all(is.na(x)) ) )

      out[ , good] <- apply( data[ , good, drop=FALSE], 2, function(x){
        m <- max( x, na.rm=T )
        for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
        return( c(m, excess) )
      } )

      rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
      colnames(out) <- colnames(data)   # name of the stations
      return( out )
    }

    find.stats1( mat )
              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    daily_max   44   42   39   41   45   43   42   45   NA    42
    above_37     2    1    2    1    3    2    2    1    0     1
    above_39     2    1    0    1    3    2    1    1    0     1
    above_41     2    1    0    0    2    2    1    1    0     1

    find.stats2( mat )
              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    daily_max   44   42   39   41   45   43   42   45   NA    42
    above_37     2    1    2    1    3    2    2    1   NA     1
    above_39     2    1    0    1    3    2    1    1   NA     1
    above_41     2    1    0    0    2    2    1    1   NA     1

On my laptop 'find.stats1' and 'find.stats2' (which is more flexible) take 7 and 6 seconds respectively to execute on a dataset with 1 stations and 365 days.

Regards, Adai
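Timings like the ones quoted above depend on the machine and the data size, so they are best re-measured locally. A minimal sketch of how that could be done with `system.time()`, assuming an arbitrary size of 2000 stations and reusing only the all-NA guard from the wrappers; the names `big`, `res` and `timing` are illustrative:

```r
set.seed(1)
big <- matrix(sample(-15:50, 365 * 2000, TRUE), 365, 2000)  # 365 days x 2000 stations
big[big > 45] <- NA
big[, 9] <- NA                      # one station without any data

timing <- system.time(
  res <- rbind(
    # guard against all-NA columns so max() never sees an empty set
    daily_max = apply(big, 2, function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)),
    above_37  = colSums(big > 37, na.rm = TRUE)
  )
)
timing["elapsed"]
```

Note that this sketch reports a count of 0 for the station without data, as find.stats1 does, whereas find.stats2 reports NA there; which of the two is appropriate depends on how the results are used later.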