Re: [R] Median computation
Hello everybody,

The code

    dfmed <- lapply(unique(colnames(df)), function(x) rowMedians(as.matrix(df[, colnames(df) == x]), na.rm=TRUE))

takes a really long time to execute (hours). Is there a faster way to do this?

Thanks!

On Tue, May 22, 2012 at 3:46 PM, Preeti pre...@sci.utah.edu wrote:
> Thanks Henrik! Here is the one-liner that I wrote:
>
>     dfmed <- lapply(unique(colnames(df)), function(x) rowMedians(as.matrix(df[, colnames(df) == x]), na.rm=TRUE))
>
> Thanks again!
>
> On Tue, May 22, 2012 at 3:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
>> See rowMedians() of the matrixStats package for replacing apply(x, MARGIN=1, FUN=median).
>>
>> /Henrik
>>
>> On Tue, May 22, 2012 at 12:34 PM, Preeti pre...@sci.utah.edu wrote:
>>> Hi, I have a 250,000 by 300 matrix. I am trying to calculate the median of those columns (by row) with column names that are identical. I would like this to be efficient, since apply(x,1,median), where x is created by choosing only those columns with the same column name and looping on this, is taking a really long time. Is there an efficient way to do this?
>>>
>>> Thanks!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Median computation
Assuming your original matrix IS a matrix (call it yourmat) and not a data frame (whose columns *must* have unique names if you haven't messed with the check.names default), then maybe:

    ### UNTESTED!!!
    thenames <- unique(dimnames(yourmat)[[2]])
    ans <- lapply(thenames, function(nm) {
        apply(yourmat[, dimnames(yourmat)[[2]] == nm, drop=FALSE], 1, median, na.rm=TRUE)
    })

If I got it right, ans should be a list of vectors, one per unique column name, each of which gives the row-wise medians of the columns with that name. This can be combined into a new matrix, e.g. by do.call(cbind, ans), if you like. You could get a matrix answer directly if you use sapply or, maybe faster, vapply instead of lapply, but I find lists simpler to begin with.

I believe this should be reasonably fast. Converting to and from data frames and operating on data frames slows things down a lot, because these are very general structures that must keep track of a lot of overhead when being worked on. Matrices do not.

-- Bert

On Wed, May 23, 2012 at 9:46 AM, Preeti pre...@sci.utah.edu wrote:
> Hello Everybody,
>
> The code
>
>     dfmed <- lapply(unique(colnames(df)), function(x) rowMedians(as.matrix(df[, colnames(df) == x]), na.rm=TRUE))
>
> takes a really long time to execute (hours). Is there a faster way to do this?
>
> Thanks!

--
Bert Gunter
Genentech Nonclinical Biostatistics
Internal Contact Info: Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
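For readers who want something to paste into a session, here is a minimal sketch of the matrix-based approach described above, using vapply so that the result is a matrix straight away. The toy matrix, its size and the helper name allnames are assumptions made for illustration only; the comparison is done against the full vector of column names, and drop = FALSE guards groups that contain a single column.

```r
# Toy stand-in for the real 250,000 x 300 matrix (sizes are illustrative).
set.seed(1)
yourmat <- matrix(round(runif(6 * 8, 0, 200)), nrow = 6)
colnames(yourmat) <- rep(c("a", "b", "c", "a"), times = 2)

allnames <- colnames(yourmat)   # one entry per column, duplicates included
thenames <- unique(allnames)    # one entry per group

# One column of row-wise medians per unique column name.
# drop = FALSE keeps a single-column group as a matrix so apply() still works.
ans <- vapply(
  thenames,
  function(nm) apply(yourmat[, allnames == nm, drop = FALSE], 1, median, na.rm = TRUE),
  numeric(nrow(yourmat))
)

dim(ans)        # rows of yourmat x number of unique names
colnames(ans)   # the unique column names
```

The list version with lapply followed by do.call(cbind, ans) should give the same numbers; vapply just adds a check that every group returns a numeric vector of the expected length.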
Re: [R] Median computation
I wonder how you do this (or maybe on what kind of machine you execute it). I tried it out of curiosity and get

    df = as.data.frame(lapply(1:300,function(x)sample(200,25,T)))
    colnames(df) = sample(letters[1:20],300,T)
    system.time(dfmed <- lapply(unique(colnames(df)), function(x)
        rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE)))
       user  system elapsed
      5.680   0.952   7.171

and those times are in seconds! The time-consuming part was building the data.frame, not the calculation.

The only thing I noticed is that my R process claims some 1.4 GB of memory. That should not be a problem on any recent hardware, but my guess at answering your question would be that this might be your problem, especially if you have other memory-hogging variables like this data frame lying around and you see severe memory swapping effects.

Benno

--
Benno Pütz
Statistical Genetics
MPI of Psychiatry
Kraepelinstr. 2-10
80804 Munich, Germany
T: ++49-(0)89-306 22 222
F: ++49-(0)89-306 22 601
Re: [R] Median computation
Hmm.. that is interesting... I did this on our server machine, which has about 200 cores. So memory is not an issue. Also, building the data frame takes a few minutes at most for me. My code is similar to yours, except that I create my data frame from read.delim(filename) and then drop the first column because it has characters. I don't know why it takes so long on my machine.

On Wed, May 23, 2012 at 11:26 AM, Benno Pütz pu...@mpipsykl.mpg.de wrote:
> I wonder how you do this (or maybe on what kind of machine you execute it). I tried it out of curiosity and get
>
>     df = as.data.frame(lapply(1:300,function(x)sample(200,25,T)))
>     colnames(df) = sample(letters[1:20],300,T)
>     system.time(dfmed <- lapply(unique(colnames(df)), function(x)
>         rowMedians(as.matrix(df[,colnames(df) == x]),na.rm=TRUE)))
>        user  system elapsed
>       5.680   0.952   7.171
>
> and those times are in seconds! The time-consuming part was building the data.frame, not the calculation.
>
> The only thing I noticed is that my R process claims some 1.4 GB of memory. That should not be a problem on any recent hardware, but my guess at answering your question would be that this might be your problem, especially if you have other memory-hogging variables like this data frame lying around and you see severe memory swapping effects.
>
> Benno
Re: [R] Median computation
Just adding a few cents to this: rowMedians(x) is roughly 4-10 times faster than apply(x, MARGIN=1, FUN=median), at least in my local Windows 7 64-bit tests. You can do these simple benchmark runs yourself via the matrixStats/tests/rowMedians.R system test, cf. http://goo.gl/YCJed [R-forge].

/Henrik

On Wed, May 23, 2012 at 10:30 AM, Preeti pre...@sci.utah.edu wrote:
> Hmm.. that is interesting... I did this on our server machine which has about 200 cores. So memory is not an issue. Also, building the dataframe takes about a few minutes maximum for me. My code is similar to yours but for the fact that I create my dataframe from read.delim(filename) and then I drop the first column because it has characters. I don't know why it takes long on my machine.
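A rough way to repeat that comparison locally, as a sketch rather than the package's own system test: the matrix size and variable names below are arbitrary assumptions, and the matrixStats call is guarded so the snippet still runs where the package is not installed. Timings will vary by machine, so none are quoted here.

```r
set.seed(42)
x <- matrix(rnorm(20000 * 30), nrow = 20000)  # much smaller than 250,000 x 300

# Base R: median() is called once per row.
t_apply <- system.time(m1 <- apply(x, 1, median, na.rm = TRUE))
print(t_apply)

# matrixStats: a single call over the whole matrix.
if (requireNamespace("matrixStats", quietly = TRUE)) {
  t_rowmed <- system.time(m2 <- matrixStats::rowMedians(x, na.rm = TRUE))
  print(t_rowmed)
  stopifnot(isTRUE(all.equal(m1, m2)))        # both give the same medians
}
```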
Re: [R] Median computation
On May 23, 2012, at 19:30 , Preeti wrote:
> Hmm.. that is interesting... I did this on our server machine which has about 200 cores. So memory is not an issue. Also, building the dataframe takes about a few minutes maximum for me. My code is similar to yours but for the fact that I create my dataframe from read.delim(filename) and then I drop the first column because it has characters. I don't know why it takes long on my machine.

Are you sure that you actually have any columns with the same name, then? You need read.delim(.., check.names=FALSE); otherwise you just get an expensive identity operation.

Also, you should probably try running Benno's exact code, just for comparison. Some of those multicore machines are really rather slow if you only use one core for your process.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk Priv: pda...@gmail.com
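To make the check.names point concrete, here is a small sketch with a made-up tab-separated table supplied inline through the text argument; the header names and values are invented for illustration and have nothing to do with the poster's file.

```r
# Hypothetical file contents: an id column plus a repeated header "a".
txt <- "id\ta\ta\tb\nr1\t1\t2\t3\nr2\t4\t5\t6\n"

d_default <- read.delim(text = txt)                       # check.names = TRUE
d_raw     <- read.delim(text = txt, check.names = FALSE)  # headers kept as-is

names(d_default)   # "id" "a" "a.1" "b"  (duplicates made unique)
names(d_raw)       # "id" "a" "a"   "b"  (duplicates survive)

# With the default, every name is its own group, so a per-name row median
# just returns each column unchanged.
length(unique(names(d_default)[-1]))   # 3 groups for 3 columns
length(unique(names(d_raw)[-1]))       # 2 groups for 3 columns
```

A quick anyDuplicated(colnames(df)) on the real data frame would show whether any repeated names actually made it through the import.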
Re: [R] Median computation
On Wed, May 23, 2012 at 11:54 AM, peter dalgaard pda...@gmail.com wrote:
> On May 23, 2012, at 19:30 , Preeti wrote:
>> Hmm.. that is interesting... I did this on our server machine which has about 200 cores. So memory is not an issue. Also, building the dataframe takes about a few minutes maximum for me. My code is similar to yours but for the fact that I create my dataframe from read.delim(filename) and then I drop the first column because it has characters. I don't know why it takes long on my machine.
>
> Are you sure that you actually have any columns with the same name, then?

Yes, I am sure of that, and yes, that's how I read it.

> You need read.delim(.., check.names=FALSE); otherwise you just get an expensive identity operation.
>
> Also, you should probably try running Benno's exact code, just for comparison. Some of those multicore machines are really rather slow if you only use one core for your process.
Re: [R] Median computation
Yes, thanks Henrik. I neglected to mention that rowMedians could just be plugged in instead of apply(.., 1, ...).

However, my main point is that that's probably not what matters, as Benno points out. Maybe it's the data frames instead of the matrices, but the process should execute in a few seconds even when done inefficiently (my code). So there's something fishy here.

-- Bert

On Wed, May 23, 2012 at 10:39 AM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
> Just adding a few cents to this: rowMedians(x) is roughly 4-10 times faster than apply(x, MARGIN=1, FUN=median), at least in my local Windows 7 64-bit tests. You can do these simple benchmark runs yourself via the matrixStats/tests/rowMedians.R system test, cf. http://goo.gl/YCJed [R-forge].
>
> /Henrik
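One way to probe the "data frames instead of matrices" suspicion is to time both variants on the same data. The sketch below is only that: the data are simulated, the sizes are small, and the names row_median, via_df and via_mat are invented here. It does not claim which variant wins on any particular machine; it only checks that both give the same answer.

```r
set.seed(7)
df <- as.data.frame(matrix(rnorm(2000 * 12), nrow = 2000))
colnames(df) <- rep(c("a", "b", "c"), times = 4)  # duplicated names on purpose
grp <- colnames(df)

row_median <- function(z) apply(z, 1, median, na.rm = TRUE)

# Variant 1: subset the data frame and convert inside every iteration.
t_df <- system.time(
  via_df <- sapply(unique(grp), function(nm)
    row_median(as.matrix(df[, grp == nm, drop = FALSE])))
)

# Variant 2: convert to a matrix once, then subset the matrix.
m <- as.matrix(df)
t_mat <- system.time(
  via_mat <- sapply(unique(grp), function(nm)
    row_median(m[, grp == nm, drop = FALSE]))
)

print(t_df)
print(t_mat)
stopifnot(isTRUE(all.equal(via_df, via_mat)))     # identical medians
```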
Re: [R] Median computation
On Tue, May 22, 2012 at 01:34:45PM -0600, Preeti wrote:
> Hi, I have a 250,000 by 300 matrix. I am trying to calculate the median of those columns (by row) with column names that are identical. I would like this to be efficient since apply(x,1,median) where x is created by choosing only those columns with same column name and looping on this is taking a really long time. Is there an efficient way to do this?

Hi.

Can you send a simple example of what you want to compute? The 300 medians of the 300 columns, each of length 250'000, may be computed using apply(x,2,median), and this does not take much time. What do you mean by choosing only those columns with the same column name?

Petr Savicky.
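As an illustration of the kind of simple example being asked for, here is one reading of the original question on a tiny made-up matrix: columns sharing a name are grouped, and a median is taken across each group within every row. The values and names are invented; the real data are not shown anywhere in the thread.

```r
# 3 rows, 4 columns; the names "a" and "b" each occur twice.
x <- matrix(c( 1,  2,  3,    # first  "a" column
              10, 20, 30,    # first  "b" column
               5,  6,  7,    # second "a" column
              30, 40, 50),   # second "b" column
            nrow = 3,
            dimnames = list(NULL, c("a", "b", "a", "b")))

# Column medians: one value per column (4 values), not what was asked for.
colmed <- apply(x, 2, median)

# Row-wise medians within each group of same-named columns (3 x 2 result).
grpmed <- sapply(unique(colnames(x)), function(nm)
  apply(x[, colnames(x) == nm, drop = FALSE], 1, median))
grpmed
#      a  b
# [1,] 3 20
# [2,] 4 30
# [3,] 5 40
```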
Re: [R] Median computation
See rowMedians() of the matrixStats package for replacing apply(x, MARGIN=1, FUN=median).

/Henrik

On Tue, May 22, 2012 at 12:34 PM, Preeti pre...@sci.utah.edu wrote:
> Hi, I have a 250,000 by 300 matrix. I am trying to calculate the median of those columns (by row) with column names that are identical. I would like this to be efficient since apply(x,1,median) where x is created by choosing only those columns with same column name and looping on this is taking a really long time. Is there an efficient way to do this?
>
> Thanks!
Re: [R] Median computation
Thanks Henrik! Here is the one-liner that I wrote:

    dfmed <- lapply(unique(colnames(df)), function(x) rowMedians(as.matrix(df[, colnames(df) == x]), na.rm=TRUE))

Thanks again!

On Tue, May 22, 2012 at 3:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
> See rowMedians() of the matrixStats package for replacing apply(x, MARGIN=1, FUN=median).
>
> /Henrik