[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Dear All,

I have a large data frame (2700000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:

Given a data frame "df":

> col1=sample(c(0,1),10, rep=T)
> names = factor(c(rep("A",5),rep("B",5)))
> df = data.frame(names,col1)
> df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

> index = c("A","B")
> col1[[1]]=df$col1[which(df$name=="A")]
> col1[[2]]=df$col1[which(df$name=="B")]

My problem is that the command: *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a "level" could maybe be accessed instantly, but I am not
sure how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for
me.

You could get a small speedup by omitting which(), since you can also subset
directly by a logical vector.

> n <- 2700000
> foo <- data.frame(
+   one = sample(c(0,1), n, rep = T),
+   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588

You might also find use for unstack(), though I didn't see a speedup.

> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

HTH

Peter
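As a small illustration of the point about which(), a logical vector can be used directly as a subscript, so which() is only needed when the integer positions themselves are wanted. The sketch below uses a toy 10-row data frame; the name `foo` follows the message above, while the seed and the helper names `via_which` and `via_logical` are only assumptions for the example.

```r
# Toy data frame (10 rows); seed chosen arbitrarily for reproducibility
set.seed(1)
n <- 10
foo <- data.frame(
  one = sample(c(0, 1), n, replace = TRUE),
  two = factor(c(rep("A", n / 2), rep("B", n / 2)))
)

# Subsetting through integer positions ...
via_which <- foo$one[which(foo$two == "A")]

# ... and directly through the logical vector
via_logical <- foo$one[foo$two == "A"]

# Both routes select the same elements here (no NAs in foo$two)
identical(via_which, via_logical)
```

One caveat: if the grouping column contains NA, logical subsetting returns NA entries for those rows, whereas which() silently drops them.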
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
To simplify:

n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));

# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));

# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));

# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
    user   system  elapsed
0.632653 1.60 0.754717

# Avoid coercing the factor, but instead coerce the level compared to
t3 <- system.time(res <- which(x == match("A", levels(x))));

# ...but gives no speed up
print(t3/t1);
    user   system  elapsed
1.041667 1.00 1.018182

# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))));
print(t4/t1);
 user system elapsed
0.417 0.000 0.3636364

So, the latter seems to be the fastest way to identify those elements.

My $.02

/Henrik

On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:
> You could get a small speedup by omitting which(), you can subset by a
> logical vector also which give a small speedup.
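To make the integer-code comparison concrete, here is a minimal sketch on a 10-element factor standing in for the 2.7e6-element one. The variable names `codes`, `code_A`, `idx_int` and `idx_chr` are illustrative only.

```r
# Small factor standing in for the large one used in the timings
x <- factor(c(rep("A", 5), rep("B", 5)))

# A factor stores integer codes plus a character vector of levels
codes  <- as.integer(x)            # integer code of each element
code_A <- match("A", levels(x))    # position of "A" among the levels

# Compare integer codes instead of strings
idx_int <- which(codes == code_A)

# Same positions as the string comparison
idx_chr <- which(x == "A")
identical(idx_int, idx_chr)
```

The level is looked up once with match(), so the per-element work is a plain integer comparison.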
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Dear Peter and Henrik,

Thanks for your replies - this helps speed up a bit, but I thought
there would be something much faster.

What I mean is that I thought that a particular value of a level
could be accessed instantly, similarly to a "hash" key.

Since I've got about 6000 levels in that data frame, it means that
making a list L of the form
L[[1]] = values of name "1"
L[[2]] = values of name "2"
L[[3]] = values of name "3"
...
would take ~1 hour.

Best,

Emmanuel

2008/8/12 Henrik Bengtsson <[EMAIL PROTECTED]>:
> So, the latter seems to be the fastest way to identify those elements.
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
I still don't understand what you are doing. Can you make a small example
that shows what you have and what you want?

Is ?split what you are after?

Emmanuel Levy wrote:
> What I mean is that I thought that a particular value of a level
> could be accessed instantly, similarly to a "hash" key.
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Sorry for being unclear, I thought the example above was clear enough.

I have a data frame of the form:

      name info
1  YAL001C    1
2  YAL001C    1
3  YAL001C    1
4  YAL001C    1
5  YAL001C    0
6  YAL001C    1
7  YAL001C    1
8  YAL001C    1
9  YAL001C    1
10 YAL001C    1
...    ...

~2700000 lines, and ~6000 different names.

This corresponds to yeast proteins + some info, so there are about 6000 names
like "YAL001C".

I would like to transform this data frame into the following form:

1/ a list, where each protein corresponds to an index, and the info is the vector

> L[[1]]
 [1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1
> L[[2]]
 [1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
etc.

2/ an index, which gives me the position of each protein in the list:

> index
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" ...

I hope this will be clearer!

I'll have a look right now at the split and hash.mat functions.

Thanks for your help,

Emmanuel

2008/8/13 Erik Iverson <[EMAIL PROTECTED]>:
> I still don't understand what you are doing. Can you make a small example
> that shows what you have and what you want?
>
> Is ?split what you are after?
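For reference, the target structure described above (a list of info vectors plus an index of names) can be sketched on a toy data frame with one comparison per name. This is only an illustration with made-up values; the protein names are taken from the index shown above, and `df`, `L` and `index` follow the naming in the message. Each name triggers a full scan of the name column, which is why this approach gets slow with ~6000 names.

```r
# Toy data: made-up info values for three of the names mentioned above
df <- data.frame(
  name = factor(c("YAL001C", "YAL001C", "YAL002W", "YAL003W", "YAL003W")),
  info = c(1, 0, 1, 0, 0)
)

# Index: one entry per distinct name
index <- levels(df$name)

# List: one full scan of df$name for every name in the index
L <- lapply(index, function(nm) df$info[df$name == nm])

L[[1]]   # info values for index[1]
```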
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Wow great! Split was exactly what was needed. It takes about 1 second
for the whole operation :D

Thanks again - I can't believe I never used this function in the past.

All the best,

Emmanuel

2008/8/13 Erik Iverson <[EMAIL PROTECTED]>:
> Is ?split what you are after?
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
split is probably what you are after. Here is an example:

> n <- 2700000
> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> # split it into 6000 lists
> system.time(y <- split(x$value, x$name))
   user  system elapsed
   0.80    0.20    1.07
> str(y[1:10])
List of 10
 $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
 $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
 $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
 $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
 $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
 $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
 $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
 $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
 $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
 $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
>

It takes about 1 second to split into 6000 lists.

On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
> Wow great! Split was exactly what was needed. It takes about 1 second
> for the whole operation :D
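A small sketch of what split() returns, on made-up data, may help relate it to the list-plus-index structure asked for earlier in the thread. The data frame `x` mirrors the column names used above; the values and the names `L` and `index` are assumptions for the example.

```r
# Toy data with made-up names and values
x <- data.frame(
  name  = c("b", "a", "b", "c", "a"),
  value = c(10, 20, 30, 40, 50)
)

# One pass over the data: values grouped by name
L <- split(x$value, x$name)

# The list names act as the index (sorted, because the grouping
# variable is converted to a factor)
index <- names(L)

L[["b"]]   # values for name "b", in original row order
```

Because the grouping is done in a single pass, the cost does not grow with one full scan per name, unlike repeated which(x$name == ...) calls.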
Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
If you want the index, then use:

> system.time(y <- split(seq(nrow(x)), x$name))
   user  system elapsed
   0.81    0.06    0.88
> str(y[1:10])
List of 10
 $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
 $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
 $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
 $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
 $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
 $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
 $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
 $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
 $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
 $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...
>

On Wed, Aug 13, 2008 at 9:09 AM, jim holtman <[EMAIL PROTECTED]> wrote:
> split is probably what you are after. Here is an example:
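As a sketch of how such a row index might be used afterwards, the example below splits row numbers by name on a toy data frame and then pulls out all columns for one name. The data and the name `rows` are made up for illustration; seq_len(nrow(x)) is used here in place of seq(nrow(x)) only because it also behaves sensibly for a zero-row data frame.

```r
# Toy data with made-up names and values
x <- data.frame(
  name  = c("b", "a", "b", "c", "a"),
  value = c(10, 20, 30, 40, 50)
)

# Row positions grouped by name
rows <- split(seq_len(nrow(x)), x$name)

rows[["b"]]        # row numbers where name is "b"
x[rows[["b"]], ]   # all columns for those rows

# Looking up a name is now a list lookup rather than a scan of the column
identical(rows[["a"]], which(x$name == "a"))
```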