Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Dear Peter and Henrik,

Thanks for your replies. This helps speed things up a bit, but I thought there would be something much faster. What I mean is that I thought a particular level of a factor could be accessed instantly, similarly to a hash key.

Since I've got about 6000 levels in that data frame, making a list L of the form

    L[[1]] = values of name 1
    L[[2]] = values of name 2
    L[[3]] = values of name 3
    ...

would take ~1 hour.

Best,
Emmanuel

2008/8/12 Henrik Bengtsson wrote:
> To simplify: [...]
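To make the cost concrete, here is a minimal sketch of the per-level loop described above, run on a tiny made-up data frame (the column names `name` and `info` and the three protein names are assumptions for illustration, not the real data). Each iteration rescans the whole `name` column, so the total work grows with (number of levels) x (number of rows).

```r
# Toy stand-in for the real data; names and sizes are illustrative only.
set.seed(1)
df <- data.frame(
  name = factor(sample(c("YAL001C", "YAL002W", "YAL003W"), 30, replace = TRUE)),
  info = sample(c(0, 1), 30, replace = TRUE)
)

# One full scan of df$name per level.
L <- lapply(levels(df$name), function(lev) df$info[which(df$name == lev)])
index <- levels(df$name)

# Back-of-the-envelope estimate using the figures quoted in the thread:
# roughly 1 second per scan and roughly 6000 levels.
est_minutes <- 6000 * 1 / 60
```

With those figures the estimate comes out at 100 minutes, which is in line with the "~1 hour" order of magnitude mentioned above.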
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
I still don't understand what you are doing. Can you make a small example that shows what you have and what you want? Is ?split what you are after?

Emmanuel Levy wrote:
> Dear Peter and Henrik, Thanks for your replies [...]
__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Sorry for being unclear, I thought the example above was clear enough.

I have a data frame of the form:

       name    info
    1  YAL001C 1
    2  YAL001C 1
    3  YAL001C 1
    4  YAL001C 1
    5  YAL001C 0
    6  YAL001C 1
    7  YAL001C 1
    8  YAL001C 1
    9  YAL001C 1
    10 YAL001C 1
    ...

with ~2700000 lines and ~6000 different names, which correspond to yeast proteins plus some info. So there are about 6000 names like YAL001C.

I would like to transform this data frame into the following form:

1/ a list, where each protein corresponds to an index, and the info is the vector:

    L[[1]]
    [1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1
    L[[2]]
    [1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
    etc.

2/ an index, which gives me the position of each protein in the list:

    index
    [1] YAL001C YAL002W YAL003W YAL005C YAL007C ...

I hope this is clearer! I'll have a look right now at the split and hash.mat functions.

Thanks for your help,
Emmanuel

2008/8/13 Erik Iverson wrote:
> I still don't understand what you are doing. [...] Is ?split what you are after?
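As a rough sketch of the target structure described above (a list of info vectors plus an index of protein names), something like the following would produce it with split(). The five-row data frame is invented for illustration; only the column names `name` and `info` come from the message.

```r
# Tiny made-up example with the same two columns as in the message.
df <- data.frame(
  name = c("YAL001C", "YAL001C", "YAL002W", "YAL002W", "YAL003W"),
  info = c(1, 0, 0, 1, 1)
)

# 1/ the list: one vector of info values per protein name
L <- split(df$info, df$name)

# 2/ the index: position i in 'index' names the protein stored in L[[i]]
index <- names(L)

# Look a protein up by position or directly by name
pos <- match("YAL002W", index)
by_pos  <- L[[pos]]
by_name <- L[["YAL002W"]]
```

Because split() returns a named list, the separate index is mostly a convenience: `L[["YAL002W"]]` reaches the same element without going through `match()`.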
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Wow, great! split() was exactly what was needed. It takes about 1 second for the whole operation :D

Thanks again. I can't believe I never used this function in the past.

All the best,
Emmanuel

2008/8/13 Erik Iverson wrote:
> I still don't understand what you are doing. [...] Is ?split what you are after?
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
split() is probably what you are after. Here is an example:

    n <- 2700000
    x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
    # split it into 6000 lists
    system.time(y <- split(x$value, x$name))
    #>    user  system elapsed
    #>    0.80    0.20    1.07
    str(y[1:10])
    #> List of 10
    #>  $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
    #>  $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
    #>  $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
    #>  $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
    #>  $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
    #>  $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
    #>  $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
    #>  $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
    #>  $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
    #>  $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...

It takes about 1 second to split into 6000 lists.

On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy wrote:
> Wow great! Split was exactly what was needed. [...]
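For anyone who wants to convince themselves that split() returns the same groups as the one-which()-per-level loop, here is a scaled-down check. The sizes (10000 rows, 50 group labels) and the seed are arbitrary choices for this sketch, not values from the thread.

```r
set.seed(42)
n <- 1e4
x <- data.frame(name = sample(1:50, n, TRUE), value = runif(n))

# Single pass over the data
y <- split(x$value, x$name)

# One scan per group label, in the same (sorted) order that split() uses
y_loop <- lapply(sort(unique(x$name)),
                 function(k) x$value[which(x$name == k)])

# split() names its elements; the loop result is unnamed
same <- identical(unname(y), y_loop)
```

Both versions keep the values in their original row order within each group, which is why the comparison can be exact rather than approximate.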
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
If you want the index, then use:

    system.time(y <- split(seq(nrow(x)), x$name))
    #>    user  system elapsed
    #>    0.81    0.06    0.88
    str(y[1:10])
    #> List of 10
    #>  $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
    #>  $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
    #>  $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
    #>  $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
    #>  $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
    #>  $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
    #>  $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
    #>  $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
    #>  $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
    #>  $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...

On Wed, Aug 13, 2008 at 9:09 AM, jim holtman wrote:
> split is probably what you are after. Here is an example: [...]
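One possible use of the row indices, sketched on a small made-up data frame (20 rows, 5 group labels, arbitrary seed): once the positions are stored per name, whole rows can be pulled out without rescanning the name column. `seq_len(nrow(x))` is used here in place of `seq(nrow(x))` only because it also behaves sensibly for a zero-row data frame.

```r
set.seed(7)
x <- data.frame(name = sample(1:5, 20, TRUE), value = runif(20))

# Row positions grouped by name
idx <- split(seq_len(nrow(x)), x$name)

# All columns for the rows whose name is 3, via the stored positions
rows_3 <- x[idx[["3"]], ]
```

This is handy when the data frame has many columns (14 in the original question) and more than one of them is needed per group.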
[R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Dear All,

I have a large data frame (~2700000 lines and 14 columns), and I would like to extract the information in a particular way, illustrated below.

Given a data frame df:

    col1 = sample(c(0,1), 10, rep=T)
    names = factor(c(rep("A",5), rep("B",5)))
    df = data.frame(names, col1)
    df
    #>    names col1
    #> 1      A    1
    #> 2      A    0
    #> 3      A    1
    #> 4      A    0
    #> 5      A    1
    #> 6      B    0
    #> 7      B    0
    #> 8      B    1
    #> 9      B    0
    #> 10     B    0

I would like to transform it into the form:

    index = c("A","B")
    col1[[1]] = df$col1[which(df$name=="A")]
    col1[[2]] = df$col1[which(df$name=="B")]

My problem is that the command

    which(df$name=="A")

takes about 1 second because df is so big. I was thinking that a level could maybe be accessed instantly, but I am not sure how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,
Emmanuel
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy wrote:
> I would like to tranform it in the form:
> [...]

I'm not sure I fully understand your problem; your example would not run for me. You could get a small speedup by omitting which(), since you can also subset by a logical vector.

    n <- 2700000
    foo <- data.frame(
        one = sample(c(0,1), n, rep = T),
        two = factor(c(rep("A", n/2), rep("B", n/2)))
    )
    system.time(out <- which(foo$two=="A"))
    #>    user  system elapsed
    #>   0.566   0.146   0.761
    system.time(out <- foo$two=="A")
    #>    user  system elapsed
    #>   0.429   0.075   0.588

You might also find use for unstack(), though I didn't see a speedup.

    system.time(out <- unstack(foo))
    #>    user  system elapsed
    #>   1.068   0.697   2.004

HTH

Peter
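A small sketch of the point about dropping which(): on data without missing values, subsetting by the logical vector and subsetting by the positions from which() select the same elements. The ten-row data frame and the seed are made up for illustration. The two forms do differ when the comparison yields NA, which is worth knowing before swapping one for the other.

```r
set.seed(3)
foo <- data.frame(
  one = sample(c(0, 1), 10, replace = TRUE),
  two = factor(c(rep("A", 5), rep("B", 5)))
)

by_position <- foo$one[which(foo$two == "A")]
by_logical  <- foo$one[foo$two == "A"]
same <- identical(by_position, by_logical)

# With a missing name, the comparison gives NA for that row:
two_na <- foo$two
two_na[1] <- NA
n_which   <- length(foo$one[which(two_na == "A")])  # which() drops the NA
n_logical <- length(foo$one[two_na == "A"])         # logical index keeps an NA element
```

Here `n_which` is 4 and `n_logical` is 5, because indexing with a logical NA returns an NA element instead of skipping the row.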
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
To simplify:

    n <- 2.7e6;
    x <- factor(c(rep("A", n/2), rep("B", n/2)));

    # Identify 'A':s
    t1 <- system.time(res <- which(x == "A"));

    # To compare a factor to a string, the factor is in practice
    # coerced to a character vector.
    t2 <- system.time(res <- which(as.character(x) == "A"));

    # Interestingly enough, this seems to be faster (repeated many times)
    # Don't know why.
    print(t2/t1);
    #>     user   system  elapsed
    #> 0.632653     1.60 0.754717

    # Avoid coercing the factor, but instead coerce the level compared to
    t3 <- system.time(res <- which(x == match("A", levels(x))));

    # ...but gives no speed up
    print(t3/t1);
    #>     user   system  elapsed
    #> 1.041667     1.00 1.018182

    # But coercing the factor to integers does
    t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))));
    print(t4/t1);
    #>     user    system   elapsed
    #>    0.417     0.000 0.3636364

So, the latter seems to be the fastest way to identify those elements.

My $.02

/Henrik

On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan wrote:
> I'm not sure I fully understand your problem, you example would not run for me. [...]
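A scaled-down check of the variants above, on a ten-element factor chosen for illustration. It suggests why comparing the factor itself to the integer code gives no speed-up: `==` on a factor still works on the level labels, so `x == 1` compares "A" and "B" with "1" and matches nothing, whereas comparing `as.integer(x)` to the code is a plain integer comparison.

```r
x <- factor(c(rep("A", 5), rep("B", 5)))
code_A <- match("A", levels(x))   # integer code of level "A"

hits_label <- which(x == "A")                 # compare by label
hits_code  <- which(as.integer(x) == code_A)  # compare by integer code
hits_mixed <- which(x == code_A)              # factor vs. number: labels vs. "1"
```

`hits_label` and `hits_code` agree, while `hits_mixed` is empty, so the third timing in the benchmark measures a comparison that does not actually select the 'A' elements.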