Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Dear Peter and Henrik,

Thanks for your replies - this speeds things up a bit, but I thought
there would be something much faster.

What I mean is that I thought all the rows for a particular level
could be accessed instantly, much like looking up a hash key.

Since I've got about 6000 levels in that data frame, and each lookup
takes about a second, making a list L of the form
L[[1]] = values of name 1
L[[2]] = values of name 2
L[[3]] = values of name 3
...
would take ~1 hour.
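
For illustration only, hash-style access by name can be sketched with an R environment, assuming the groups are first built in a single pass (here with split(), which comes up elsewhere in this thread). The data frame d, its protein names and its values are made up for the sketch.

```r
# Toy data standing in for the real data frame (names and values are made up)
set.seed(1)
d <- data.frame(
  name = factor(rep(c("YAL001C", "YAL002W", "YAL003W"), times = c(8, 6, 6))),
  info = sample(0:1, 20, replace = TRUE)
)

# One pass over the data builds every group at once
groups <- split(d$info, d$name)

# Environments are hashed, so a lookup by name does not rescan the data
h <- list2env(groups, envir = new.env(hash = TRUE))
h[["YAL002W"]]
```

A plain named list (groups[["YAL002W"]]) gives the same result; the environment only matters if lookup speed by name becomes a concern.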

Best,

Emmanuel




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Erik Iverson
I still don't understand what you are doing.  Can you make a small 
example that shows what you have and what you want?


Is ?split what you are after?


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Sorry for being unclear, I thought the example above was clear enough.

I have a data frame of the form:

  name   info
1  YAL001C 1
2  YAL001C 1
3  YAL001C 1
4  YAL001C 1
5  YAL001C 0
6  YAL001C 1
7  YAL001C 1
8  YAL001C 1
9  YAL001C 1
10 YAL001C 1
...
...
~2700000 lines, and ~6000 different names.

which corresponds to yeast proteins + some info.
So there are about 6000 names like YAL001C

I would like to transform this data frame into the following form:

1/ a list with one element per protein, each element being the vector
of info values for that protein:
> L[[1]]
[1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1
> L[[2]]
[1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
etc.

2/ an index, which gives me the position of each protein in the list:
> index
[1] YAL001C YAL002W YAL003W YAL005C YAL007C ...
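
A minimal sketch of that target structure, assuming split() is acceptable for the grouping. The protein names come from the example; the info values and the data frame name d are invented for illustration.

```r
# Small stand-in for the real data (info values are made up)
d <- data.frame(
  name = c("YAL001C", "YAL001C", "YAL002W", "YAL001C", "YAL003W", "YAL002W"),
  info = c(1, 1, 0, 0, 1, 1)
)

grouped <- split(d$info, d$name)   # named list, one vector per protein
index   <- names(grouped)          # protein names, in the same order as the list
L       <- unname(grouped)         # positional list: L[[1]], L[[2]], ...

# Info values for one protein, located through the index
L[[match("YAL002W", index)]]
```

Keeping the names on the list (grouped[["YAL002W"]]) would make the separate index unnecessary; it is split out here only to mirror the two-part form described above.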

I hope this will be clearer!

I'll have a look right now at the split and hash.mat functions.

Thanks for your help,

Emmanuel






Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Wow great! Split was exactly what was needed. It takes about 1 second
for the whole operation :D

Thanks again - I can't believe I never used this function in the past.

All the best,

Emmanuel



Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread jim holtman
split is probably what you are after.  Here is an example:

> n <- 2700000
> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> # split it into a list of 6000 vectors
> system.time(y <- split(x$value, x$name))
   user  system elapsed
   0.80    0.20    1.07
> str(y[1:10])
List of 10
 $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
 $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
 $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
 $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
 $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
 $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
 $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
 $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
 $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
 $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...

It takes about 1 second to split the data into 6000 vectors.


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread jim holtman
If you want the index, then use:

> system.time(y <- split(seq(nrow(x)), x$name))
   user  system elapsed
   0.81    0.06    0.88
> str(y[1:10])
List of 10
 $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
 $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
 $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
 $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
 $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
 $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
 $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
 $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
 $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
 $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...
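
One possible use of such a row-index list, sketched on a much smaller data set: pulling every column for a given name at once, which matters when the data frame has more columns than the one being split. The column "other" and the sizes are assumptions made for the sketch.

```r
set.seed(42)
n <- 1000                                   # small n so the sketch runs instantly
x <- data.frame(name  = sample(1:20, n, TRUE),
                value = runif(n),
                other = rnorm(n))           # hypothetical extra column

rows <- split(seq_len(nrow(x)), x$name)     # row numbers grouped by name

# The list is named by the character form of each name, hence "7"
sub <- x[rows[["7"]], ]                     # all columns for name 7
```

The index is built once; each later extraction is then an ordinary subset by row number rather than a fresh comparison over the whole column.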





[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Emmanuel Levy
Dear All,

I have a large data frame (2700000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:


Given a data frame df:

> col1=sample(c(0,1),10, rep=T)
> names = factor(c(rep("A",5),rep("B",5)))
> df = data.frame(names,col1)
> df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

> index = c("A","B")
> col1[[1]]=df$col1[which(df$name=="A")]
> col1[[2]]=df$col1[which(df$name=="B")]
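
A runnable reading of the transformation, as a sketch only. It assumes the intent is a list with one element per level; the name "result" is hypothetical and is used because col1 is still a numeric vector at that point, so assigning several values to col1[[1]] fails. The col1 values are the ones printed in the table.

```r
names <- factor(c(rep("A", 5), rep("B", 5)))
col1  <- c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)     # the values printed above
df    <- data.frame(names, col1)

index  <- levels(df$names)                   # "A" "B"
result <- vector("list", length(index))      # hypothetical name for the output list

# One which() call per level: each call scans the whole column
for (i in seq_along(index)) {
  result[[i]] <- df$col1[which(df$names == index[i])]
}
```

With k levels and n rows this loop does k full scans, which is where the cost comes from on a large data frame with many levels.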

My problem is that the command:  *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a level could maybe be accessed instantly but I am not
sure about how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Peter Cowan
Emmanuel,

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(), since you can also
subset directly with a logical vector.

> n <- 2700000
> foo <- data.frame(
+   one = sample(c(0,1), n, rep = T),
+   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+   )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588
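
One caveat when dropping which(), shown as a small sketch with made-up vectors v and f: the two forms agree only when the comparison produces no NA.

```r
v <- c(10, 20, 30, 40)
f <- factor(c("A", "B", NA, "A"))

v[which(f == "A")]   # which() drops the NA comparison
v[f == "A"]          # logical subsetting keeps it as an NA element
```

If the grouping column can contain missing values, the which() form is the safer of the two.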

You might also find a use for unstack(), though I didn't see a speedup.
> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

HTH

Peter


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Henrik Bengtsson
To simplify:

n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));

# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));

# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));

# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
    user   system  elapsed
0.632653 1.600000 0.754717

# Avoid coercing the factor, but instead coerce the level compared to
t3 <- system.time(res <- which(x == match("A", levels(x))));

# ...but gives no speed up
print(t3/t1);
    user   system  elapsed
1.041667 1.000000 1.018182

# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))));
print(t4/t1);
     user    system   elapsed
0.4170000 0.0000000 0.3636364

So, the latter seems to be the fastest way to identify those elements.
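
A quick correctness check for the integer-code approach, sketched with a smaller n than above. The variable names code, by_label and by_code are introduced here for illustration.

```r
n <- 1e5                                    # smaller than the 2.7e6 used above
x <- factor(c(rep("A", n/2), rep("B", n/2)))

code <- match("A", levels(x))               # integer code of level "A"

by_label <- which(x == "A")                 # compares level labels
by_code  <- which(as.integer(x) == code)    # compares integer codes

identical(by_label, by_code)

# Without as.integer(), `==` on a factor still compares labels, so the
# labels are compared with the string "1" and nothing matches here
length(which(x == code))
```

So the as.integer() version returns the same positions as the string comparison, whereas comparing the factor itself to the integer code selects nothing for these levels, whatever its timing.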

My $.02

/Henrik

