[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Emmanuel Levy
Dear All,

I have a large data frame (2700000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:


Given a data frame "df":

> col1=sample(c(0,1),10, rep=T)
> names = factor(c(rep("A",5),rep("B",5)))
> df = data.frame(names,col1)
> df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

> index = c("A","B")
> col1[[1]]=df$col1[which(df$name=="A")]
> col1[[2]]=df$col1[which(df$name=="B")]

My problem is that the command:  *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a "level" could maybe be accessed instantly but I am not
sure about how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Peter Cowan
Emmanuel,

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(): you can subset directly
with a logical vector.

> n <- 2700000
> foo <- data.frame(
+   one = sample(c(0,1), n, replace = TRUE),
+   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+   )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588
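
As a small illustration (not part of the original post; the toy data and the
variable names are assumptions), the sketch below checks that subsetting with
the logical vector gives the same values as going through which(), and shows
the one case where the two differ: rows whose name is NA.

```r
# Toy data, assumed for illustration only
set.seed(1)
foo <- data.frame(
  one = sample(c(0, 1), 10, replace = TRUE),
  two = factor(rep(c("A", "B"), each = 5))
)

keep <- foo$two == "A"             # logical vector, one entry per row
by_logical <- foo$one[keep]        # subset directly with the logical vector
by_which   <- foo$one[which(keep)] # subset through integer positions
same <- identical(by_logical, by_which)

# With a missing name the two are no longer interchangeable:
two_na <- foo$two
two_na[3] <- NA
n_logical <- length(foo$one[two_na == "A"])         # NA row kept as an NA element
n_which   <- length(foo$one[which(two_na == "A")])  # NA row dropped
```

So dropping which() is only a safe shortcut when the grouping column has no
missing values.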

You might also find use for unstack(), though I didn't see a speedup.
> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004
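
For reference, a minimal sketch of what unstack() returns on the 10-row
example from the original question. The values are copied from that printout;
the formula is spelled out here rather than relying on the default, which is
an assumption about how one would normally call it.

```r
# The 10-row example from the question
df <- data.frame(
  names = factor(rep(c("A", "B"), each = 5)),
  col1  = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)
)

# One column per level of 'names'; a data frame here because both groups
# have the same length (groups of unequal length come back as a list)
wide <- unstack(df, col1 ~ names)
wide$A
wide$B
```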

HTH

Peter


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Henrik Bengtsson
To simplify:

n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));

# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));

# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));

# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
    user   system  elapsed
0.632653     1.60 0.754717

# Avoid coercing the factor, but instead coerce the level compared to
# (note: == on a factor still compares level labels, so this matches nothing)
t3 <- system.time(res <- which(x == match("A", levels(x))));

# ...but gives no speed up
print(t3/t1);
    user   system  elapsed
1.041667     1.00 1.018182

# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))));
print(t4/t1);
     user    system   elapsed
    0.417     0.000 0.3636364

So, the latter seems to be the fastest way to identify those elements.
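
A small check of the idea above, on a toy factor (the names code_A, idx_chr
and idx_int are made up for this sketch): comparing the integer codes selects
the same elements as comparing against the string, whereas comparing the
factor itself against the integer code selects nothing.

```r
x <- factor(c(rep("A", 5), rep("B", 5)))

code_A  <- match("A", levels(x))            # integer code of level "A"
idx_chr <- which(x == "A")                  # comparison via the labels
idx_int <- which(as.integer(x) == code_A)   # comparison via the integer codes
same    <- identical(idx_chr, idx_int)

# Comparing the factor with the code goes through the labels ("A" vs "1")
idx_none <- which(x == code_A)
```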

My $.02

/Henrik


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Dear Peter and Henrik,

Thanks for your replies - this helps speed up a bit, but I thought
there would be something much faster.

What I mean is that I thought that a particular value of a level
could be accessed instantly, similarly to a "hash" key.

Since I've got about 6000 levels in that data frame, making a list L
of the form
L[[1]] = values of name "1"
L[[2]] = values of name "2"
L[[3]] = values of name "3"
...
at ~1 second per level would take more than an hour.

Best,

Emmanuel




Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Erik Iverson
I still don't understand what you are doing.  Can you make a small 
example that shows what you have and what you want?


Is ?split what you are after?
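
A minimal sketch of what split() does, using the 10-row example from the
original question (values copied from that printout; this is an illustration,
not code from the thread):

```r
df <- data.frame(
  names = factor(rep(c("A", "B"), each = 5)),
  col1  = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)
)

# One pass over the data: a list with one vector of col1 values per level
L <- split(df$col1, df$names)
L[[1]]     # values for the first level, "A"
L[[2]]     # values for the second level, "B"
names(L)   # the levels, in list order
```

The grouping is done in one call over all levels, rather than one which()
scan of the whole column per level.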


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Sorry for being unclear, I thought the example above was clear enough.

I have a data frame of the form:

      name info
1  YAL001C    1
2  YAL001C    1
3  YAL001C    1
4  YAL001C    1
5  YAL001C    0
6  YAL001C    1
7  YAL001C    1
8  YAL001C    1
9  YAL001C    1
10 YAL001C    1
...
~2700000 lines, and ~6000 different names.

which corresponds to yeast proteins + some info.
So there are about 6000 names like "YAL001C"

I would like to transform this data frame into the following form:

1/ a list, where each protein corresponds to an index, and the info is
the vector
> L[[1]]
[1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 
> L[[2]]
[1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 
etc.

2/ an index, which gives me the position of each protein in the list:
> index
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" ...
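
One possible way to get both pieces, sketched on a made-up miniature of the
table (the protein names are taken from the index shown above, the info
values are invented):

```r
# Hypothetical miniature of the protein table
df <- data.frame(
  name = factor(c("YAL001C", "YAL001C", "YAL001C",
                  "YAL002W", "YAL002W", "YAL003W")),
  info = c(1, 1, 0, 0, 1, 1)
)

L     <- split(df$info, df$name)    # 1/ one vector of info per protein
index <- names(L)                   # 2/ protein name for each list position

pos <- match("YAL002W", index)      # position of one protein in the list
L[[pos]]
L[["YAL002W"]]                      # same element, looked up by name
```

Because the list carries the names itself, the separate index is only needed
when positions matter; otherwise elements can be fetched by name directly.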

I hope this will be clearer!

I'll have a look right now at the split and hash.mat functions.

Thanks for your help,

Emmanuel




Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Wow great! Split was exactly what was needed. It takes about 1 second
for the whole operation :D

Thanks again - I can't believe I never used this function in the past.

All the best,

Emmanuel


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread jim holtman
split is probably what you are after.  Here is an example:

> n <- 2700000
> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> # split it into 6000 lists
> system.time(y <- split(x$value, x$name))
   user  system elapsed
   0.80    0.20    1.07
> str(y[1:10])
List of 10
 $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
 $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
 $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
 $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
 $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
 $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
 $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
 $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
 $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
 $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
>
 Takes about 1 second to split into 6000 lists.


Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread jim holtman
If you want the index, then use:

> system.time(y <- split(seq(nrow(x)), x$name))
   user  system elapsed
   0.81    0.06    0.88
> str(y[1:10])
List of 10
 $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
 $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
 $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
 $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
 $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
 $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
 $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
 $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
 $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
 $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...
>
>
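
A sketch of how such a list of row numbers might then be used (a much smaller
n, an arbitrary seed and the names rows and sub are assumptions made for this
example): each element can index the data frame to pull out all rows for one
name.

```r
set.seed(42)                         # arbitrary seed, for repeatability only
n <- 1000                            # far smaller than the real data
x <- data.frame(name = sample(1:20, n, TRUE), value = runif(n))

rows <- split(seq_len(nrow(x)), x$name)   # row numbers for each name
sub  <- x[rows[["3"]], ]                  # all rows whose name is 3

all(sub$name == 3)
# The values reached through the row numbers match a direct split of 'value'
identical(sub$value, split(x$value, x$name)[["3"]])
```

Note that the list names are character strings ("3", not 3), even though the
name column is numeric.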

