Re: [R] how to efficiently compute set unique?

2010-06-22 Thread G FANG
Hi All,

I think I have figured out the problem. I have been a MATLAB user,
so throughout my code I keep everything in as.matrix format, which
makes unique much slower.

When I skip the as.matrix conversion, unique takes just a few
seconds, as do the other computations.

Thanks a lot Duncan, Steve, David, and Douglas,

Hopefully this case can also help future MATLAB-to-R users who get
stuck in the MATLAB way of thinking.

Gang


On Mon, Jun 21, 2010 at 7:01 PM, Douglas Bates wrote:
> On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius wrote:
>>
>> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>>
>>> On 21/06/2010 9:06 PM, G FANG wrote:

>>>> Hi,
>>>>
>>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>>> in tens of millions
>>>>
>>>> when I used the matlab function unique, it takes less than 10 secs
>>>>
>>>> but when I tried to use the unique in R with similar CPU and memory,
>>>> it is not done in minutes
>>>>
>>>> I am wondering, am I using the function in the right way?
>>>>
>>>> dim(cntxtn)
>>>> [1] 13584763        1
>>>> uniqueCntxt = unique(cntxtn);    # this is taking really long
>>>
>>> What type is cntxtn?  If I do that sort of thing on a numeric vector, it's
>>> quite fast:
>>>
>>> > x <- sample(10, size=13584763, replace=T)
>>> > system.time(unique(x))
>>>  user  system elapsed
>>>  3.61    0.14    3.75
>>
>> If it's a factor, it could be as simple as:
>>
>> levels(cntxtn)  # since the work of "unique-ification" has already been
>> done.
>
> Not quite.  When you generate a factor, as you do in your example, the
> levels correspond to the unique values of the original vector.  But
> when you take a subset of a factor the levels are preserved intact,
> even if some of those levels do not occur in the subset.  This is why
> there are unusual arguments with names like drop.unused.levels in
> functions like model.frame.  It is also a subtle difference in the
> behavior of factor(x) and as.factor(x) when x is already a factor.
>
>> ff <- factor(sample.int(200, 1000, replace = TRUE))
>> ff1 <- ff[1:40]
>> length(levels(ff))
> [1] 199
>> length(levels(ff1))
> [1] 199
>> length(levels(as.factor(ff1)))
> [1] 199
>> length(levels(factor(ff1)))
> [1] 34
>
>>> x <- factor(sample(10, size=13584763, replace=T))
>>> system.time(levels(x))
>>   user  system elapsed
>>      0       0       0
>>> system.time(y <- levels(x))
>>   user  system elapsed
>>      0       0       0
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to efficiently compute set unique?

2010-06-21 Thread Douglas Bates
On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius  wrote:
>
> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>
>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>
>>> Hi,
>>>
>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>> in tens of millions
>>>
>>> when I used the matlab function unique, it takes less than 10 secs
>>>
>>> but when I tried to use the unique in R with similar CPU and memory,
>>> it is not done in minutes
>>>
>>> I am wondering, am I using the function in the right way?
>>>
>>> dim(cntxtn)
>>> [1] 13584763        1
>>> uniqueCntxt = unique(cntxtn);    # this is taking really long
>>
>> What type is cntxtn?  If I do that sort of thing on a numeric vector, it's
>> quite fast:
>>
>> > x <- sample(10, size=13584763, replace=T)
>> > system.time(unique(x))
>>  user  system elapsed
>>  3.61    0.14    3.75
>
> If it's a factor, it could be as simple as:
>
> levels(cntxtn)  # since the work of "unique-ification" has already been
> done.

Not quite.  When you generate a factor, as you do in your example, the
levels correspond to the unique values of the original vector.  But
when you take a subset of a factor the levels are preserved intact,
even if some of those levels do not occur in the subset.  This is why
there are unusual arguments with names like drop.unused.levels in
functions like model.frame.  It is also a subtle difference in the
behavior of factor(x) and as.factor(x) when x is already a factor.

> ff <- factor(sample.int(200, 1000, replace = TRUE))
> ff1 <- ff[1:40]
> length(levels(ff))
[1] 199
> length(levels(ff1))
[1] 199
> length(levels(as.factor(ff1)))
[1] 199
> length(levels(factor(ff1)))
[1] 34

>> x <- factor(sample(10, size=13584763, replace=T))
>> system.time(levels(x))
>   user  system elapsed
>      0       0       0
>> system.time(y <- levels(x))
>   user  system elapsed
>      0       0       0
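
A minimal sketch of the point about unused levels, assuming a current
version of R where droplevels() is available. The seed and the variable
names n_all, n_used and n_uniq are illustrative only and do not come
from the thread. It compares the level count a subset inherits with the
number of values that actually occur in it.

```r
set.seed(42)                          # arbitrary seed, for repeatability only
ff  <- factor(sample.int(200, 1000, replace = TRUE))
ff1 <- ff[1:40]                       # subsetting keeps every level of ff

n_all  <- nlevels(ff1)                # levels inherited from ff
n_used <- nlevels(droplevels(ff1))    # only the levels that occur in ff1
n_uniq <- length(unique(ff1))         # distinct values actually present

# levels() overstates the unique set of a subset unless unused levels
# are dropped first; the three "dropping" routes below agree.
stopifnot(n_all == nlevels(ff))
stopifnot(n_used == n_uniq)
stopifnot(n_used == nlevels(factor(ff1)))
stopifnot(n_used == nlevels(ff[1:40, drop = TRUE]))
```

So levels(x) is a safe shortcut for unique(x) only when x is known to
have no unused levels, for example right after factor(x) or
droplevels(x).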



Re: [R] how to efficiently compute set unique?

2010-06-21 Thread Steve Taylor
The original question was about a matrix, not a vector, and this is much slower:
 
x <- sample(10, size=13584763, replace=T)
dim(x) <- c(13584763, 1)
system.time(unique(x))
So the solution would be:
 
unique(as.vector(x))
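
A small sketch of the comparison above, with assumptions stated: n is
much smaller than the 13584763 rows in the thread so that it runs
quickly, and the names x_vec, x_mat and the t_* timings are made up for
illustration. The timings depend on the machine and on the R version
(the gap may be smaller in newer releases), so no expected numbers are
given.

```r
set.seed(1)                                    # arbitrary seed
n <- 200000                                    # assumed size, smaller than in the thread
x_vec <- sample(10, size = n, replace = TRUE)  # plain integer vector
x_mat <- matrix(x_vec, ncol = 1)               # n-by-1 matrix, like cntxtn

t_vec <- system.time(u_vec <- unique(x_vec))
t_mat <- system.time(u_mat <- unique(x_mat))             # matrix method: unique rows
t_fix <- system.time(u_fix <- unique(as.vector(x_mat)))  # drop dim first

# The values are the same either way; the matrix method returns a
# one-column matrix rather than a vector.
stopifnot(identical(u_vec, u_fix))
stopifnot(identical(as.vector(u_mat), u_vec))

# Elapsed seconds for each variant.
print(c(vector    = t_vec[["elapsed"]],
        matrix    = t_mat[["elapsed"]],
        as_vector = t_fix[["elapsed"]]))
```

as.vector() is used here because it is what was suggested above; any
other way of dropping the dim attribute before calling unique() should
have the same effect.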

>>> 

From: Duncan Murdoch 
To:G FANG 
CC:
Date: 22/Jun/2010 1:20p
Subject: Re: [R] how to efficiently compute set unique?
On 21/06/2010 9:06 PM, G FANG wrote:
> Hi,
>
> I want to get the unique set from a large numeric k by 1 vector, k is
> in tens of millions
>
> when I used the matlab function unique, it takes less than 10 secs
>
> but when I tried to use the unique in R with similar CPU and memory,
> it is not done in minutes
>
> I am wondering, am I using the function in the right way?
>
> dim(cntxtn)
> [1] 13584763        1
> uniqueCntxt = unique(cntxtn);    # this is taking really long

What type is cntxtn?  If I do that sort of thing on a numeric vector, 
it's quite fast:

> x <- sample(10, size=13584763, replace=T)
> system.time(unique(x))
   user  system elapsed
   3.61    0.14    3.75



Re: [R] how to efficiently compute set unique?

2010-06-21 Thread David Winsemius


On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:


> On 21/06/2010 9:06 PM, G FANG wrote:
>>
>> Hi,
>>
>> I want to get the unique set from a large numeric k by 1 vector, k is
>> in tens of millions
>>
>> when I used the matlab function unique, it takes less than 10 secs
>>
>> but when I tried to use the unique in R with similar CPU and memory,
>> it is not done in minutes
>>
>> I am wondering, am I using the function in the right way?
>>
>> dim(cntxtn)
>> [1] 13584763        1
>> uniqueCntxt = unique(cntxtn);    # this is taking really long
>
> What type is cntxtn?  If I do that sort of thing on a numeric vector, it's
> quite fast:
>
> > x <- sample(10, size=13584763, replace=T)
> > system.time(unique(x))
>  user  system elapsed
>  3.61    0.14    3.75


If it's a factor, it could be as simple as:

levels(cntxtn)  # since the work of "unique-ification" has already  
been done.


> x <- factor(sample(10, size=13584763, replace=T))
> system.time(levels(x))
   user  system elapsed
  0   0   0
> system.time(y <- levels(x))
   user  system elapsed
  0   0   0


--

David Winsemius, MD
West Hartford, CT



Re: [R] how to efficiently compute set unique?

2010-06-21 Thread Duncan Murdoch

On 21/06/2010 9:06 PM, G FANG wrote:

> Hi,
>
> I want to get the unique set from a large numeric k by 1 vector, k is
> in tens of millions
>
> when I used the matlab function unique, it takes less than 10 secs
>
> but when I tried to use the unique in R with similar CPU and memory,
> it is not done in minutes
>
> I am wondering, am I using the function in the right way?
>
> dim(cntxtn)
> [1] 13584763        1
> uniqueCntxt = unique(cntxtn);    # this is taking really long


What type is cntxtn?  If I do that sort of thing on a numeric vector, 
it's quite fast:


> x <- sample(10, size=13584763, replace=T)
> system.time(unique(x))
  user  system elapsed
  3.61    0.14    3.75
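
One possible way to answer the "what type is it?" question, sketched
with a hypothetical stand-in for cntxtn (the real object is not shown in
the thread, so the small numeric one-column matrix built here is an
assumption). These base R calls report the class, the storage type and
whether a dim attribute is present.

```r
# Hypothetical stand-in for cntxtn: a numeric 1000-by-1 matrix.
cntxtn <- matrix(as.numeric(sample(10, size = 1000, replace = TRUE)), ncol = 1)

print(class(cntxtn))      # class of the object
print(typeof(cntxtn))     # storage type of the elements
print(dim(cntxtn))        # non-NULL means unique() treats it as a matrix
print(is.factor(cntxtn))  # TRUE would mean levels() may be relevant
str(cntxtn)               # compact summary of all of the above
```

If dim() returns something other than NULL, unique() works on rows
rather than on individual elements, which is the slow case in the
original post.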
