Chuck et al.:

As I said previously, my intuition about the relative efficiency of
tapply() and duplicated() in the context of this thread was wrong. But
I wondered exactly how, and to what extent, so I've fooled around a bit
more and think I understand. Using the example I gave, the key is to
avoid both the duplicated.data.frame method and the inner data.frame
subscripting: build a single factor with interaction() inside with(),
so that the duplicated.default method is used instead (building the
key with paste() takes extra time):

> system.time(z <-with(df,df[!duplicated(interaction(f,g),fromLast = TRUE),]))
   user  system elapsed
  0.039   0.006   0.045
>
> system.time(
+   {ix <- seq_len(nrow(df));
+    z <- with(df,df[tapply(ix,list(f,g),function(x)x[length(x)]),])
+    })
   user  system elapsed
  0.025   0.005   0.029
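
A minimal sketch of how one might check that the two approaches timed
above select the same rows. The smaller size n = 1e5 is an assumption
made here for a quick check (the thread used 1e6), and the names
rows_dup and rows_tap are illustrative only.

```r
set.seed(1001)
n <- 1e5  # assumed size for a quick check; the thread used 1e6
df <- data.frame(f = factor(sample(LETTERS[1:4], n, replace = TRUE)),
                 g = factor(sample(letters[1:6], n, replace = TRUE)),
                 y = runif(n))

# Index of the last row in each (f, g) cell, via duplicated.default
# applied to a single factor built by interaction()
rows_dup <- which(!duplicated(interaction(df$f, df$g), fromLast = TRUE))

# The same indices via tapply() on the row numbers
ix <- seq_len(nrow(df))
rows_tap <- as.vector(tapply(ix, list(df$f, df$g),
                             function(x) x[length(x)]))

# duplicated() yields indices in row order, tapply() in cell order,
# so sort before comparing
same <- isTRUE(all.equal(rows_dup, sort(rows_tap)))
print(same)
```

Note that the row order of the two results differs: subscripting with
the tapply() result returns rows in (f, g) cell order, whereas the
duplicated() version keeps the original row order.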


tapply() still appears to be faster (which still surprises me), but
only slightly.
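
The remark about paste() can be checked with a sketch along these
lines. It times three ways of building the duplicated() key: the
data.frame method, interaction(), and paste(). The size n = 1e5 and
the names t_df, t_int, t_paste and timings are assumptions made for
illustration; timings vary with machine and R version, so no expected
numbers are given.

```r
set.seed(1001)
n <- 1e5  # assumed size; raise to 1e6 to match the thread
df <- data.frame(f = factor(sample(LETTERS[1:4], n, replace = TRUE)),
                 g = factor(sample(letters[1:6], n, replace = TRUE)),
                 y = runif(n))

# duplicated.data.frame on the two-column subset
t_df <- system.time(
  k_df <- !duplicated(df[, c("f", "g")], fromLast = TRUE))

# duplicated.default on a single factor from interaction()
t_int <- system.time(
  k_int <- with(df, !duplicated(interaction(f, g), fromLast = TRUE)))

# duplicated.default on a character key from paste()
t_paste <- system.time(
  k_paste <- with(df, !duplicated(paste(f, g), fromLast = TRUE)))

# All three keys should flag the same rows
stopifnot(all(k_df == k_int), all(k_int == k_paste))

# Elapsed seconds for each variant
timings <- c(data.frame  = t_df[["elapsed"]],
             interaction = t_int[["elapsed"]],
             paste       = t_paste[["elapsed"]])
print(timings)
```

A single system.time() call is a rough measure; repeating each
expression several times gives a steadier comparison.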


Hope this is informative.


Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Sep 2, 2016 at 1:48 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> Chuck:
>
> I think this is quite clever. But note that the which() is
> unnecessary: logical indexing suffices, e.g.
>
> df[!duplicated(df[,c("f","g")],fromLast = TRUE),]
>
> I thought that your approach would be faster because it moves
> comparisons from the tapply() to C code. But I was wrong. e.g. for 1e6
> rows:
>
>  > set.seed(1001)
>  > df <- data.frame(f = factor(sample(LETTERS[1:4],1e6,rep=TRUE)),
>     +             g = factor(sample(letters[1:6],1e6,rep=TRUE)),
>     +             y = runif(1e6))
>
> ##using duplicated()
>  > system.time(z <-df[!duplicated(df[,c("f","g")],fromLast = TRUE),])
> user  system elapsed
> 0.175   0.008   0.183
>
> ## Using tapply()
>  > system.time(
>     + {ix <- seq_len(nrow(df));
>     + z <- df[with(df,tapply(ix,list(f,g),function(x)x[length(x)])),]
>     + })
> user  system elapsed
> 0.025   0.003   0.028
>
>
> This illustrates the faultiness of my "intuition."  My guess is that
> the subscripting to get the factor combinations and the
> duplicated.data.frame method take the extra time.
>
> Anyway...
>
> Best,
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Sep 2, 2016 at 11:50 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
>> On Fri, 2 Sep 2016, Bert Gunter wrote:
>> [snip]
>>>
>>>
>>> The "trick" is to use tapply() to select the necessary row indices of
>>> your data frame and forget about all the do.call and rbind stuff. e.g.
>>>
>>
>> I agree the way to go is "select the necessary row indices" but I get there
>> a different way. See below.
>>
>>>> set.seed(1001)
>>>> df <- data.frame(f = factor(sample(LETTERS[1:4],100,rep=TRUE)),
>>> +                  g = factor(sample(letters[1:6],100,rep=TRUE)),
>>> +                  y = runif(100))
>>>>
>>>>
>>>> ix <- seq_len(nrow(df))
>>>>
>>>> ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
>>>> ix
>>>
>>>   a  b   c  d  e  f
>>> A 94 69 100 59 80 87
>>> B 89 57  65 90 75 88
>>> C 85 92  86 95 97 62
>>> D 47 73  72 74 99 96
>>
>>
>>
>>   jx <- which( !duplicated( df[,c("f","g")], fromLast=TRUE ))
>>
>>   xtabs(jx~f+g,df[jx,]) ## Show equivalence to Bert's `ix'
>>
>>    g
>> f     a   b   c   d   e   f
>>   A  94  69 100  59  80  87
>>   B  89  57  65  90  75  88
>>   C  85  92  86  95  97  62
>>   D  47  73  72  74  99  96
>>
>>
>> Chuck
>>
>>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
