The bottleneck in ave is the call to interaction, not the calls to 
split/lapply.
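
To illustrate why interaction is the costly step: by default it creates one 
level for every combination in the cross product of the grouping factors, 
whether or not that combination occurs in the data. A minimal sketch with 
made-up vectors (id1 and id2 below are illustrative, not the columns of df2):

```r
# Two small grouping vectors: 4 observations, 3 distinct values each
id1 <- c(1, 1, 2, 3)
id2 <- c(10, 10, 20, 30)

g_full <- interaction(id1, id2)               # default: drop = FALSE
g_drop <- interaction(id1, id2, drop = TRUE)  # keep observed combinations only

n_full <- nlevels(g_full)  # 3 * 3 = 9 levels, although only 3 combinations occur
n_drop <- nlevels(g_drop)  # 3 levels
```

For df2 the cross product is on the order of 1e6 * 3 * 5e4 = 1.5e11 levels, 
which seems consistent with the size of the failed allocation reported in the 
original message.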

Therefore, the following code runs as expected (but I may be missing something...):

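ave2 <- function (x, ..., FUN = mean)
{
    if(missing(...))
        x[] <- FUN(x)
    else {
        # ave uses: g <- interaction(...)
        # Paste the keys instead; the separator keeps e.g. (1, 23) and
        # (12, 3) from collapsing into the same group "123"
        g <- paste(..., sep = ".")
        split(x,g) <- lapply(split(x, g), FUN)
    }
    x
}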

df2$diff <- ave2(df2$val,
                 df2$id1,
                 df2$id2,
                 df2$id3,
                 FUN = function(i) c(diff(i), 0))



Of course I can also simply solve my current issue with:

df2$id123 <- paste(df2$id1,
                   df2$id2,
                   df2$id3,
                   sep = ".")  # separator keeps distinct id combinations distinct

df2$diff <- ave(df2$val,
                df2$id123,
                FUN = function(i) c(diff(i), 0))
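
One caveat when building a grouping key by pasting ids together: without a 
separator, different combinations can produce the same string and would then 
be merged into one group. A small sketch with made-up values (k1 and k2 are 
illustrative names, not columns of df2):

```r
k1 <- c(1, 12)
k2 <- c(23, 3)

key_nosep <- paste0(k1, k2)            # "123" "123": two groups share one key
key_sep   <- paste(k1, k2, sep = ".")  # "1.23" "12.3": keys stay distinct
```

Any separator that cannot occur inside the ids themselves would do; "." is 
just the default that interaction uses for its level labels.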



In addition, ave2 also avoids warnings in case of unused levels (see point 2 in 
my previous message).
________________________________________
From: SOEIRO Thomas
Sent: Friday, 12 March 2021 23:59
To: r-devel@r-project.org
Subject: Potential improvements of ave?

Dear all,

I have two questions/suggestions about ave, but I am not sure whether they are 
appropriate for a bug report.



1) I ran into performance issues with ave in a case where I did not expect 
any. The following code runs as expected:

set.seed(1)

df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

df1$diff <- ave(df1$val,
                df1$id1,
                df1$id2,
                df1$id3,
                FUN = function(i) c(diff(i), 0))

head(df1[order(df1$id1,
               df1$id2,
               df1$id3), ])
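
For readers unfamiliar with the FUN used here, a tiny hand-checkable sketch of 
what it computes within each group (val and grp are made-up values, not taken 
from df1): each element gets the difference to the next element of the same 
group, and the last element of each group gets 0.

```r
val <- c(5, 8, 6, 10, 3)
grp <- c("a", "a", "b", "a", "b")

# Group "a" holds 5, 8, 10 (positions 1, 2, 4); group "b" holds 6, 3
# (positions 3, 5). Results are written back to the original positions.
res <- ave(val, grp, FUN = function(i) c(diff(i), 0))
res  # 3 2 -3 0 0
```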

But when the data.frame is scaled up by a factor of 1e4, ave fails (Error: 
cannot allocate vector of size 1110.0 Gb):

df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
                  id2 = sample(1:3, 5e2 * 1e4, TRUE),
                  id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
                  val = sample(1:300, 5e2 * 1e4, TRUE))

df2$diff <- ave(df2$val,
                df2$id1,
                df2$id2,
                df2$id3,
                FUN = function(i) c(diff(i), 0))

This use case does not seem extreme to me (e.g. aggregate et al. work perfectly 
on this data.frame).
So my question is: is this expected/intended/reasonable? In other words, does 
ave need to be optimized?



2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid 
warnings in case of unused levels 
(https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
Is it relevant/possible to expose the drop argument explicitly?
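
As a rough sketch of what exposing drop could look like: ave_drop below is a 
hypothetical name and signature, not part of stats, and it simply forwards 
drop to interaction. The factor f is made up and has one unused level.

```r
# Hypothetical variant of ave() with an explicit 'drop' argument
# (an assumption for illustration, not the signature of stats::ave)
ave_drop <- function(x, ..., FUN = mean, drop = TRUE) {
    if (missing(...)) {
        x[] <- FUN(x)
    } else {
        g <- interaction(..., drop = drop)  # drop = TRUE removes unused levels
        split(x, g) <- lapply(split(x, g), FUN)
    }
    x
}

f <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))  # "c" is unused
res <- ave_drop(c(1, 4, 2, 7), f, FUN = function(i) c(diff(i), 0))
res  # 3 0 5 0
```

With drop = TRUE, FUN is never called on an empty group, so the unused level 
"c" plays no role in the result.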



Thanks,

Thomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
