Your proposed change (roughly, replacing interaction() by unique(paste())) slows down ave() considerably when there are long columns with lots of repeated rows.
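A minimal sketch of how the two grouping strategies could be compared on long columns with many repeated rows. The toy data and the names g1, g2, x, res_interaction and res_paste are assumptions made for illustration only; they do not come from the patch under discussion, and the timings will vary by machine and R version.

```r
# Hypothetical toy data: long columns, few distinct (g1, g2) combinations.
set.seed(1)
n  <- 1e5
g1 <- sample(1:10, n, TRUE)
g2 <- sample(1:3, n, TRUE)
x  <- as.numeric(sample(1:300, n, TRUE))

# Current behaviour: ave() combines the grouping vectors with interaction().
t_interaction <- system.time(res_interaction <- ave(x, g1, g2))

# Approximation of the proposal: build one string key per row with paste()
# and pass it to ave() as a single grouping vector.
t_paste <- system.time(res_paste <- ave(x, paste(g1, g2, sep = "\r")))

# Both keys partition the rows identically, so the group means must agree.
stopifnot(isTRUE(all.equal(res_interaction, res_paste)))

# Elapsed seconds for each strategy (no particular values are implied).
print(c(interaction = t_interaction[["elapsed"]],
        paste       = t_paste[["elapsed"]]))
```

The comparison only measures the cost of building the grouping key; FUN is the default mean() in both calls.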
I think that interaction(drop=TRUE, ...) can be changed to use less memory
and be faster by making a separate branch for drop=TRUE that uses the
following idiom for finding the unique rows in a data.frame:

    new.duplicated.data.frame <- function (x, incomparables = FALSE,
                                           fromLast = FALSE, ...)
    {
        # all entries considered duplicated until proven otherwise
        dup <- !logical(nrow(x))
        for(column in x) {
            dup <- dup & duplicated(column, incomparables = incomparables,
                                    fromLast = fromLast)
        }
        dup
    }

ave() could use the above directly or it could call
interaction(drop=TRUE,...).

On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas <thomas.soe...@ap-hm.fr> wrote:
>
> Dear all,
>
> Thank you for your consideration on this topic.
>
> I do not have enough knowledge of R internals to join the discussion about
> sorting mechanisms. In fact, I did not get how ordering could help for ave as
> the output must maintain the order of the input (because ave returns only x
> and not the entire data.frame).
>
> However, while the proposed workaround (i.e. paste0 instead of interaction,
> cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not
> solve the "bigger problem" of sorting, it is usable as is and solves the
> issue. Therefore, what do you think about it? (i.e. is it relevant for a
> patch?)
>
> Thanks,
>
> Thomas
>
>
> > ________________________________________
> > From: Abby Spurdle <spurdl...@gmail.com>
> > Sent: Monday, 15 March 2021 10:22
> > To: SOEIRO Thomas
> > Cc: r-devel@r-project.org
> > Subject: Re: [Rd] Potential improvements of ave?
> >
> > Hi Thomas,
> >
> > These are some great suggestions.
> > But I can't help but feel there's a much bigger problem here.
> >
> > Intuitively, the ave function could (or should) sort the data.
> > Then the indexing step becomes almost trivial, in terms of both time
> > and space complexity.
> > And the ave function is not the only example of where a problem
> > becomes much simpler, if the data is sorted.
> >
> > Historically, I've never found base R functions user-friendly for
> > aggregation purposes, or for sorting.
> > (At least, not by comparison to SQL).
> >
> > But that's not the main problem.
> > It would seem preferable to sort the data, only once.
> > (Rather than sorting it repeatedly, or not at all).
> >
> > Perhaps, objects such as vectors and data.frame(s) could have a
> > boolean attribute, to indicate if they're sorted.
> > Or functions such as ave could have a sorted argument.
> > In either case, if true, the function assumes the data is sorted and
> > applies a more efficient algorithm.
> >
> >
> > B.
> >
> >
> > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <thomas.soe...@ap-hm.fr>
> > wrote:
> >>
> >> Dear all,
> >>
> >> I have two questions/suggestions about ave, but I am not sure if it's
> >> relevant for bug reports.
> >>
> >>
> >>
> >> 1) I have performance issues with ave in a case where I didn't expect it.
> >> The following code runs as expected:
> >>
> >> set.seed(1)
> >>
> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >>                   id2 = sample(1:3, 5e2, TRUE),
> >>                   id3 = sample(1:5, 5e2, TRUE),
> >>                   val = sample(1:300, 5e2, TRUE))
> >>
> >> df1$diff <- ave(df1$val,
> >>                 df1$id1,
> >>                 df1$id2,
> >>                 df1$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> head(df1[order(df1$id1,
> >>                df1$id2,
> >>                df1$id3), ])
> >>
> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> >> allocate vector of size 1110.0 Gb):
> >>
> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >>                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >>
> >> df2$diff <- ave(df2$val,
> >>                 df2$id1,
> >>                 df2$id2,
> >>                 df2$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> This use case does not seem extreme to me (e.g. aggregate et al work
> >> perfectly on this data.frame).
> >> So my question is: Is this expected/intended/reasonable? i.e. Does ave
> >> need to be optimized?
> >>
> >>
> >>
> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
> >> avoid warnings in case of unused levels
> >> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> >> Is it relevant/possible to expose the drop argument explicitly?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Thomas
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
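A caveat on the column-wise idiom near the top of this message, offered as a sketch rather than a definitive fix: ANDing duplicated() over the columns flags a row whenever each of its values has been seen earlier in its own column, which is necessary but not sufficient for the whole row to have been seen before. The example below copies the function as posted and compares it with base duplicated() on a small made-up data.frame. The helper row_group_id() is hypothetical (it is not part of base R or of any proposed patch) and shows one possible way to build an integer row key column by column without pasting strings; it assumes at least one row and few enough rows that the intermediate products stay exactly representable as doubles.

```r
# Copied from the message above.
new.duplicated.data.frame <- function(x, incomparables = FALSE,
                                      fromLast = FALSE, ...) {
    dup <- !logical(nrow(x))
    for (column in x) {
        dup <- dup & duplicated(column, incomparables = incomparables,
                                fromLast = fromLast)
    }
    dup
}

# Made-up example: row 3 reuses a value of `a` from row 1 and a value of `b`
# from row 2, but the combination (1, 1) has not occurred before.
d <- data.frame(a = c(1, 2, 1), b = c(2, 1, 1))

print(new.duplicated.data.frame(d))  # row 3 is flagged
print(duplicated(d))                 # no row is flagged

# Hypothetical alternative: refine an integer group id one column at a time.
row_group_id <- function(x) {
    id <- rep.int(1L, nrow(x))
    for (column in x) {
        code <- match(column, unique(column))   # integer code within this column
        pair <- (id - 1) * max(code) + code     # unique number per (id, code) pair
        id   <- match(pair, unique(pair))       # renumber compactly as 1..k
    }
    id
}

# Rows are duplicates exactly when their group ids repeat.
stopifnot(identical(duplicated(row_group_id(d)), duplicated(d)))
```

The resulting ids could also serve directly as the grouping vector for split(), which is the role interaction(drop=TRUE, ...) plays inside ave().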