On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
aggregate() is indeed a useful function in this case, but it only
returns the
columns by which it was grouped. Is there a way I can use this
while
simultaneously retaining all the other column values in the
dataframe?
e.g. add superfluous (yet pertinent for later) column containing any
information at all and retain it in the final output
I had exactly this kind of need many times, and I have finally
created a function for it, which I hope to include soon in an
upcoming package. Here is a run of it (I added an extra "A" column
containing just the numbers 1:8):
DF
id day diff A
1 1 01-01-09 0.5 1
2 1 01-01-09 0.7 2
3 2 01-01-09 0.2 3
4 2 01-01-09 0.4 4
5 1 01-02-09 0.1 5
6 1 01-02-09 0.3 6
7 2 01-02-09 0.3 7
8 2 01-02-09 0.4 8
byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
diff A id day
1 0.5 1 1 01-01-09
2 0.2 3 2 01-01-09
3 0.1 5 1 01-02-09
4 0.3 7 2 01-02-09
Would that do what you want?
I've appended the function byDataFrame, and its prerequisite, a
function parseIndexList. I'm not quite set on the names yet, but
anyway. Hope this helps. I haven't really tested it on large sets, it
might perform poorly. Any suggestions on speeding the code /
corrections are welcome.
Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
parseIndexList <- function(indexList) {
# browser()
if (!is.list(indexList))
indexList <- as.list(indexList)
nI <- length(indexList)
namelist <- vector("list", nI)
names(namelist) <- names(indexList)
extent <- integer(nI)
nx <- length(indexList[[1]])
one <- as.integer(1)
group <- rep.int(one, nx)
ngroup <- one
for (i in seq.int(indexList)) {
index <- as.factor(indexList[[i]])
if (length(index) != nx)
stop("arguments must have same length")
namelist[[i]] <- sort(unique(indexList[[i]]))
extent[i] <- length(namelist[[i]])
group <- group + ngroup * (as.integer(index) - one)
ngroup <- ngroup * nlevels(index)
}
nms <- do.call(expand.grid, namelist)
ind <- unique(sort(group))
res <- data.frame(index=ind, nms[ind, , drop=FALSE])
return(list(cases=group, groups=res))
}
byDataFrame <- function (data, INDEX, FUN, newnames,
omit.index.cols=TRUE, ...) {
# # Part of the code shamelessly stolen from tapply
IND <- eval(substitute(INDEX), data)
nms <- as.character(as.list(substitute(INDEX)))
if (!is.list(IND)) {
IND <- list(IND)
names(IND) <- nms
} else {
names(IND) <- nms[-1]
}
funname <- paste(as.character(substitute(FUN)), collapse=".")
indexInfo <- parseIndexList(IND)
FUNx <- if (omit.index.cols) {
omit.cols <- match(names(indexInfo$groups)[-1], names(data))
function(x, ...) FUN(data[x, -omit.cols], ...)
} else {
function(x, ...) FUN(data[x, ], ...)
}
ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
index <- as.numeric(names(ans))
if (!is.data.frame(ans[[1]])) {
ans <- lapply(ans, function(x) {
dframe <- as.data.frame(t(x))
if (is.null(names(x)))
names(dframe) <- funname
dframe
})
}
lengths <- sapply(ans, nrow)
ans <- do.call(rbind, ans)
if (!missing(newnames))
names(ans) <- newnames
nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
res <- cbind(ans, nms)
res
}