I made a couple of changes from the previous version:
 - don't use the functions anyMissing or notSorted (which aren't in base R)
 - don't check for the dup.row.names attribute (other functions need to be
   modified before that is useful)

I have not tested this with a wide variety of inputs; I'm assuming that
you have some regression tests.
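As a rough illustration of the shortcut test written with base R only, here is a standalone sketch. The function name canSkipDuplicateCheck is made up for this example and is not part of the patch or of base R; the body copies the condition used in the patch, with any(is.na(i)) and a pairwise comparison standing in for anyMissing and notSorted. The loop compares the answer with what duplicated() finds on the row names of a small data frame.

```r
# Hypothetical helper (name invented for this sketch): TRUE when the row
# subscript i cannot produce duplicate row names, so the call to
# duplicated(rows) can be skipped.
canSkipDuplicateCheck <- function(i) {
    is.logical(i) ||                                     # logical subscript
        length(i) < 2 ||                                 # at most one row
        (is.numeric(i) && min(i, 0, na.rm = TRUE) < 0) ||  # negative
        (!any(is.na(i)) && all(i[-length(i)] < i[-1]))   # strictly increasing
}

x <- data.frame(a = 1:4, b = 2:5)
subscripts <- list(1:3, c(TRUE, FALSE, FALSE, TRUE), -1, 3:1, c(1, 3, 2, 4, 3))
for (i in subscripts) {
    rn <- row.names(x)[i]
    cat(deparse(i), ": skip =", canSkipDuplicateCheck(i),
        " duplicates =", any(duplicated(rn)), "\n")
}
```

The test is deliberately conservative: for 3:1 it reports that the check cannot be skipped even though no duplicates occur, and only c(1, 3, 2, 4, 3) actually produces a duplicate.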
Here are the file differences.  Let me know if you'd like a different
format.

$ diff -c dataframe.R dataframe2.R
*** dataframe.R     Thu Jul  3 15:48:12 2008
--- dataframe2.R    Thu Jul  3 16:36:46 2008
***************
*** 530,535 ****
--- 530,541 ----
        x <- .Call("R_copyDFattr", xx, x, PACKAGE="base")
        oldClass(x) <- attr(x, "row.names") <- NULL

+       # Do not want to check for duplicates if don't need to
+       noDuplicateRowNames <- (is.logical(i) ||
+                               length(i) < 2 ||
+                               (is.numeric(i) && min(i, 0, na.rm=TRUE) < 0) ||
+                               (!any(is.na(i)) && all(i[-length(i)]<i[-1])))
+
        if(!missing(j)) { # df[i, j]
            x <- x[j]
            cols <- names(x)  # needed for 'drop'
***************
*** 579,592 ****
        ## row names might have NAs.
        if(is.null(rows)) rows <- attr(xx, "row.names")
        rows <- rows[i]
!       if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
!           ## both will coerce integer 'rows' to character:
!           if (!dup && is.character(rows)) dup <- "NA" %in% rows
!           if(ina)
!               rows[is.na(rows)] <- "NA"
!           if(dup)
!               rows <- make.unique(as.character(rows))
!       }
        ## new in 1.8.0 -- might have duplicate columns
        if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
        if(is.null(rows)) rows <- attr(xx, "row.names")[i]
--- 585,594 ----
        ## row names might have NAs.
        if(is.null(rows)) rows <- attr(xx, "row.names")
        rows <- rows[i]
!       if(any(is.na(rows)))
!           rows[is.na(rows)] <- "NA" # coerces to character
!       if(!noDuplicateRowNames && any(duplicated(rows)))
!           rows <- make.unique(as.character(rows)) # coerces to character
        ## new in 1.8.0 -- might have duplicate columns
        if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
        if(is.null(rows)) rows <- attr(xx, "row.names")[i]


Here's some code for testing, and timings:

# Use:
#       R --no-init-file --no-site-file

x <- data.frame(a=1:4, b=2:5)

# Run these commands with the default and new versions of [.data.frame
trace("duplicated")
trace("make.unique")
x[2:1]
x[1]
x[1:2]
x[1:3, ]                        # save one call to duplicated(rows)
x[c(TRUE,FALSE,FALSE,TRUE), ]   # save one call to duplicated(rows)
x[-1,]                          # save one call to duplicated(rows)
x[-(1:2),]                      # save one call to duplicated(rows)
x[3:1, ]
x[c(1,3,2,4,3), ]
untrace("duplicated")
untrace("make.unique")

# Timings
# Run one of these lines, then everything afterward
n <- 10^5
n <- 10^6
n <- 10^7

y <- data.frame(a=1:n, b=1:n)

i <- 1:n
system.time(temp <- y[i, ])
#   n      old     new
# 10^5    .128    .052
# 10^6    .237    .591
# 10^7   3.10    2.882

i <- rep(TRUE, n)
system.time(temp <- y[i, ])
#   n      old     new
# 10^5    .157    .053
# 10^6    .787    .449
# 10^7   3.799   2.138

i <- -1
system.time(temp <- y[i, ])
#   n      old     new
# 10^5    .157    .051
# 10^6    .614    .497
# 10^7   4.163   2.482

i <- rep(1:(n/2), 2)  # expect no speedup for this case
system.time(temp <- y[i, ])
#   n      old     new
# 10^5    .559    .782
# 10^6   6.066   6.078

# Times shown are the user times reported by system.time.
# The time savings are mostly quite substantial in the
# cases where I expect savings.
# I've noticed a lot of variability in results from system.time,
# so I don't view these as very accurate, and I don't worry
# much about the cases where the time appears worse.


On Thu, Jul 3, 2008 at 1:08 PM, Martin Maechler <[EMAIL PROTECTED]> wrote:

> >>>>> "TH" == Tim Hesterberg <[EMAIL PROTECTED]>
> >>>>>     on Tue, 1 Jul 2008 15:23:53 -0700 writes:
>
>     TH> There is a bug in the standard version of [.data.frame;
>     TH> it mixes up handling duplicates and NAs when subscripting rows.
>
>     TH> x <- data.frame(x=1:3, y=2:4, row.names=c("a","b","NA"))
>     TH> y <- x[c(2:3, NA),]
>     TH> y
>
>     TH> It creates a data frame with duplicate rows, but won't print.
>
> and that's a bug, indeed
> ("introduced" to R version 2.5.0, when the [.data.frame code was much
> optimized for speed, with quite some care), and I have committed
> a fix (and a regression test) to both R-devel and R-patched.
>
> Thanks a lot for the bug report, Tim!
>
> Now about your newly proposed code:
> I'm sorry to say that it looks so much different from the source
> code in
>  https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R
> that I don't think we would accept it as a substitute, easily.
>
> Could you try to provide a minimal patch against the source code
> and also a self-contained example that exhibits the speed gain
> you are aiming for?
>
> Best regards,
> Martin Maechler, ETH Zurich
>
> [.........................]
>
>
>     TH> On Tue, Jul 1, 2008 at 11:20 AM, Tim Hesterberg <[EMAIL PROTECTED]>
>     TH> wrote:
>
>     >> Below is a version of [.data.frame that is faster
>     >> for subscripting rows of large data frames; it avoids calling
>     >> duplicated(rows)
>     >> if there is no need to check for duplicate row names, when:
>     >> i is logical
>     >> attr(x, "dup.row.names") is not NULL (S+ compatibility)
>     >> i is numeric and negative
>     >> i is strictly increasing

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel