Dear R developers,

Recently, I was busy comparing different versions of several packages. Tired of going back and forth between R and diff, I created a simple file comparison function in R that I found quite useful. For an efficient and familiar interface I called it diff.character() and ran things like:

  diff("old/R/foo.R", "new/R/foo.R")

Before long, I found the need for a directory-wide comparison and added support for:

  diff("old/R", "new/R")

I have now revisited and polished this function to a point where I'd like to humbly suggest that diff.character() could be incorporated into the base package. See attached files and patch based on the current SVN trunk. It can be tested quickly by sourcing diff.R, or by building R.

The examples in diff.character.html are somewhat contrived, in the absence of good example files to compare. You will probably have better example files to compare from your own work.

Clearly, the functionality differs considerably from the default diff() method that operates on a single x vector, but in the broad sense, they're both about showing differences. For most programmers, calling diff() on two files or directories is already a part of muscle memory, both intuitive and efficient.

There are a couple of CRAN packages (diffobj, diffR) that can compare files but not directories. They have package dependencies and return objects that are more complex (S4, HTML) than the plain list returned by diff.character().

This basic utility by no means competes with Meld, Kompare, Emacs ediff, or other feature-rich diff applications, and using setdiff() as a basis for file comparison can be a somewhat simplistic approach. Nevertheless, I think many users may find this a handy tool to quickly compare scripts and data files. The method could be implemented differently, with fewer or more features, and I'm happy to amend according to the R Core Team's decision.

In the past, I have proposed additions to core R, some rejected and others accepted. This proposal fits a useful tool into the currently vacant diff.character() method at a low cost, using relatively few lines of base function calls and no compiled code.

Its acceptance will probably depend on whether members of the R Core Team and/or CRAN Team might see it as a useful addition to their toolkit for interactive and scripted workflows, including R and CRAN maintenance.

All the best,
Arni

Compare Files

Description:

     Show differences between files or directories.

Usage:

     ## S3 method for class 'character'
     diff(x, y, file = NULL, ignore = NULL, lines = FALSE, short = TRUE,
          similar = FALSE, simple = TRUE, trimws = FALSE, ...)

Arguments:

       x: a file or directory name.

       y: another file or directory name.

    file: if ‘x’ and ‘y’ are directories, then ‘file’ can be used to
          select a specific file that exists in both directories.

  ignore: patterns (regular expressions) to exclude from the output.

   lines: if ‘x’ and ‘y’ are directories, then ‘lines = TRUE’ compares
          the contents (lines) of files that exist in both directories,
          instead of listing filenames that are different between the
          directories.

   short: whether to produce short file paths for the output.

 similar: whether to show similarities instead of differences.

  simple: whether to replace ‘character(0)’ with ‘NULL’ in output, for
          compact display.

  trimws: whether to trim whitespace and exclude empty strings.

     ...: passed to ‘readLines’.

Details:

     When comparing directories, two kinds of differences can occur:
     (1) filenames existing in one directory and not the other, and (2)
     files containing different lines of text. The purpose of the
     ‘lines’ argument is to select which of those two kinds of
     differences to show.

     If ‘x’ and ‘y’ are files (and not directories), the ‘file’ and
     ‘lines’ arguments are not applicable and will be ignored.

Value:

     List showing differences as strings, or similarities if ‘similar =
     TRUE’.

Note:

     This function uses ‘setdiff’ for the comparison, so line order,
     line numbers, and repeated lines are ignored. Subdirectories are
     excluded when comparing directories.

     This function has very basic features compared to full GUI
     applications such as WinMerge (Windows), Meld (Linux, Windows),
     Kompare (Linux), Ediff (Emacs), or the ‘diff’ shell command. The
     use of full GUI applications is recommended, but what this
     function offers in addition is:

        • a quick diff tool that is handy during an interactive R
          session,

        • a programmatic interface to analyze file differences as
          native R objects, and

        • a tool that works on all platforms, regardless of what
          software may be installed.

     The ‘short’ and ‘simple’ defaults are designed for interactive
     (human-readable) use, while ‘short = FALSE’ and ‘simple = FALSE’
     produce a consistent number of list elements and retain longer
     paths.

See Also:

     ‘diff’ is a generic function. Depending on ‘x’, it will show
     differences between numbers, date-time objects, files,
     directories, etc.

     ‘dir’, ‘readLines’, and ‘setdiff’ are the underlying functions
     performing the file and directory comparison.

Examples:

     ## Not run:

     # Compare two files
     write(c("We", "are", "not"), file = "one.txt")
     write(c("We", "are", "the same"), file = "two.txt")
     diff("one.txt", "two.txt")
     diff("one.txt", "two.txt", similar = TRUE)
     file.remove("one.txt", "two.txt")

     # Another example with two files
     x <- system.file("DESCRIPTION", package = "base")
     y <- system.file("DESCRIPTION", package = "stats")
     diff(x, y)
     diff(x, y, similar = TRUE)

     # Filter out noise
     diff(x, y, ignore = c("Package:", "Title:", "Description:", "Built:"))

     # Compare filenames in two directories
     A <- system.file(package = "base")
     B <- system.file(package = "stats")
     diff(A, B)                  # these filenames are different
     diff(A, B, ignore = "^C")   # exclude entries starting with C
     diff(A, B, similar = TRUE)  # these filenames exist in both directories

     # Compare content of files that exist in both directories
     diff(A, B, lines = TRUE)                  # the INDEX files are very different
     diff(A, B, lines = TRUE, similar = TRUE)  # but not completely different
     diff(A, B, lines = TRUE, n = 20)          # demonstrate passing n to readLines
     diffs <- diff(A, B, lines = TRUE)         # store comparison as list
     names(diffs)                              # these files are different
     str(diffs, vec.len = 1)                   # first difference in each file

     # Alternative format
     diff(A, B, ignore = "^C")                                 # short format
     diff(A, B, ignore = "^C", short = FALSE, simple = FALSE)  # long format

     # Compare one file that exists in both directories
     diff(A, B, "DESCRIPTION")                          # same as diffs$DESCRIPTION
     diff(A, B, "INDEX", similar = TRUE, trimws = TRUE) # trim whitespace

     ## End(Not run)

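The Note in the help page says that the comparison is set-based, so line order and repeated lines are ignored. The following is only a rough sketch of that behaviour using base functions, not the code from the patch. compare_lines() is a hypothetical helper made up for this illustration, and the two temporary files are invented inputs.

```r
## Sketch only: compare_lines() is a hypothetical helper, not part of the patch.
## It mimics the set-based idea: setdiff() for differences, intersect() for
## similarities, so order and repeated lines play no role.
compare_lines <- function(a, b, similar = FALSE)
{
    if (similar)
        list(similar = intersect(a, b))
    else
        list(a = setdiff(a, b), b = setdiff(b, a))
}

## Two small temporary files (invented content)
one <- tempfile(fileext = ".txt")
two <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "beta", "gamma"), one)
writeLines(c("gamma", "beta", "delta"), two)

res <- compare_lines(readLines(one), readLines(two))
res$a  # lines found only in the first file
res$b  # lines found only in the second file

## The repeated "beta" and the different line order do not show up anywhere
sim <- compare_lines(readLines(one), readLines(two), similar = TRUE)
sim$similar

unlink(c(one, two))
```

A line-oriented tool such as the diff shell command would report the duplicated line and the reordering, which is the trade-off the Note describes.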
Index: src/library/base/R/diff.R
===================================================================
--- src/library/base/R/diff.R	(revision 81853)
+++ src/library/base/R/diff.R	(working copy)
@@ -39,3 +39,97 @@
     class(r) <- oldClass(x)
     r
 }
+
+diff.character <- function(x, y, file = NULL, ignore = NULL,
+                           lines = FALSE, short = TRUE, similar = FALSE,
+                           simple = TRUE, trimws = FALSE, ...)
+{
+    ## Calculate A and B entries, containing filenames or lines of text
+    if (dir.exists(x) && dir.exists(y)) {
+        if (is.null(file)) {
+            if (lines) {
+                files <- intersect(dir(x), dir(y))  # excluding subdirs:
+                files <- files[!(files %in% list.dirs(c(x, y),
+                                                      full.names = FALSE))]
+                out <- list()
+                for (f in files) {
+                    out[[f]] <- diff.character(file.path(x, f),
+                        file.path(y, f), ignore = ignore, lines = FALSE,
+                        short = short, similar = similar, simple = simple,
+                        trimws = trimws, ...)
+                }
+                if (simple)
+                    out <- out[!sapply(out, is.null)]
+                return(out)
+            }
+            else {
+                A <- dir(x)  # excluding subdirs:
+                A <- A[!(A %in% list.dirs(x, full.names = FALSE))]
+                B <- dir(y)
+                B <- B[!(B %in% list.dirs(y, full.names = FALSE))]
+            }
+        }
+        else {
+            A <- readLines(file.path(x, file), ...)
+            B <- readLines(file.path(y, file), ...)
+        }
+    }
+    else if (file.exists(x) && file.exists(y)) {
+        A <- readLines(x, ...)
+        B <- readLines(y, ...)
+    }
+    else {
+        if (!file.exists(x))
+            stop("'", x, "' not found")
+        if (!file.exists(y))
+            stop("'", y, "' not found")
+    }
+
+    ## Compare
+    if (trimws) {
+        A <- trimws(A)
+        A <- A[A != ""]
+        B <- trimws(B)
+        B <- B[B != ""]
+    }
+    diffA <- if (similar) intersect(A, B) else setdiff(A, B)
+    diffB <- if (similar) intersect(B, A) else setdiff(B, A)
+    for (i in seq_along(ignore)) {
+        diffA <- grep(ignore[i], diffA, invert = TRUE, value = TRUE)
+        diffB <- grep(ignore[i], diffB, invert = TRUE, value = TRUE)
+    }
+    if (similar) {
+        out <- list(similar = diffA)
+    }
+    else {
+        out <- list(diffA, diffB)
+        names(out) <- if (short) short.name(x, y) else c(x, y)
+    }
+
+    ## Replace character(0) with NULL
+    if (simple)
+    {
+        out[sapply(out, length) == 0] <- NULL
+        if (length(out) == 0)
+            out <- NULL
+    }
+    out
+}
+
+short.name <- function(A, B)
+{
+    ## Convert \\ to /
+    A <- gsub("\\\\", "/", A)
+    B <- gsub("\\\\", "/", B)
+
+    ## Distinguish between three cases
+    ## case 1: identical, nothing to do - only when user runs diff(x, x)
+    ## case 2: basename is unique, use that
+    ## case 3: basename is identical, cut off basename until it's unique
+    if (A == B)
+        c(A, B)
+    else if (basename(A) != basename(B))    # x/y/A.txt & x/y/B.txt
+        c(basename(A), basename(B))         # => A.txt & B.txt
+    else                                    # x/A/y/n.txt & x/B/y/n.txt
+        short.name(dirname(A), dirname(B))  # => A & B
+}
Index: src/library/base/man/diff.Rd
===================================================================
--- src/library/base/man/diff.Rd	(revision 81853)
+++ src/library/base/man/diff.Rd	(working copy)
@@ -32,7 +32,8 @@
 \details{
   \code{diff} is a generic function with a default method and ones for
   classes \code{"\link{ts}"}, \code{"\link{POSIXt}"} and
-  \code{"\link{Date}"}.
+  \code{"\link{Date}"}, as well as \code{\link{diff.character}} to
+  compare files and directories.
 
   \code{\link{NA}}'s propagate.
 }
@@ -55,7 +56,7 @@
   Wadsworth & Brooks/Cole.
 }
 \seealso{
-  \code{\link{diff.ts}}, \code{\link{diffinv}}.
+  \code{\link{diff.character}}, \code{\link{diff.ts}}, \code{\link{diffinv}}.
 }
 \examples{
 diff(1:10, 2)
Index: src/library/base/man/diff.character.Rd
===================================================================
--- src/library/base/man/diff.character.Rd	(nonexistent)
+++ src/library/base/man/diff.character.Rd	(working copy)
@@ -0,0 +1,120 @@
+\name{diff.character}
+\alias{diff.character}
+\title{Compare Files}
+\description{Show differences between files or directories.}
+\usage{
+\method{diff}{character}(x, y, file = NULL, ignore = NULL,
+     lines = FALSE, short = TRUE, similar = FALSE, simple = TRUE,
+     trimws = FALSE, \dots)
+}
+\arguments{
+  \item{x}{a file or directory name.}
+  \item{y}{another file or directory name.}
+  \item{file}{if \code{x} and \code{y} are directories, then \code{file}
+    can be used to select a specific file that exists in both
+    directories.}
+  \item{ignore}{patterns (regular expressions) to exclude from the
+    output.}
+  \item{lines}{if \code{x} and \code{y} are directories, then
+    \code{lines = TRUE} compares the contents (lines) of files that
+    exist in both directories, instead of listing filenames that are
+    different between the directories.}
+  \item{short}{whether to produce short file paths for the output.}
+  \item{similar}{whether to show similarities instead of differences.}
+  \item{simple}{whether to replace \code{character(0)} with \code{NULL}
+    in output, for compact display.}
+  \item{trimws}{whether to trim whitespace and exclude empty strings.}
+  \item{\dots}{passed to \code{readLines}.}
+}
+\details{
+  When comparing directories, two kinds of differences can occur: (1)
+  filenames existing in one directory and not the other, and (2) files
+  containing different lines of text. The purpose of the \code{lines}
+  argument is to select which of those two kinds of differences to show.
+
+  If \code{x} and \code{y} are files (and not directories), the
+  \code{file} and \code{lines} arguments are not applicable and will be
+  ignored.
+}
+\value{
+  List showing differences as strings, or similarities if
+  \code{similar = TRUE}.
+}
+\note{
+  This function uses \code{setdiff} for the comparison, so line order,
+  line numbers, and repeated lines are ignored. Subdirectories are
+  excluded when comparing directories.
+
+  This function has very basic features compared to full GUI
+  applications such as \emph{WinMerge} (Windows), \emph{Meld} (Linux,
+  Windows), \emph{Kompare} (Linux), \emph{Ediff} (Emacs), or the
+  \command{diff} shell command. The use of full GUI applications is
+  recommended, but what this function offers in addition is:
+
+  \itemize{
+    \item a quick diff tool that is handy during an interactive R
+    session,
+    \item a programmatic interface to analyze file differences as native
+    R objects, and
+    \item a tool that works on all platforms, regardless of what
+    software may be installed.
+  }
+
+  The \code{short} and \code{simple} defaults are designed for
+  interactive (human-readable) use, while \code{short = FALSE} and
+  \code{simple = FALSE} produce a consistent number of list elements
+  and retain longer paths.
+}
+\seealso{
+  \code{\link{diff}} is a generic function. Depending on \code{x}, it
+  will show differences between numbers, date-time objects, files,
+  directories, etc.
+
+  \code{\link{dir}}, \code{\link{readLines}}, and \code{\link{setdiff}}
+  are the underlying functions performing the file and directory
+  comparison.
+}
+\examples{
+\dontrun{
+
+# Compare two files
+write(c("We", "are", "not"), file = "one.txt")
+write(c("We", "are", "the same"), file = "two.txt")
+diff("one.txt", "two.txt")
+diff("one.txt", "two.txt", similar = TRUE)
+file.remove("one.txt", "two.txt")
+
+# Another example with two files
+x <- system.file("DESCRIPTION", package = "base")
+y <- system.file("DESCRIPTION", package = "stats")
+diff(x, y)
+diff(x, y, similar = TRUE)
+
+# Filter out noise
+diff(x, y, ignore = c("Package:", "Title:", "Description:", "Built:"))
+
+# Compare filenames in two directories
+A <- system.file(package = "base")
+B <- system.file(package = "stats")
+diff(A, B)                  # these filenames are different
+diff(A, B, ignore = "^C")   # exclude entries starting with C
+diff(A, B, similar = TRUE)  # these filenames exist in both directories
+
+# Compare content of files that exist in both directories
+diff(A, B, lines = TRUE)                  # the INDEX files are very different
+diff(A, B, lines = TRUE, similar = TRUE)  # but not completely different
+diff(A, B, lines = TRUE, n = 20)          # demonstrate passing n to readLines
+diffs <- diff(A, B, lines = TRUE)         # store comparison as list
+names(diffs)                              # these files are different
+str(diffs, vec.len = 1)                   # first difference in each file
+
+# Alternative format
+diff(A, B, ignore = "^C")                                 # short format
+diff(A, B, ignore = "^C", short = FALSE, simple = FALSE)  # long format
+
+# Compare one file that exists in both directories
+diff(A, B, "DESCRIPTION")                          # same as diffs$DESCRIPTION
+diff(A, B, "INDEX", similar = TRUE, trimws = TRUE) # trim whitespace
+}
+}
+\keyword{file}
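
For readers who want to try the directory-level filename comparison without applying the patch, here is a small self-contained sketch. It borrows the patch's dir() plus list.dirs() idiom for dropping subdirectories, but top_level_files() is a hypothetical helper invented for this illustration, and the temporary directories and filenames are made up.

```r
## Sketch only: top_level_files() is a hypothetical helper, not part of the patch.
## It lists the entries of a directory and drops subdirectories, using the same
## dir() / list.dirs() idiom as the patch.
top_level_files <- function(path)
{
    entries <- dir(path)
    entries[!(entries %in% list.dirs(path, full.names = FALSE))]
}

## Two temporary directories with invented content
old <- tempfile("old")
new <- tempfile("new")
dir.create(old)
dir.create(new)
dir.create(file.path(old, "sub"))  # subdirectory, expected to be excluded
file.create(file.path(old, c("a.R", "b.R")))
file.create(file.path(new, c("b.R", "c.R")))

A <- top_level_files(old)
B <- top_level_files(new)

only_old <- setdiff(A, B)    # filenames only in the first directory
only_new <- setdiff(B, A)    # filenames only in the second directory
in_both  <- intersect(A, B)  # filenames present in both directories

unlink(c(old, new), recursive = TRUE)
```

With lines = TRUE, the patch instead loops over the filenames that exist in both directories and compares their contents one file at a time.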
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel