[Rd] Proposed diff.character() method

Arni Magnusson Tue, 08 Mar 2022 15:53:20 -0800

Dear R developers,

Recently, I was busy comparing different versions of several packages.
Tired of going back and forth between R and diff, I created a simple
file comparison function in R that I found quite useful. For an
efficient and familiar interface I called it diff.character() and ran
things like:


  diff("old/R/foo.R", "new/R/foo.R")

Before long, I found the need for a directory-wide comparison and
added support for:

  diff("old/R", "new/R")

I have now revisited and polished this function to a point where
I'd like to humbly suggest that diff.character() could be incorporated
into the base package. See attached files and patch based on the
current SVN trunk. It can be tested quickly by sourcing diff.R, or by
building R.

The examples in diff.character.html are somewhat contrived, in the
absence of good example files to compare. You will probably have
better example files to compare from your own work.

Clearly, the functionality differs considerably from the default
diff() method that operates on a single x vector, but in the broad
sense, they're both about showing differences. For most programmers,
calling diff() on two files or directories is already a part of muscle
memory, both intuitive and efficient.

There are a couple of CRAN packages (diffobj, diffr) that can compare
files but not directories. They have package dependencies and return
objects that are more complex (S4, HTML) than the plain list returned
by diff.character().

This basic utility by no means competes with Meld, Kompare, Emacs
ediff, or other feature-rich diff applications, and using setdiff() as
a basis for file comparison can be a somewhat simplistic approach.
Nevertheless, I think many users may find this a handy tool to quickly
compare scripts and data files. The method could be implemented
differently, with fewer or more features, and I'm happy to amend it
according to the R Core Team's decision.

In the past, I have proposed additions to core R, some rejected and
others accepted. This proposal fills the currently vacant
diff.character() method with a useful tool at low cost, using relatively
few lines of base function calls and no compiled code. Its acceptance will
probably depend on whether members of the R Core Team and/or CRAN Team
might see it as a useful addition to their toolkit for interactive and
scripted workflows, including R and CRAN maintenance.

All the best,
Arni

Compare Files

Description:

     Show differences between files or directories.

Usage:

     ## S3 method for class 'character'
     diff(x, y, file = NULL, ignore = NULL, lines = FALSE, short = TRUE,
          similar = FALSE, simple = TRUE, trimws = FALSE, ...)

Arguments:

       x: a file or directory name.

       y: another file or directory name.

    file: if ‘x’ and ‘y’ are directories, then ‘file’ can be used to
          select a specific file that exists in both directories.

  ignore: patterns (regular expressions) to exclude from the output.

   lines: if ‘x’ and ‘y’ are directories, then ‘lines = TRUE’ compares
          the contents (lines) of files that exist in both directories,
          instead of listing filenames that are different between the
          directories.

   short: whether to produce short file paths for the output.

 similar: whether to show similarities instead of differences.

  simple: whether to replace ‘character(0)’ with ‘NULL’ in output, for
          compact display.

  trimws: whether to trim whitespace and exclude empty strings.

     ...: passed to ‘readLines’.
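
     As a rough illustration of ‘ignore’, the sketch below mirrors how
     the attached patch applies each pattern in turn with
     ‘grep(invert = TRUE, value = TRUE)’. The sample lines and patterns
     are made up for illustration and are not taken from a real file.

```r
# Hypothetical lines that a comparison might report as different
lines <- c("Package: base",
           "Version: 4.2.0",
           "Title: The R Base Package",
           "License: Part of R 4.2.0")

# Patterns (regular expressions) to exclude from the output
ignore <- c("^Package:", "^Title:")

# Drop every line matching any of the patterns, one pattern at a time
kept <- lines
for (pattern in ignore) {
    kept <- grep(pattern, kept, invert = TRUE, value = TRUE)
}

print(kept)
```

     Because the patterns are regular expressions, anchoring them with
     ‘^’ avoids dropping lines that merely mention the same word.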

Details:

     When comparing directories, two kinds of differences can occur:
     (1) filenames existing in one directory and not the other, and
     (2) files containing different lines of text. The purpose of the
     ‘lines’ argument is to select which of those two kinds of
     differences to show.

     If ‘x’ and ‘y’ are files (and not directories), the ‘file’ and
     ‘lines’ arguments are not applicable and will be ignored.
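
     A minimal sketch of the two kinds of differences, using only
     ‘dir’, ‘readLines’, and ‘setdiff’ rather than ‘diff.character’
     itself. The directory and file names are hypothetical and are
     created under a temporary path, so nothing outside the R session's
     temporary directory is touched.

```r
# Build two small directories; all names here are hypothetical
old <- tempfile("old")
new <- tempfile("new")
dir.create(old)
dir.create(new)
writeLines(c("f <- function(x) x", "g <- 1"), file.path(old, "foo.R"))
writeLines(c("f <- function(x) x", "g <- 2"), file.path(new, "foo.R"))
writeLines("only in old", file.path(old, "bar.R"))

# Kind 1: filenames existing in one directory and not the other
names_old <- setdiff(dir(old), dir(new))
names_new <- setdiff(dir(new), dir(old))

# Kind 2: differing lines in files that exist in both directories
shared <- intersect(dir(old), dir(new))
lines_diff <- lapply(shared, function(f) {
    a <- readLines(file.path(old, f))
    b <- readLines(file.path(new, f))
    list(old = setdiff(a, b), new = setdiff(b, a))
})
names(lines_diff) <- shared

print(names_old)
print(names_new)
str(lines_diff)

# Clean up the temporary directories
unlink(c(old, new), recursive = TRUE)
```

     In terms of the proposed interface, the first kind corresponds to
     the default ‘lines = FALSE’ and the second to ‘lines = TRUE’.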

Value:

     List showing differences as strings, or similarities if
     ‘similar = TRUE’.

Note:

     This function uses ‘setdiff’ for the comparison, so line order,
     line numbers, and repeated lines are ignored. Subdirectories are
     excluded when comparing directories.
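
     The consequence of relying on ‘setdiff’ can be seen in a small
     sketch. The two character vectors below stand in for the lines of
     two files and are invented for illustration.

```r
# Two hypothetical "files", represented as character vectors of lines
old <- c("a <- 1", "b <- 2", "b <- 2", "c <- 3")
new <- c("c <- 3", "a <- 1", "d <- 4")

only_old <- setdiff(old, new)  # lines found only in old
only_new <- setdiff(new, old)  # lines found only in new
both <- intersect(old, new)    # lines found in both, as with similar = TRUE

print(only_old)  # the repeated line is reported once
print(only_new)
print(both)      # reordering of shared lines is not reported as a difference
```

     A line that moves within a file, or that is repeated a different
     number of times, therefore does not show up as a difference.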

     This function has very basic features compared to full-featured
     diff tools such as WinMerge (Windows), Meld (Linux,
     Windows), Kompare (Linux), Ediff (Emacs), or the ‘diff’ shell
     command. The use of such tools is recommended, but what
     this function offers in addition is:

        • a quick diff tool that is handy during an interactive R
          session,

        • a programmatic interface to analyze file differences as
          native R objects, and

        • a tool that works on all platforms, regardless of what
          software may be installed.

     The ‘short’ and ‘simple’ defaults are designed for interactive
     (human-readable) use, while ‘short = FALSE’ and ‘simple = FALSE’
     produce a consistent number of list elements and retain longer
     paths.
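
     To make the effect of ‘simple’ concrete, the sketch below applies
     the same kind of cleanup to a hand-made list. The list contents
     are hypothetical, and the code is an approximation of the step in
     the attached patch, not a call to ‘diff.character’.

```r
# Hypothetical raw result: differences were found only in the first file
out <- list(one.txt = "not", two.txt = character(0))

# simple = TRUE style: drop zero-length elements, then collapse an empty list
simple_out <- out
simple_out[lengths(simple_out) == 0] <- NULL
if (length(simple_out) == 0)
    simple_out <- NULL

str(out)         # simple = FALSE style: always two elements
str(simple_out)  # only the non-empty element remains
```

     Scripts that index the result by position may prefer the
     ‘simple = FALSE’ form, since the number of elements does not
     depend on what was found.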

See Also:

     ‘diff’ is a generic function. Depending on ‘x’, it will show
     differences between numbers, date-time objects, files,
     directories, etc.

     ‘dir’, ‘readLines’, and ‘setdiff’ are the underlying functions
     performing the file and directory comparison.

Examples:

## Not run:

# Compare two files
write(c("We", "are", "not"), file = "one.txt")
write(c("We", "are", "the same"), file = "two.txt")
diff("one.txt", "two.txt")
diff("one.txt", "two.txt", similar = TRUE)
file.remove("one.txt", "two.txt")

# Another example with two files
x <- system.file("DESCRIPTION", package = "base")
y <- system.file("DESCRIPTION", package = "stats")
diff(x, y)
diff(x, y, similar = TRUE)

# Filter out noise
diff(x, y, ignore = c("Package:", "Title:", "Description:", "Built:"))

# Compare filenames in two directories
A <- system.file(package = "base")
B <- system.file(package = "stats")
diff(A, B)                  # these filenames are different
diff(A, B, ignore = "^C")   # exclude entries starting with C
diff(A, B, similar = TRUE)  # these filenames exist in both directories

# Compare content of files that exist in both directories
diff(A, B, lines = TRUE)            # the INDEX files are very different
diff(A, B, lines = TRUE, similar = TRUE)  # but not completely different
diff(A, B, lines = TRUE, n = 20)    # demonstrate passing n to readLines
diffs <- diff(A, B, lines = TRUE)   # store comparison as list
names(diffs)                        # these files are different
str(diffs, vec.len = 1)             # first difference in each file

# Alternative format
diff(A, B, ignore = "^C")                                 # short format
diff(A, B, ignore = "^C", short = FALSE, simple = FALSE)  # long format

# Compare one file that exists in both directories
diff(A, B, "DESCRIPTION")             # same as diffs$DESCRIPTION
diff(A, B, "INDEX", similar = TRUE, trimws = TRUE)  # trim whitespace
## End(Not run)

Index: src/library/base/R/diff.R
===================================================================
--- src/library/base/R/diff.R	(revision 81853)
+++ src/library/base/R/diff.R	(working copy)
@@ -39,3 +39,97 @@
     class(r) <- oldClass(x)
     r
 }
+
+diff.character <- function(x, y, file = NULL, ignore = NULL,
+                           lines = FALSE, short = TRUE, similar = FALSE,
+                           simple = TRUE, trimws = FALSE, ...)
+{
+    ## Calculate A and B entries, containing filenames or lines of text
+    if (dir.exists(x) && dir.exists(y)) {
+        if (is.null(file)) {
+            if (lines) {
+                files <- intersect(dir(x), dir(y))  # excluding subdirs:
+                files <- files[!(files %in% list.dirs(c(x, y),
+                  full.names = FALSE))]
+                out <- list()
+                for (f in files) {
+                    out[[f]] <- diff.character(file.path(x, f),
+                      file.path(y, f), ignore = ignore, lines = FALSE,
+                      short = short, similar = similar, simple = simple,
+                      trimws = trimws, ...)
+                }
+                if (simple)
+                    out <- out[!sapply(out, is.null)]
+                return(out)
+            }
+            else {
+                A <- dir(x)  # excluding subdirs:
+                A <- A[!(A %in% list.dirs(x, full.names = FALSE))]
+                B <- dir(y)
+                B <- B[!(B %in% list.dirs(y, full.names = FALSE))]
+            }
+        }
+        else {
+            A <- readLines(file.path(x, file), ...)
+            B <- readLines(file.path(y, file), ...)
+        }
+    }
+    else if (file.exists(x) && file.exists(y)) {
+        A <- readLines(x, ...)
+        B <- readLines(y, ...)
+    }
+    else {
+        if (!file.exists(x))
+            stop("'", x, "' not found")
+        if (!file.exists(y))
+            stop("'", y, "' not found")
+    }
+
+    ## Compare
+    if (trimws) {
+        A <- trimws(A)
+        A <- A[A != ""]
+        B <- trimws(B)
+        B <- B[B != ""]
+    }
+    diffA <- if (similar) intersect(A, B) else setdiff(A, B)
+    diffB <- if (similar) intersect(B, A) else setdiff(B, A)
+    for (i in seq_along(ignore)) {
+        diffA <- grep(ignore[i], diffA, invert = TRUE, value = TRUE)
+        diffB <- grep(ignore[i], diffB, invert = TRUE, value = TRUE)
+    }
+    if (similar) {
+        out <- list(similar = diffA)
+    }
+    else {
+        out <- list(diffA, diffB)
+        names(out) <- if (short) short.name(x, y) else c(x, y)
+    }
+
+    ## Replace character(0) with NULL
+    if (simple)
+    {
+        out[sapply(out, length) == 0] <- NULL
+        if (length(out) == 0)
+            out <- NULL
+    }
+    out
+}
+
+short.name <- function(A, B)
+{
+    ## Convert \\ to /
+    A <- gsub("\\\\", "/", A)
+    B <- gsub("\\\\", "/", B)
+
+    ## Distinguish between three cases
+    ## case 1: identical, nothing to do - only when user runs diff(x, x)
+    ## case 2: basename is unique, use that
+    ## case 3: basename is identical, cut off basename until it's unique
+    if (A == B)
+        c(A, B)
+    else if (basename(A) != basename(B))  # x/y/A.txt & x/y/B.txt
+        c(basename(A), basename(B))       # => A.txt & B.txt
+    else                                    # x/A/y/n.txt & x/B/y/n.txt
+        short.name(dirname(A), dirname(B))  # => A & B
+}
Index: src/library/base/man/diff.Rd
===================================================================
--- src/library/base/man/diff.Rd	(revision 81853)
+++ src/library/base/man/diff.Rd	(working copy)
@@ -32,7 +32,8 @@
 \details{
   \code{diff} is a generic function with a default method and ones for
   classes \code{"\link{ts}"}, \code{"\link{POSIXt}"} and
-  \code{"\link{Date}"}.
+  \code{"\link{Date}"}, as well as \code{\link{diff.character}} to
+  compare files and directories.
 
   \code{\link{NA}}'s propagate.
 }
@@ -55,7 +56,7 @@
   Wadsworth & Brooks/Cole.
 }
 \seealso{
-  \code{\link{diff.ts}}, \code{\link{diffinv}}.
+  \code{\link{diff.character}}, \code{\link{diff.ts}}, \code{\link{diffinv}}.
 }
 \examples{
 diff(1:10, 2)
Index: src/library/base/man/diff.character.Rd
===================================================================
--- src/library/base/man/diff.character.Rd	(nonexistent)
+++ src/library/base/man/diff.character.Rd	(working copy)
@@ -0,0 +1,120 @@
+\name{diff.character}
+\alias{diff.character}
+\title{Compare Files}
+\description{Show differences between files or directories.}
+\usage{
+\method{diff}{character}(x, y, file = NULL, ignore = NULL,
+     lines = FALSE, short = TRUE, similar = FALSE, simple = TRUE,
+     trimws = FALSE, \dots)
+}
+\arguments{
+  \item{x}{a file or directory name.}
+  \item{y}{another file or directory name.}
+  \item{file}{if \code{x} and \code{y} are directories, then \code{file}
+    can be used to select a specific file that exists in both
+    directories.}
+  \item{ignore}{patterns (regular expressions) to exclude from the
+    output.}
+  \item{lines}{if \code{x} and \code{y} are directories, then
+    \code{lines = TRUE} compares the contents (lines) of files that
+    exist in both directories, instead of listing filenames that are
+    different between the directories.}
+  \item{short}{whether to produce short file paths for the output.}
+  \item{similar}{whether to show similarities instead of differences.}
+  \item{simple}{whether to replace \code{character(0)} with \code{NULL}
+    in output, for compact display.}
+  \item{trimws}{whether to trim whitespace and exclude empty strings.}
+  \item{\dots}{passed to \code{readLines}.}
+}
+\details{
+  When comparing directories, two kinds of differences can occur: (1)
+  filenames existing in one directory and not the other, and (2) files
+  containing different lines of text. The purpose of the \code{lines}
+  argument is to select which of those two kinds of differences to show.
+
+  If \code{x} and \code{y} are files (and not directories), the
+  \code{file} and \code{lines} arguments are not applicable and will be
+  ignored.
+}
+\value{
+  List showing differences as strings, or similarities if
+  \code{similar = TRUE}.
+}
+\note{
+  This function uses \code{setdiff} for the comparison, so line order,
+  line numbers, and repeated lines are ignored. Subdirectories are
+  excluded when comparing directories.
+
+  This function has very basic features compared to full-featured diff
+  tools such as \emph{WinMerge} (Windows), \emph{Meld} (Linux,
+  Windows), \emph{Kompare} (Linux), \emph{Ediff} (Emacs), or the
+  \command{diff} shell command. The use of such tools is
+  recommended, but what this function offers in addition is:
+
+  \itemize{
+    \item a quick diff tool that is handy during an interactive R
+    session,
+    \item a programmatic interface to analyze file differences as native
+    R objects, and
+    \item a tool that works on all platforms, regardless of what
+    software may be installed.
+  }
+
+  The \code{short} and \code{simple} defaults are designed for
+  interactive (human-readable) use, while \code{short = FALSE} and
+  \code{simple = FALSE} produce a consistent number of list elements
+  and retain longer paths.
+}
+\seealso{
+  \code{\link{diff}} is a generic function. Depending on \code{x}, it
+  will show differences between numbers, date-time objects, files,
+  directories, etc.
+
+  \code{\link{dir}}, \code{\link{readLines}}, and \code{\link{setdiff}}
+  are the underlying functions performing the file and directory
+  comparison.
+}
+\examples{
+\dontrun{
+
+# Compare two files
+write(c("We", "are", "not"), file = "one.txt")
+write(c("We", "are", "the same"), file = "two.txt")
+diff("one.txt", "two.txt")
+diff("one.txt", "two.txt", similar = TRUE)
+file.remove("one.txt", "two.txt")
+
+# Another example with two files
+x <- system.file("DESCRIPTION", package = "base")
+y <- system.file("DESCRIPTION", package = "stats")
+diff(x, y)
+diff(x, y, similar = TRUE)
+
+# Filter out noise
+diff(x, y, ignore = c("Package:", "Title:", "Description:", "Built:"))
+
+# Compare filenames in two directories
+A <- system.file(package = "base")
+B <- system.file(package = "stats")
+diff(A, B)                  # these filenames are different
+diff(A, B, ignore = "^C")   # exclude entries starting with C
+diff(A, B, similar = TRUE)  # these filenames exist in both directories
+
+# Compare content of files that exist in both directories
+diff(A, B, lines = TRUE)            # the INDEX files are very different
+diff(A, B, lines = TRUE, similar = TRUE)  # but not completely different
+diff(A, B, lines = TRUE, n = 20)    # demonstrate passing n to readLines
+diffs <- diff(A, B, lines = TRUE)   # store comparison as list
+names(diffs)                        # these files are different
+str(diffs, vec.len = 1)             # first difference in each file
+
+# Alternative format
+diff(A, B, ignore = "^C")                                 # short format
+diff(A, B, ignore = "^C", short = FALSE, simple = FALSE)  # long format
+
+# Compare one file that exists in both directories
+diff(A, B, "DESCRIPTION")             # same as diffs$DESCRIPTION
+diff(A, B, "INDEX", similar = TRUE, trimws = TRUE)  # trim whitespace
+}
+}
+\keyword{file}

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
