On 13-09-04 8:02 PM, Karl Millar wrote:
Hi Duncan,

I think this functionality would be much easier to use and understand if
you split it up the functionality of taking snapshots and comparing them
into separate functions.

Yes, that's another possibility.  Some more comment below...


 In addition, the 'timestamp' functionality
seems both confusing and brittle to me.  I think it would be better to
store file modification times in the snapshot and use those instead of
an external file.  Maybe:

You can do that, using file.info = "mtime", but the file.info snapshots are quite a bit slower than using the timestamp file (when looking at a big recursive directory of files).


# Take a snapshot of the files.
takeFileSnapshot(directory, file.info <http://file.info> = TRUE, md5sum
= FALSE, full.names = FALSE, recursive = TRUE, ...)

# Take a snapshot using the same options as used for snapshot.
retakeFileSnapshot(snapshot, directory = snapshot$directory) {
    takeFileSnapshot)(directory, file.info <http://file.info> =
snapshot$file.info <http://file.info>, md5sum = snapshot$md5sum, etc)
}

compareFileSnapshots(snapshot1, snapshot2)
- or -
getNewFiles(snapshat1, snapshot2)       # These names are probably too
generic
getDeletedFiles(snapshot1, snapshot2)
getUpdatedFiles(snapshot1, snapshot2)
-or-
setdiff(snapshot1, snapshot2)  # Unclear how this should treat updated files


This approach does have the difficulty that users could attempt to
compare snapshots that were taken with different options and that can't
be compared, but that should be an easy error to detect.

I don't want to add too many new functions. The general R style is to have functions that do a lot, rather than have a lot of different functions to achieve different parts of related tasks. This is better for interactive use (fewer functions to remember, a simpler help system to navigate), though it probably results in less readable code.

I can see an argument for two functions (a get and a compare), but I don't think there are many cases where doing two gets and comparing the snapshots would be worth the extra runtime. (It's extra because file.info is only a little faster than list.files, and it would be unavoidable to call both twice in that version. Using the timestamp file avoids one of those calls, and replaces the other with file_test, which takes a similar amount of time. So overall it's about 20-25% faster.) It also makes the code a bit more complicated, i.e. three calls (get, get, compare) instead of two (get, compare).

Thanks for your comments.

Duncan Murdoch



Karl


On Wed, Sep 4, 2013 at 10:53 AM, Duncan Murdoch
<murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>> wrote:

    In a number of places internal to R, we need to know which files
    have changed (e.g. after building a vignette).  I've just written a
    general purpose function "changedFiles" that I'll probably commit to
    R-devel.  Comments on the design (or bug reports) would be appreciated.

    The source for the function and the Rd page for it are inline below.

    ----- changedFiles.R:
    changedFiles <- function(snapshot, timestamp =
    tempfile("timestamp"), file.info <http://file.info> = NULL,
                  md5sum = FALSE, full.names = FALSE, ...) {
         dosnapshot <- function(args) {
             fullnames <- do.call(list.files, c(full.names = TRUE, args))
             names <- do.call(list.files, c(full.names = full.names, args))
             if (isTRUE(file.info <http://file.info>) ||
    (is.character(file.info <http://file.info>) && length(file.info
    <http://file.info>))) {
                 info <- file.info <http://file.info>(fullnames)
             rownames(info) <- names
                 if (isTRUE(file.info <http://file.info>))
    file.info <http://file.info> <- c("size", "isdir", "mode", "mtime")
             } else
                 info <- data.frame(row.names=names)
         if (md5sum)
             info <- data.frame(info, md5sum = tools::md5sum(fullnames))
         list(info = info, timestamp = timestamp, file.info
    <http://file.info> = file.info <http://file.info>,
              md5sum = md5sum, full.names = full.names, args = args)
         }
         if (missing(snapshot) || !inherits(snapshot,
    "changedFilesSnapshot")) {
             if (length(timestamp) == 1)
                 file.create(timestamp)
             if (missing(snapshot)) snapshot <- "."
             pre <- dosnapshot(list(path = snapshot, ...))
             pre$pre <- pre$info
             pre$info <- NULL
             pre$wd <- getwd()
             class(pre) <- "changedFilesSnapshot"
             return(pre)
         }

         if (missing(timestamp)) timestamp <- snapshot$timestamp
         if (missing(file.info <http://file.info>) || isTRUE(file.info
    <http://file.info>)) file.info <http://file.info> <-
    snapshot$file.info <http://file.info>
         if (identical(file.info <http://file.info>, FALSE)) file.info
    <http://file.info> <- NULL
         if (missing(md5sum))    md5sum <- snapshot$md5sum
         if (missing(full.names)) full.names <- snapshot$full.names

         pre <- snapshot$pre
         savewd <- getwd()
         on.exit(setwd(savewd))
         setwd(snapshot$wd)

         args <- snapshot$args
         newargs <- list(...)
         args[names(newargs)] <- newargs
         post <- dosnapshot(args)$info
         prenames <- rownames(pre)
         postnames <- rownames(post)

         added <- setdiff(postnames, prenames)
         deleted <- setdiff(prenames, postnames)
         common <- intersect(prenames, postnames)

         if (length(file.info <http://file.info>)) {
             preinfo <- pre[common, file.info <http://file.info>]
             postinfo <- post[common, file.info <http://file.info>]
             changes <- preinfo != postinfo
         }
         else changes <- matrix(logical(0), nrow = length(common), ncol = 0,
                                dimnames = list(common, character(0)))
         if (length(timestamp))
             changes <- cbind(changes, Newer = file_test("-nt", common,
    timestamp))
         if (md5sum) {
             premd5 <- pre[common, "md5sum"]
             postmd5 <- post[common, "md5sum"]
         changes <- cbind(changes, md5sum = premd5 != postmd5)
         }
         changes1 <- changes[rowSums(changes, na.rm = TRUE) > 0, , drop
    = FALSE]
         changed <- rownames(changes1)
         structure(list(added = added, deleted = deleted, changed = changed,
             unchanged = setdiff(common, changed), changes = changes),
    class = "changedFiles")
    }

    print.changedFilesSnapshot <- function(x, ...) {
         cat("changedFiles snapshot:\n timestamp = \"", x$timestamp,
    "\"\n file.info <http://file.info> = ",
             if (length(x$file.info <http://file.info>))
    paste(paste0('"', x$file.info <http://file.info>, '"'), collapse=","),
             "\n md5sum = ", x$md5sum, "\n args = ", deparse(x$args,
    control = NULL), "\n", sep="")
         x
    }

    print.changedFiles <- function(x, ...) {
         if (length(x$added)) cat("Files added:\n",  paste0("  ",
    x$added, collapse="\n"), "\n", sep="")
         if (length(x$deleted)) cat("Files deleted:\n",  paste0("  ",
    x$deleted, collapse="\n"), "\n", sep="")
         changes <- x$changes
         changes <- changes[rowSums(changes, na.rm = TRUE) > 0, ,
    drop=FALSE]
         changes <- changes[, colSums(changes, na.rm = TRUE) > 0,
    drop=FALSE]
         if (nrow(changes)) {
             cat("Files changed:\n")
             print(changes)
         }
         x
    }
    ----------------------

    --- changedFiles.Rd:
    \name{changedFiles}
    \alias{changedFiles}
    \alias{print.changedFiles}
    \alias{print.__changedFilesSnapshot}
    \title{
    Detect which files have changed
    }
    \description{
    On the first call, \code{changedFiles} takes a snapshot of a
    selection of files.  In subsequent
    calls, it takes another snapshot, and returns an object containing
    data on the
    differences between the two snapshots.  The snapshots need not be
    the same directory;
    this could be used to compare two directories.
    }
    \usage{
    changedFiles(snapshot, timestamp = tempfile("timestamp"), file.info
    <http://file.info> = NULL,
                  md5sum = FALSE, full.names = FALSE, ...)
    }
    \arguments{
       \item{snapshot}{
    The path to record, or a previous snapshot.  See the Details.
    }
       \item{timestamp}{
    The name of a file to write at the time the initial snapshot
    is taken.  In subsequent calls, modification times of files will be
    compared to
    this file, and newer files will be reported as changed.  Set to
    \code{NULL}
    to skip this test.
    }
       \item{file.info <http://file.info>}{
    A vector of columns from the result of the \code{file.info
    <http://file.info>} function, or a logical value.  If
    \code{TRUE}, columns \code{c("size", "isdir", "mode", "mtime")} will
    be used.  Set to
    \code{FALSE} or \code{NULL} to skip this test.  See the Details.
    }
       \item{md5sum}{
    A logical value indicating whether MD5 summaries should be taken as
    part of the snapshot.
    }
       \item{full.names}{
    A logical value indicating whether full names (as in
    \code{\link{list.files}}) should be
    recorded.
    }
       \item{\dots}{
    Additional parameters to pass to \code{\link{list.files}} to control
    the set of files
    in the snapshots.
    }
    }
    \details{
    This function works in two modes.  If the \code{snapshot} argument
    is missing or is
    not of S3 class \code{"changedFilesSnapshot"}, it is used as the
    \code{path} argument
    to \code{\link{list.files}} to obtain a list of files.  If it is of
    class
    \code{"changedFilesSnapshot"}, then it is taken to be the baseline file
    and a new snapshot is taken and compared with it.  In the latter
    case, missing
    arguments default to match those from the initial snapshot.

    If the \code{timestamp} argument is length 1, a file with that name
    is created
    in the current directory during the initial snapshot, and
    \code{\link{file_test}}
    is used to compare the age of all files to it during subsequent calls.

    If the \code{file.info <http://file.info>} argument is \code{TRUE}
    or it contains a non-empty
    character vector, the indicated columns from the result of a call to
    \code{\link{file.info <http://file.info>}} will be recorded and
    compared.

    If \code{md5sum} is \code{TRUE}, the \code{tools::\link{md5sum}}
    function
    will be called to record the 32 byte MD5 checksum for each file, and
    these values
    will be compared.
    }
    \value{
    In the initial snapshot phase, an object of class
    \code{"changedFilesSnapshot"} is returned.  This
    is a list containing the fields
    \item{pre}{a dataframe whose rownames are the filenames, and whose
    columns contain the
    requested snapshot data}
    \item{timestamp, file.info <http://file.info>, md5sum, full.names}{a
    record of the arguments in the initial call}
    \item{args}{other arguments passed via \code{...} to
    \code{\link{list.files}}.}

    In the comparison phase, an object of class \code{"changedFiles"}.
    This is a list containing
    \item{added, deleted, changed, unchanged}{character vectors of
    filenames from the before
    and after snapshots, with obvious meanings}
    \item{changes}{a logical matrix with a row for each common file, and
    a column for each
    comparison test.  \code{TRUE} indicates a change in that test.}

    \code{\link{print}} methods are defined for each of these types. The
    \code{\link{print}} method for \code{"changedFilesSnapshot"} objects
    displays the arguments used to produce it, while the one for
    \code{"changedFiles"} displays the \code{added}, \code{deleted}
    and \code{changed} fields if non-empty, and a submatrix of the
    \code{changes}
    matrix containing all of the \code{TRUE} values.
    }
    \author{
    Duncan Murdoch
    }
    \seealso{
    \code{\link{file.info <http://file.info>}}, \code{\link{file_test}},
    \code{\link{md5sum}}.
    }
    \examples{
    # Create some files in a temporary directory
    dir <- tempfile()
    dir.create(dir)
    writeBin(1, file.path(dir, "file1"))
    writeBin(2, file.path(dir, "file2"))
    dir.create(file.path(dir, "dir"))

    # Take a snapshot
    snapshot <- changedFiles(dir, file.info <http://file.info>=TRUE,
    md5sum=TRUE)

    # Change one of the files
    writeBin(3, file.path(dir, "file2"))

    # Display the detected changes
    changedFiles(snapshot)
    changedFiles(snapshot)$changes
    }
    \keyword{utilities}
    \keyword{file}

    ________________________________________________
    R-devel@r-project.org <mailto:R-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/__listinfo/r-devel
    <https://stat.ethz.ch/mailman/listinfo/r-devel>



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to