On 13-09-06 11:07 PM, Karl Millar wrote:
On Fri, Sep 6, 2013 at 7:03 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
On 13-09-06 9:21 PM, Karl Millar wrote:

Hi Duncan,

I like the interface of this version a lot better, but there's still a
bunch of implementation details that need fixing:

* As previously mentioned, there are important cases where the mtime
values change in ways that this code doesn't detect.
* If the timestamp file (which is usually in the temp directory) gets
deleted (which can happen after a moderate amount of time of
inactivity on some systems), then the file_test('-nt', ...) will
always return false, even if the file has changed.


If that happened without user intervention, I think it would break other
things in R -- the temp directory is supposed to last for the whole session.
But I should be checking anyway.

Yes, it does break other things in R -- my experience has been that
the help system seems to be the one that is impacted the most by this.
  FWIW, I've never seen the entire R temp directory deleted, just
individual files and subdirectories in it, but even that probably
depends on how the machine is configured.  I suspect only a few users
ever notice this, but my R use is probably somewhat anomalous and I
think it only happens to R sessions that I haven't used for a few
days.

I use Windows and never see this; deleting temp files is up to me, not to the system. But my understanding was the *nix systems should only clean up /tmp on restart, and I don't think an R session will survive a restart.

However, you have convinced me that the use of the timestamp file is not beneficial enough to be the default. I'll leave it as an option, but add warnings that it might be unreliable.


* If files get added or deleted between the two calls to list.files in
fileSnapshot, it will fail with an error.


Yours won't work if path contains more than one directory.  This is probably
a reasonable restriction, but it's inconsistent with list.files, so I'd like
to avoid it if I can find a way.

I'm currently unsure what the behaviour when comparing snapshots with
multiple directories should be.

Presumably we should have the property that (horribly abusing notation
for succinctness):
       compareSnapshots(c(a1, a2),  c(a1, a2))
is the same as concatenating (in some form)
       compareSnapshots(a1, a1) and compareSnapshots(a2, a2)
and there's a bunch of ways we could concatenate -- we could return a
list of results, or a single result where each of the 'added, deleted,
modified' fields are a list, or where we concatenate the 'added,
deleted, modified' fields together into three simple vectors.
Concatenating the vectors together like this is appealing, but unless
you're using the full names, it doesn't include the information of
which directory the changes are in, and using the full names doesn't
work in the case where you're comparing different sets of directories,
e.g. compareSnapshots(c(a1, a2), c(b1, b2)), where there is no
sensible choice for a full name.  The list options don't have this
problem, but are harder to work with, particularly for the common case
where there's only a single directory.  You'd also have to be somewhat
careful with filenames that occur in both directories.

Maybe I'm just being dense, but I don't see a way to do this thats
clear, easy to use and wouldn't confuse users at the moment.

The way I've done this is to require full.names when multiple dirs are on the path. I've reduced it to one list.files() call per dir, by iterating over the path variable and using your approach of calling it with full.names = FALSE, then adding the dir if necessary.

I haven't adopted your change that forces comparison of only size and mtime from file.info. I don't see a big cost in storing whatever file.info returns (which is system dependent; on Windows I don't see the user and group related columns; on Unix I don't see the exe column). Users might want to detect changes to anything there, and I shouldn't make it harder for them.

I've also kept the special-casing of md5sum; it really needs to be wrapped in suppressWarnings() (on Unix only). And I've kept the options to specify what changedFiles checks among the file.info columns; I can see that you might want a snapshot with everything, but sometimes only want to be told about changes in a subset of the attributes.

I've uploaded <http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.1.tar.gz> if anyone is interested.

Duncan Murdoch


Karl

Duncan Murdoch


* If the path is on a remote file system, tempdir is local, and
there's significant clock skew, then you can get incorrect results.

Unfortunately, these aren't just theoretical scenarios -- I've had the
misfortune to run up against all of them in the past.

I've attached code that's loosely based on your implementation that
solves these problems AFAICT.  Alternatively, Hadley's code handles
all of these correctly, with the exception that compare_state doesn't
handle the case where safe_digest returns NA very well.

Regards,

Karl

On Fri, Sep 6, 2013 at 5:40 PM, Duncan Murdoch <murdoch.dun...@gmail.com>
wrote:

On 13-09-06 7:40 PM, Scott Kostyshak wrote:


On Fri, Sep 6, 2013 at 3:46 PM, Duncan Murdoch
<murdoch.dun...@gmail.com>
wrote:


On 06/09/2013 2:20 PM, Duncan Murdoch wrote:



I have now put the code into a temporary package for testing; if
anyone
is interested, for a few days it will be downloadable from

fisher.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz




Sorry, error in the URL.  It should be

http://www.stats.uwo.ca/faculty/murdoch/temp/testpkg_1.0.tar.gz



Works well. A couple of things I noticed:

(1)
md5sum is being called on directories, which causes warnings. (If this
is not viewed as undesirable, please ignore the rest of this comment.)
Should this be the responsibility of the user (by passing arguments to
list.files)? In the example, changing
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE)
to
fileSnapshot(dir, file.info=TRUE, md5sum=TRUE, include.dirs=FALSE,
recursive=TRUE")

gets rid of the warnings. But perhaps the user just wants to exclude
directories for the md5sum calculations. This can't be controlled from
fileSnapshot.



I don't see the warnings, I just get NA values.  I'll try to see why
there's
a difference.  (One possibility is my platform (Windows); another is that
I'm generally testing in R-patched and R-devel rather than the 3.0.1
release
version.)  I would rather suppress the warnings than make the user avoid
them.



Or, should the "if (md5sum)" chunk subset "fullnames" using file_test
or file.info to exclude directories (and then fill in the directories
with NA)?

(2)
If I run example(changedFiles) several times, sometimes I get:

chngdF> changedFiles(snapshot)
File changes:
         mtime md5sum
file2  TRUE   TRUE

and other times I get:

chngdF> changedFiles(snapshot)
File changes:
         md5sum
file2   TRUE

I wonder why.



Sometimes the example runs so quickly that the new version has exactly
the
same modification time as the original.  That's the risk of the mtime
check.
If you put a delay between, you'll get consistent results.

Duncan Murdoch



Scott

sessionInfo()


R Under development (unstable) (2013-08-31 r63780)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testpkg_1.0

loaded via a namespace (and not attached):
[1] tools_3.1.0





--
Scott Kostyshak
Economics PhD Candidate
Princeton University


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to