Re: Using hash instead of timestamps to check for changes.

Paul Smith Fri, 27 Mar 2015 08:45:41 -0700

On Fri, 2015-03-27 at 14:42 +0100, Glen Stark wrote:
> Is this planned?  Has the idea already been rejected, and if so could
> you point me to the discussion so I can inform myself?


There is no formal planning around it right now, and it's not at the top
of my TODO list for GNU make.

> If it is planned, or you agree it's worth doing, how can I help?  I'm
> willing to write the code if someone is willing to help me work into the
> code a little.  Until now I'm only a user, not maintainer of Make, and
> would need some tips about how to fit the functionality into the overall
> design of Make.  Someone to bounce ideas off, and direct questions to
> would be wonderful.  If someone else is working on it already, I'd like
> to help however I can -- testing, debugging, etc.

I'm not aware of anyone working on it.  It sounds like a simple thing,
but actually there are a lot of issues that need to be considered before
any implementation can be started.  The important thing to remember is
that currently make is completely stateless... or rather, it uses the
filesystem to maintain its state (in the form of modification times).
Any change to a method of determining "out-of-date-ness" such as a hash
of the file content means introducing a separate state that make has to
maintain: this adds a lot of complexity and corner cases to work
through.
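
To make the "filesystem as state" point concrete, here is a minimal
sketch of the modification-time comparison.  It is an illustration in
Python, not GNU make's actual code; the function name is_out_of_date is
hypothetical.

```python
import os


def is_out_of_date(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    The only "state" consulted is what the filesystem already records.
    """
    try:
        target_mtime = os.stat(target).st_mtime_ns
    except FileNotFoundError:
        return True  # a missing target always needs to be built
    # A missing prerequisite raises FileNotFoundError here; this sketch
    # does not try to build it the way a real build tool would.
    return any(os.stat(p).st_mtime_ns > target_mtime for p in prerequisites)
```

Nothing is written anywhere by this check, which is why there are no
stale or corrupted state files to worry about today.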

Before anyone considers writing code of this magnitude, they should
familiarize themselves with the FSF's requirements for contributing to
the GNU project.  You'll need to assign copyright to the FSF for the
work contributed to GNU make, which involves some legal paperwork on
your part.  If your employer has rights to your work (most do, at least
in the U.S., even if you don't do the work on the job), your employer
will have to agree as well.

On the technical side, there are various things to consider:
      * What form will the extra state be kept in?  One file per
        directory?  One file per target?  Something else?
      * If we use one file per target things are simpler, although that
        adds up to a LOT of files in bigger builds and some platforms
        might have problems.
      * If we use one file per directory, there are lots of issues:
              * When is the file written?  Every time a target is
                updated?  Once at the end of the build?
              * How will make handle the state file if it's killed in
                the middle of a build?
              * How will make handle missing/corrupted state files?
                Will it fall back on modification times, or just rebuild
                everything?
              * How do we handle recursion, where multiple instances of
                make could be running in the same directory?
      * We need to consider platform-specific issues; for example on
        UNIX systems a cheap/fast method of keeping per-file metadata
        might be to make a symbolic link containing the data, but that
        won't work on Windows or VMS, etc.
      * What type of extra state will we use?  My suspicion is that
        MD5 (md5sum) is not the best choice.  We don't really need it:
        we want fingerprinting, not a cryptographic hash.  We don't even
        need to do de-dup so we won't run into the birthday paradox: we
        only want to know if the file has changed since the last time we
        saw it.  Probably a straightforward, well-distributed hash like
        xxhash would be sufficient.  If you combine both mod time AND
        the hash that's pretty definitive; you can probably get away
        with a 32-bit hash.
      * What are the performance implications?  You're committing to
        having make read the entire content of every single file
        involved in the build into memory, just to decide what to
        update!  That's definitely going to hurt: a simple "nothing to
        do" build will suffer a big performance penalty.  In fact, in a
        way the fewer jobs make needs to run the slower it will be,
        since it will have to check the hash of every target where the
        mod time doesn't give an answer.  Maybe the hashing could be
        done per-block instead of on the entire file so you could fail
        faster, or something.  But now you're storing more state per
        target (multiple hashes per target).
      * Do we really need to hash the file?  Maybe simply expanding the
        current checking is sufficient.  For example, if in addition to
        mod time we also considered the size of the file (and maybe
        other things maintained by the filesystem like inode, for tools
        which don't just overwrite the same file) we could increase our
        accuracy WITHOUT resorting to a separate state file.  Is that
        good enough?
      * What if people want to define their own "out-of-date-ness" test?
        Maybe someone wants to integrate with inotify, or they want to
        check the preprocessor output so that files are not considered
        changed just because a comment changes, or something.
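
As a rough illustration of how several of these points interact, here is
a sketch in Python of one possible combination: one state file per
directory, a cheap mod time + size check first, and a 32-bit content
fingerprint only when that check is inconclusive.  Everything in it is
an assumption made for the example: the file name .make-state.json, the
JSON layout, the function names, and the use of zlib.crc32 as a stand-in
for a hash like xxhash (which is not in the Python standard library).
It does not address recursion or concurrent writers.

```python
import json
import os
import tempfile
import zlib

STATE_FILE = ".make-state.json"  # hypothetical name, one file per directory


def fingerprint(path, block_size=65536):
    """32-bit CRC of the file content, read block by block."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):
            crc = zlib.crc32(block, crc)
    return crc


def load_state(directory):
    """Load recorded state; a missing or corrupt file yields empty state."""
    try:
        with open(os.path.join(directory, STATE_FILE)) as f:
            state = json.load(f)
        return state if isinstance(state, dict) else {}
    except (OSError, ValueError):
        return {}  # with no usable state, every file will look changed


def save_state(directory, state):
    """Write to a temp file, then rename, so a killed run keeps the old file."""
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(directory, STATE_FILE))


def has_changed(path, state):
    """Compare path against the recorded state and update state in place."""
    st = os.stat(path)
    key = os.path.basename(path)
    old = state.get(key)
    if old and old["mtime"] == st.st_mtime_ns and old["size"] == st.st_size:
        return False  # cheap check passed: the content is never read
    new = {"mtime": st.st_mtime_ns, "size": st.st_size,
           "hash": fingerprint(path)}
    state[key] = new
    # Same size and same hash: a touched but identical file is unchanged.
    return not (old and old["size"] == new["size"]
                and old["hash"] == new["hash"])
```

In this sketch the content is read only when mod time or size differs
from what was recorded, so a "nothing to do" build costs one stat() per
file plus one read of the state file.  The price is exactly the extra
state discussed above: the caller has to decide when to call
save_state(), and a lost state file makes everything look changed.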


_______________________________________________
Bug-make mailing list
Bug-make@gnu.org
https://lists.gnu.org/mailman/listinfo/bug-make
