> Date: Thu, 22 May 2014 12:28:22 +0100
> From: Pádraig_Brady <p...@draigbrady.com>
> Subject: Re: stat: added features: `--files0-from=FILE', `--digest-type=WORD'

>   join -j2 <(stat -c '%s %n' /bin/ls /bin/cp | sort) <(sha1sum /bin/cp /bin/ls | sort)

>   tr '\n' '\1' |
>   sort |
>   uniq -u

...

Your remarks are correct iff the outputs of stat and sha1sum *are*
consistently joinable. However, when employing such usage patterns in
*generally usable scripts*, one has to take care of possible inconsistencies
(leading to bugs!) that occur when file names contain SPACE, TAB, NL and other
such characters.

A solution would be to use TAB solely as the field separator, thus ensuring
that it cannot appear anywhere else. Then one might invoke join with
"-t $'\t'". Given this condition, it should be clearer why stat needs the
'--quoting-style=escape' and '--digest-type=sha1' options and the '%S' format
specifier.

> There is no advantage of supporting this option in stat
> as that is only useful when a command needs to process all
> file names in a _single invocation_, like when sorting or accumulating etc.
> For stat one can efficiently:
>
>   find ... -print0 | xargs -r0 stat ...
>
> or
>
>   find ... -exec stat {} +

One meaningful reason for a single invocation is efficiency. The input to stat
can be huge (and in the scenario I initially described it often is!), and that
potentially large amount of data propagates down through the multiple
pipelines and FIFOs of your scenario above.

> Note also that sort has the --zero-terminated option, as do newer versions of
> join and uniq.

The fanciful '-0|--null' option refers to both the input and the output of
sort; the existing '-z|--zero-terminated' refers only to sort's output.

> This could be useful, however there is already the %N option for quoted file
> name.
>
>   $ stat -c %N /bin/ls
>   ‘/bin/ls’
>   $ LANG=C src/stat -c %N /bin/ls
>   '/bin/ls'

Recall the claimed consistency from above.
In case of symlinks, %N produces output like the one below:

  $ touch /tmp/foo
  $ ln -sv /tmp/foo /tmp/bar
  `/tmp/bar' -> `/tmp/foo'
  $ stat -c %N /tmp/bar
  `/tmp/bar' -> `/tmp/foo'
  $

Also, in case of symlinks, the digest-computing programs do follow the links,
i.e. they actually compute digests for the content of the file to which the
symlink points:

  $ sha1sum /tmp/foo /tmp/bar
  da39a3ee5e6b4b0d3255bfef95601890afd80709  /tmp/foo
  da39a3ee5e6b4b0d3255bfef95601890afd80709  /tmp/bar

The semantics of %S in the proposed patches is different, however: the new
stat produces the digest of the *content* of the file itself. In case of
symlinks that content is obtained via 'areadlink_with_size':

  $ stat2 -c '%S %n' /tmp/foo /tmp/bar
  da39a3ee5e6b4b0d3255bfef95601890afd80709 /tmp/foo
  469150566bd728fc90b4adf6495202fd70ec3537 /tmp/bar

Note that the STAT_* files of my initial usage scenario have intrinsic value
in themselves, not only as a means of verifying that ISO files were made or
DVDs burned correctly. These files keep a quite faithful record of the content
of the file system itself.

With many thanks for your thorough response,
Stefan Vargyas.
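
P.S. For readers without the patched stat, the difference between the two
symlink semantics described above can be approximated with existing tools.
What follows is only a minimal sketch, not the patch itself: it assumes GNU
coreutils (sha1sum, readlink, mktemp), uses made-up file names in a temporary
directory, and assumes that the link content being hashed is the target path
without a trailing newline. The digest it prints for the link depends on the
temporary path, so it will not match the /tmp/foo value shown above.

```shell
#!/bin/sh
# Sketch only: contrast "digest of the file the symlink points to"
# with "digest of the symlink's own content" (its target path).
# Assumption: GNU coreutils; 'foo' and 'bar' are hypothetical names.
dir=$(mktemp -d) || exit 1
: > "$dir/foo"                 # empty regular file
ln -s "$dir/foo" "$dir/bar"    # symlink pointing at it

# sha1sum follows the symlink, so both digests are those of foo's content.
follow_foo=$(sha1sum "$dir/foo" | cut -d' ' -f1)
follow_bar=$(sha1sum "$dir/bar" | cut -d' ' -f1)

# Digest of the link itself: hash the target path returned by readlink.
# Note: $(...) strips trailing newlines, so a target ending in NL is not
# handled by this sketch.
link_bar=$(printf '%s' "$(readlink "$dir/bar")" | sha1sum | cut -d' ' -f1)

printf '%s  %s\n' \
    "$follow_foo" 'foo' \
    "$follow_bar" 'bar (link followed)' \
    "$link_bar"   'bar (link content)'

rm -rf "$dir"
```

The first two digests coincide because sha1sum reads through the link; the
third differs because it covers the link's target path rather than the file
content.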