Karl Vogel wrote:
> On Wed, Aug 30, 2023 at 07:55:14AM -0400, songbird wrote:
>> Karl Vogel wrote:
>> ...
>> > If nothing else, it's faster to run "locate" and look for file extensions;
>> > running "file" on that much crap took nearly 9 hours.
>> 
>> do you have SSDs or spinning rust?
>
>   I have a 256-GB SSD and two mirrored Western Digital Blue 1.8-TB drives.
>   About 2 million files are on SSD and the rest are on rust.
>
>   I used "file" v5.45 built from source, which does a nice job but is IO-
>   and CPU-intensive.

  mirroring is going to make quite a difference, especially
if you are updating each file's access time (see below).


>> when i just did this:
>>     # find / -type f | wc -l
>> it took all of 24 seconds for the 2.4 million files found.
>
>   Generating hashes for SSD files is faster than getting the filetype;
>   it takes about 17 minutes for 3.6 million files (153 Gbytes).  I like
>   the Blake-2 hash cuz it's fast as hell, among other things:
>
>     #!/bin/ksh
>     #<zroot-hash: run Blake hash on all zroot dataset files
>
>     export PATH=/usr/local/bin:/bin:/usr/bin
>     tag=${0##*/}
>     set -o nounset
>     umask 022
>
>     logmsg () { logger -t "$tag" "$@"; }
>     die ()    { logmsg "FATAL: $*"; exit 1; }
>
>     work=$(mktemp -q "/tmp/$tag.work.XXXXXX")
>     case "$?" in
>         0)  test -f "$work" || die "$work: tmp list file not found" ;;
>         *)  die "can't create work file" ;;
>     esac
>
>     # Get a list of all regular files on SSD.
>
>     mount | awk '/^zroot/ {print $3}' |
>       while IFS= read -r dataset
>       do
>           logmsg "listing $dataset"
>           find "$dataset" -xdev -type f -print0 >> "$work"
>       done
>
>     # Store hashes for SSD datasets.
>     # The hash file is sorted by filename to make comparisons easier.
>
>     logmsg "running b2sum"
>     fdbdir=$(date '+/var/fdb/%Y/%m%d')
>     mkdir -p "$fdbdir" || die "can't create $fdbdir"
>     sort -z "$work" | xargs -0r b2sum -l 128 > "$fdbdir/zroot.sum"
>     rm "$work"
>     exit 0
>
>   Useful for finding changed files -- security, backups, etc.
>
>> what script did you use?
>
>     #!/bin/ksh
>     #<ftype: get a sampling of filetypes for all SSD filesystems, /src.
>
>     export PATH=/usr/local/bin:/bin:/usr/bin
>     set -o nounset
>     tag=${0##*/}
>     umask 022
>
>     logmsg () { logger -t "$tag" "$@"; }
>
>     work="/tmp/$tag.$$"
>     fsys="/ /doc /home /usr/local /search /usr/src /dist /src"
>
>     logmsg start
>     find $fsys -xdev -print0 | xargs -0 file -N --mime-type > "$work"
>     logmsg finish
>
>     mv "$work" filetypes
>     exit 0

  interesting and thanks for that.  :)

  my comments that follow are geared towards finding only
the files that have been accessed or changed recently, so
that you do not have to process the entire file system.

  so the find statement would be adjusted to use -cmin or
-amin (depending on whether i want to find changed or
accessed files) and the file command would include the -p
flag to avoid updating the access time.
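
  a rough sketch of what i mean, assuming a find and xargs
that support -cmin, -print0 and -0r (the GNU and BSD ones
do).  the function name recent_types and its two arguments
are made up for illustration, they are not from the script
above:

```shell
#!/bin/sh
# recent_types DIR MINUTES  (hypothetical helper, for illustration only)
# Print the MIME type of each regular file under DIR whose inode status
# changed less than MINUTES minutes ago.
recent_types () {
    dir=$1
    mins=$2
    # -cmin -N : status changed less than N minutes ago
    #            (use -amin -N instead to select recently accessed files)
    # file -p  : attempt to preserve the access time of the files examined
    # xargs -r : do not run file at all when find matched nothing
    find "$dir" -xdev -type f -cmin "-$mins" -print0 |
        xargs -0r file -p -N --mime-type
}
```

  something like "recent_types /home 1440" would then cover
only the files touched in the last day instead of the whole
file system.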


  as for the other topic, finding changed files using
hashes is a whole different subject and one that i don't need.
this is the sort of thing that git can do so i would not
want to reinvent that if i don't really need to (which at
present i don't).  i keep track of certain directories and
that's all i need.  for directories or file systems that i
need read only i use the read only mount feature or set
permissions.
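
  for a directory that is already under git, listing what
has changed since the last commit can be as small as the
sketch below.  it assumes the directory is a git repository
and that git is new enough to have -C; changed_files is a
made-up name:

```shell
#!/bin/sh
# changed_files DIR  (hypothetical helper, for illustration only)
# List files under the git-tracked directory DIR that are modified,
# staged, or untracked relative to the last commit, one per line.
changed_files () {
    git -C "$1" status --porcelain
}
```

  no output means nothing has changed in that directory.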


  songbird
