On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> the total data of 556MB, and generate the digests. Is there a way to make
> this script faster? Maybe generate digests in parallel?
>
> for path in $output
> do
>     # sha256sum
>     digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | awk '{ print $1 }')
>     (( count ++ ))
> done
This is not a bash question, so please ask on a more appropriate
user-oriented (rather than developer-oriented) list in future.

Off the top of my head I'd do something like the following to get
xargs to parallelize:

  digests=( $( find "$output" -type f |
               xargs -I '{}' -n1 -P$(nproc) \
                 sh -c "$HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum" |
               cut -f1 -d' ' ) )

You might want to distribute that load across systems too, with
something like dxargs, or perhaps something like hadoop :p

thanks,
Pádraig.
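For what it's worth, because -P$(nproc) lets jobs finish out of order, the
digests in that array won't necessarily line up with the order of files
from find. A minimal sketch that keeps each digest explicitly paired with
its path (assuming bash 4+ for the associative array, and the same $output
and $HADOOP_HOME as in the original script) might look like:

  #!/bin/bash
  # Map each file path to its sha256 digest; pairing them explicitly
  # means out-of-order completion under xargs -P doesn't matter.
  declare -A digests

  while read -r digest path; do
      digests[$path]=$digest
  done < <(
      find "$output" -type f |
      xargs -P"$(nproc)" -I '{}' \
          sh -c 'd=$("$HADOOP_HOME"/bin/hdfs dfs -cat "$1" | sha256sum)
                 printf "%s %s\n" "${d%% *}" "$1"' _ '{}'
  )

  # Example use: print digest and path for every file hashed.
  for p in "${!digests[@]}"; do
      printf '%s  %s\n' "${digests[$p]}" "$p"
  done

Note the path is passed to sh as a positional parameter rather than being
substituted into the script text, which avoids quoting problems if a file
name contains spaces or quote characters.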