On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and 
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with 
> the total data of 556MB and generate the digests. Is there a way to make this 
> script faster? Maybe generate digests in parallel?
> 
> for path in $output
> do
>     # sha256sum
>     digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | 
> awk '{ print $1 }')
>     (( count ++ ))
> done

This is not a bash question, so in future please ask on a more appropriate
list, one oriented towards users rather than developers.
Off the top of my head I'd do something like the following to get xargs to 
parallelize:

digests=( $(
 find "$output" -type f |
 xargs -I '{}' -n1 -P$(nproc) \
 sh -c "$HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum" |
 cut -f1 -d' '
) )
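
One caveat with the above: with -P the jobs finish in arbitrary order, so the
digests no longer line up with the order of the paths. A rough sketch that
keeps each digest next to its path might look like the following. The function
name parallel_digests and the READER variable are made up here for
illustration, and it assumes GNU xargs (for -d) and nproc are available.

```shell
#!/usr/bin/env bash
# Sketch only: hash files in parallel, keeping each digest paired with its path.
# READER is a hypothetical variable (not part of hadoop); it defaults to the
# hdfs command from the original script and can be set to "cat" for local files.
: "${READER:=$HADOOP_HOME/bin/hdfs dfs -cat}"

# Reads one path per line on stdin.
# Writes "digest  path" lines on stdout, sorted by path.
parallel_digests() {
    # The path is passed to sh as a positional argument ("$1") rather than
    # being pasted into the command string, so odd characters in a path
    # cannot break the quoting.
    READER=$READER xargs -d '\n' -n1 -P"$(nproc)" sh -c '
        digest=$($READER "$1" | sha256sum | cut -d" " -f1)
        printf "%s  %s\n" "$digest" "$1"
    ' sh | sort -k2
}
```

With the paths one per line on stdin, something like
mapfile -t lines < <(printf '%s\n' $output | parallel_digests)
would then give one "digest  path" entry per array element.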

You might want to distribute that load across systems too
with something like dxargs or perhaps something like hadoop :p

thanks,
Pádraig.
