This small script will be included as an example in the next version. It is a parallel web spider (walking breadth first).
As it is going to be an example, I need to know whether anything in the script needs explaining or whether it is clear what is going on. Run it like this:

  PARALLEL=-j50 ./parallel-spider http://www.gnu.org/software/parallel

If you can change it into a parallel web mirroring tool (similar to wget -m), that would be great. I gave up after trying for 30 minutes.

/Ole

#!/bin/bash
# Breadth-first parallel web spider.
# E.g. http://www.gnu.org/software/parallel
URL=$1
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)

# Seed the work list and the seen list with the start URL.
echo "$URL" > "$URLLIST"
cp "$URLLIST" "$SEEN"

# Spider to get the URLs: each pass through the loop processes one
# breadth-first level.
while [ -s "$URLLIST" ] ; do
    cat "$URLLIST" |
        # Dump every link (including image links) from each page in parallel.
        parallel lynx -listonly -image_links -dump {} \; echo Spidered: {} \>\&2 |
        # Strip #fragments, extract the URL from lynx's "  1. URL" lines,
        # and suppress duplicates within this level.
        perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
        # Stay on the start site.
        grep -F "$URL" |
        # Drop URLs already seen in earlier levels...
        grep -v -x -F -f "$SEEN" |
        # ...and record the survivors as both seen and the next level's work list.
        tee -a "$SEEN" > "$URLLIST2"
    mv "$URLLIST2" "$URLLIST"
done
rm -f "$URLLIST" "$URLLIST2" "$SEEN"
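The perl one-liner is probably the least obvious part: lynx's -listonly output numbers each link as "  1. URL", so the filter strips any #fragment, keeps only the URL field, and suppresses duplicate lines. You can see it in isolation like this (the sample input line is made up, mimicking lynx's output format):

```shell
# Feed one fabricated line of `lynx -listonly -dump` output through the
# same perl filter used in the spider: the #fragment is stripped and
# only the bare URL is printed.
printf '   3. http://www.gnu.org/software/parallel/man.html#top\n' |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }'
# prints: http://www.gnu.org/software/parallel/man.html
```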
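The seen-list bookkeeping works because grep -v -x -F -f removes exact whole-line matches against everything seen so far, while tee -a appends the survivors back onto the seen list. A self-contained illustration with made-up URLs:

```shell
# SEEN already contains one URL; of the two candidates piped in, only
# the new one survives the grep and is appended to SEEN by tee.
SEEN=$(mktemp)
printf 'http://example.org/a\n' > "$SEEN"
printf 'http://example.org/a\nhttp://example.org/b\n' |
    grep -v -x -F -f "$SEEN" |
    tee -a "$SEEN"
# prints: http://example.org/b
cat "$SEEN"    # SEEN now holds both URLs
rm -f "$SEEN"
```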
