Follow-up Comment #12, bug #20808 (project wget): I am trying to retrieve specific replays from the saved-game storage at http://replays.wesnoth.org/1.12/
The site is just an ordinary directory/file listing. Because the data is grouped per day over a two-year period, there are a lot of subdirectories. I tried to get the replays I am interested in with (see http://forums.wesnoth.org/viewtopic.php?p=588686#p588686 ):

  wget -e 'robots=off' -nc -c -np -A 'Scrolling_Survival_Turn_1??_*.bz2' -A index.html -r http://replays.wesnoth.org/1.12/

but every subdirectory listing also carries links for sorting the table on the page (query strings), and on each of those pages (2 years * 365 days of them) wget downloads things it then rejects. Waiting for wget to do this useless work takes far too long, even though wget reuses connections.

I quickly solved the task by scanning the index.html files manually. First I fetched only the listings (--level=1 keeps the amount of processing bounded):

  $ wget -r -np -A index.html --level=1 http://replays.wesnoth.org/1.12/

and then retrieved the files I was interested in:

  $ find . -type f -name index.html | while read f; do p=${f#./}; p=http://${p%index.html}; command grep -o 'href="Scrolling_Survival_Turn_[5-9]._[^"]*.bz2' $f | while read s; do s=${s#href='"'}; wget $p$s; done; done

It would be nice to have the ability to list which links to follow when processing HTML files.
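For readability, here is the same two-step workaround written out as a commented script. This is only a sketch of the one-liners above; it assumes bash and GNU grep, and BASE/PATTERN are placeholder names I introduced, not part of the original commands.

  #!/bin/bash
  # Same two-step approach as above, spelled out.  Sketch only.
  BASE=http://replays.wesnoth.org/1.12/
  PATTERN='Scrolling_Survival_Turn_[5-9]._[^"]*\.bz2'

  # Step 1: fetch only the directory listings, one level deep, so wget
  # never has to download and throw away sort links or replay files.
  wget -r -np --level=1 -A index.html "$BASE"

  # Step 2: scan each saved listing for hrefs matching the replay pattern
  # and download every match from the directory it was listed in.
  find . -type f -name index.html | while read -r f; do
      dir=${f#./}                     # e.g. replays.wesnoth.org/1.12/<day>/index.html
      url=http://${dir%index.html}    # URL of the directory the listing came from
      grep -o "href=\"$PATTERN" "$f" | while read -r m; do
          m=${m#'href="'}             # drop the leading href="
          wget "$url$m"
      done
  done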
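For what it is worth, the kind of control asked for above can be expressed as a URL-level filter. Newer wget releases (1.14 and later, if I remember correctly) added --accept-regex/--reject-regex for exactly this. A hedged sketch of how that might look here, assuming the sort links are the usual Apache autoindex "?C=...;O=..." query strings:

  # Refuse to follow the autoindex sort links at all, instead of
  # downloading and then discarding them; everything else behaves
  # as in the first command above.
  wget -e 'robots=off' -nc -c -np -r \
       --reject-regex '\?C=' \
       -A 'Scrolling_Survival_Turn_1??_*.bz2' -A index.html \
       http://replays.wesnoth.org/1.12/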
