Follow-up Comment #12, bug #20808 (project wget):

I am trying to retrieve specific replays from the saved-game storage at
http://replays.wesnoth.org/1.12/

The site is just a plain directory/file listing.

Since the data is grouped per day over a two-year period, there are a lot of
subdirectories.

I tried to get the interesting replays with (see
http://forums.wesnoth.org/viewtopic.php?p=588686#p588686 ):

wget -e 'robots=off' -nc -c -np -r \
     -A 'Scrolling_Survival_Turn_1??_*.bz2' -A index.html \
     http://replays.wesnoth.org/1.12/

but each subdirectory's index page carries links for sorting the table
(query-string URLs), and for every such page (2 years * 365 days of them) wget
downloads those links only to reject and delete them afterwards.
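
These sort links typically look something like the following on each daily
index page (the date-named directory is only an illustration of the layout):

  http://replays.wesnoth.org/1.12/2014-01-01/?C=N;O=D
  http://replays.wesnoth.org/1.12/2014-01-01/?C=M;O=A
  http://replays.wesnoth.org/1.12/2014-01-01/?C=S;O=A
  http://replays.wesnoth.org/1.12/2014-01-01/?C=D;O=A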

It takes too long to wait (even given that wget reuses connections) while wget
does this useless work.

I quickly solved the task by scanning the index.html files myself: first fetch
them with wget (--level=1 does the job of limiting the amount of processing):

$ wget -r -np -A index.html --level=1 http://replays.wesnoth.org/1.12/
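
After this finishes, the mirror holds one index.html per daily subdirectory,
roughly like this (the date-style directory names are only an illustration):

  replays.wesnoth.org/1.12/index.html
  replays.wesnoth.org/1.12/2014-01-01/index.html
  replays.wesnoth.org/1.12/2014-01-02/index.html
  ...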

and then retrieve the files of interest:

$ find . -type f -name index.html | while read f; do
    # turn the local path back into the directory URL it was mirrored from
    p=${f#./}; p=http://${p%index.html}
    command grep -o 'href="Scrolling_Survival_Turn_[5-9]._[^"]*.bz2' "$f" |
      while read s; do s=${s#href='"'}; wget "$p$s"; done
  done

It would be nice to have the ability to specify which links to follow when
processing HTML files.
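
For what it's worth, a wget new enough to have --accept-regex/--reject-regex
(1.14 or later, if I remember right) can match the complete URL before
downloading it, so something like the sketch below might already skip the sort
links; the regex is only a guess at how those query strings look:

$ wget -e 'robots=off' -nc -c -np -r \
       --reject-regex '[?]C=' \
       -A 'Scrolling_Survival_Turn_1??_*.bz2' -A index.html \
       http://replays.wesnoth.org/1.12/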

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?20808>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/

