On Thu, Apr 30, 2015 at 11:02:31PM +0200, Tim Rühsen wrote:
> The top-down approach would be something like
>
>   wget -r --extract-links | distributor host1 host2 ... hostN
>
> 'distributor' is a program that starts one instance of wget on each host
> given, taking the (absolute) URLs via stdin and handing them to the wget
> instances (e.g. via round-robin... better would be to know whether a file
> download has been finished).
Yes, something like that, although it's not quite that simple. The distributor
would have to know what has just been downloaded by each worker, and invoke the
link extractor on every newly downloaded HTML file, in order to append the
links found in it to the download queue. (A rough sketch of that loop is at the
end of this mail.)

> I assume '-r --extract-links' does not download, but just recursively
> scans/extracts the existing files!?

Yes, that's exactly what I had in mind.

> Wget also has to be adjusted to start downloading immediately on the first
> URL read from stdin. Right now it collects all URLs until stdin closes and
> then starts downloading.

Ah, good point, I wasn't aware of that.

> I wrote a C library for the next-gen Wget (I'll start to move the code to
> wget this autumn) with which you can also do the extraction part. There are
> small C examples that you might extend to work recursively. It works with
> CSS and HTML.
>
> https://github.com/rockdaboot/mget/tree/master/examples

Nice, thank you! I'll check it out :-)
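
Just to make the round-robin part of the idea concrete, here is a rough Python
sketch of what such a 'distributor' could look like. Everything in it is an
assumption on my part: the use of ssh as the transport, the 'wget -nv -i -'
invocation (which, as you point out, only works well once wget starts
downloading before stdin is closed), and the plain round-robin scheduling.
The link-extraction feedback step is only marked as a TODO.

    #!/usr/bin/env python3
    # distributor.py -- hypothetical sketch, not part of wget.
    # Reads absolute URLs from stdin and fans them out round-robin to one
    # wget worker per host, started over ssh.  Assumes each remote wget
    # reads its URL list from stdin via 'wget -i -'.
    #
    # The "extract links from freshly downloaded HTML/CSS and feed them
    # back" part of the idea is NOT implemented; see the TODO below.

    import subprocess
    import sys
    from itertools import cycle

    def main(hosts):
        # One wget worker per host.
        workers = [
            subprocess.Popen(
                ["ssh", host, "wget", "-nv", "-i", "-"],
                stdin=subprocess.PIPE,
            )
            for host in hosts
        ]

        # Round-robin: hand each incoming URL to the next worker in turn.
        # A smarter distributor would track which worker is idle instead.
        for url, worker in zip(sys.stdin, cycle(workers)):
            url = url.strip()
            if not url:
                continue
            worker.stdin.write((url + "\n").encode())
            worker.stdin.flush()
            # TODO: watch what the worker downloaded, run the link
            # extractor on each new HTML/CSS file and push the extracted
            # URLs back into this loop (the feedback step discussed above).

        # No more URLs: close the pipes so the workers see EOF and finish.
        for worker in workers:
            worker.stdin.close()
        for worker in workers:
            worker.wait()

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("usage: distributor.py host1 [host2 ...]")
        main(sys.argv[1:])

It's only meant to show where the pieces would sit; the interesting work is
clearly in the feedback loop and in knowing when a worker is actually idle.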