On Tue, Sep 25, 2012 at 3:47 AM, Joseph White <[email protected]> wrote:
> cat url-list | parallel --eta --progress --joblog jobnew.log -j0 ./linkcheck {} >> errors.log
>
> Here is an example of what the url-list contains:
>
> http://www.hairforsale.com
> http://www.rdhjobs.com
> http://www.gdha.org
> http://www.hotdogsafari.com
>
> Would using the pipe command speed up the process significantly?
The --pipe option completely changes the way GNU Parallel works, just like
'xargs' and 'cat' are not the same and cannot be used to solve the same
problems: parallel (without --pipe) is similar to xargs; parallel --pipe is
similar to cat. But using parallel in --pipe mode to start parallel without
--pipe might be the solution; see below.

> Also would the -u option speed up the process?

-u will speed up parallel, but it comes at a price: the output can get
mixed up and is therefore no good for further processing. Since you save
the output, I advise against it.

I assume that you have measured and can actually see that GNU Parallel is
the limiting factor in your process (i.e. it is using 100% CPU). In that
case what is limiting you is GNU Parallel's ability to spawn jobs quickly
enough.

What you can do is split the url-list into chunks and run multiple GNU
Parallels in parallel. And you can of course use GNU Parallel to do just
that:

  cat url-list | parallel -j10 --pipe parallel -j0 ./linkcheck >> errors.log

This will read url-list in chunks of 1 MB and pass each chunk to the
second parallel, which will spawn a linkcheck for each line in the chunk.
If -j0 normally spawns 500 jobs, then the above will spawn 5000 jobs. You
can adjust -j10, but be warned: using -j0 instead is likely to kill your
machine, as that would try to spawn 250000 jobs.

/Ole
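
A minimal sketch of the xargs/cat distinction described above (the inputs
are only illustrative):

  # Without --pipe, each input line becomes an argument and gets its own
  # job, like xargs:
  printf 'a\nb\nc\n' | parallel echo got argument {}

  # With --pipe, the input is cut into blocks (1 MB by default) and fed to
  # each job on stdin, like cat; wc -l here prints one count per block:
  seq 300000 | parallel --pipe wc -l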

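If the 1 MB default produces too few or too uneven chunks for the
nested-parallel trick, the chunk size can be changed with --block (a sketch
along the lines of Ole's command; the 4M block size and the -j values are
only examples):

  # Cut url-list into 4 MB chunks; run 10 inner parallels, each spawning
  # one linkcheck per URL line in its chunk:
  cat url-list | parallel -j10 --pipe --block 4M parallel -j0 ./linkcheck >> errors.log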