On Tue, Sep 25, 2012 at 3:47 AM, Joseph White <[email protected]> wrote:
:
> cat url-list | parallel --eta --progress --joblog jobnew.log -j0 ./linkcheck {} >> errors.log
:
> Here is an example of what the url-list contains:
>
> http://www.hairforsale.com
> http://www.rdhjobs.com
> http://www.gdha.org
> http://www.hotdogsafari.com
>
> Would using the pipe command speed up the process significantly?

The --pipe option completely changes the way GNU Parallel works, just
as 'xargs' and 'cat' are not the same and cannot be used to solve the
same problems: parallel (without --pipe) is similar to xargs, while
parallel --pipe is similar to cat. But using parallel in --pipe mode
to start parallel without --pipe might be the solution; see below.
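
A minimal sketch of the difference (echo and wc here are only
stand-in commands, not part of your setup):

  # Without --pipe each input line becomes a command argument
  # (xargs-like): this runs 'echo a', 'echo b', 'echo c'
  printf 'a\nb\nc\n' | parallel echo

  # With --pipe the input is chopped into blocks and fed to the
  # command's stdin (cat-like): wc -l reads its whole block from stdin
  printf 'a\nb\nc\n' | parallel --pipe wc -l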

> Also would the -u option speed up the process?

-u will speed up parallel, but it comes at a price: the output from
different jobs can be mixed together and is thus no good for further
processing. Since you save the output, I advise against it.
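
You can watch the mixing happen with a toy example (the commands and
sleep times are just for illustration, and it assumes at least 2 job
slots):

  # With -u you may see 'start 2', 'start 1', 'end 1', 'end 2'
  # interleaved; drop -u and each job's output is grouped instead
  parallel -u 'echo start {}; sleep {}; echo end {}' ::: 2 1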

I assume that you have measured and that you actually see that GNU
Parallel is the limiting factor in your process (i.e. it is using 100%
CPU). So what is limiting you is GNU Parallel's ability to spawn jobs
quickly enough. What you can do is split the url-list into chunks and
run multiple GNU Parallels in parallel. And you can of course use GNU
Parallel to do just that:

  cat url-list | parallel -j10 --pipe parallel -j0 ./linkcheck >> errors.log

This will read url-list in chunks of 1 MB and pass each chunk to the
second parallel, which will spawn linkcheck for each line in the chunk.
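
The 1 MB is just the default --block size; if you want the chunks
smaller or bigger you can set it explicitly (100k below is only an
example value):

  cat url-list | parallel -j10 --pipe --block 100k parallel -j0 ./linkcheck >> errors.log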

If -j0 normally spawns 500 jobs, then the above will spawn 5000 jobs.

You can adjust -j10, but be warned: using -j0 instead is likely to
kill your machine, as that will try to spawn 250000 jobs.


/Ole
