On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani <[email protected]> wrote:
> On Fri, May 4, 2018 at 8:42 AM, John <[email protected]> wrote:
>>
>> How can I catch if the program I have called with parallel gets killed by
>> the kernel due to running out of memory?
>> I would like to have an option that returns all the jobs that were not
>> able to finish. Is this possible?
>
> You can use parallel --joblog ~/my.log to output several pieces of
> information about jobs. One of those pieces is "ExitVal", which will tell
> you not only that your job completed unsuccessfully, but with what exit
> code. For example, instead of having to check dmesg for an "Out of memory:
> Kill process ..." message, you can safely assume 137 is from Linux's OOM
> killer having sent your process a SIGKILL (128 + 9).
>
> I usually run an ad hoc script to pick up the "stragglers" after a larger
> run, by parsing that file for any non-zero ExitVals and re-invoking the
> full command line associated with them. Of course, if the exit code was
> due to something deterministic, you'll just get non-zeros again and again
> unless you first fix the problem with the data/args of those specific
> invocations.

I would reckon that is a good approach.
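For the "pick up the stragglers" step, a minimal sketch could be to pull the
failed entries out of the joblog. This assumes the default tab-separated
joblog columns (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval,
Signal, Command); check the header line of my.log if in doubt:

  # list joblog entries that exited non-zero or were killed by a signal
  awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0)' my.log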
If your jobs have very varying memory usage, then first run a lot of them in
parallel:

  parallel --joblog my.log -j100% [...]

When that is done, run all the failed jobs again, but run only a single job
at a time, to give each job as much memory as possible:

  parallel --retry-failed --joblog my.log -j1

or:

  parallel --resume-failed --joblog my.log -j1 [...]

This last part is basically Douglas' ad hoc script. The difference between
--retry-failed and --resume-failed is described in the man page.

/Ole
