On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani <[email protected]> wrote:
> On Fri, May 4, 2018 at 8:42 AM, John <[email protected]> wrote:
>>
>> How can I catch if the program I have called with parallel gets killed by
>> the kernel due to running out of memory?
>> I would like to have an option that returns all the jobs that were not
>> able to finish. Is this possible?
>
> You can use parallel --joblog ~/my.log to output several pieces of
> information about jobs. One of those pieces is "ExitVal", which will tell
> you not only that your job completed unsuccessfully, but with what exit
> code. For example, instead of having to check dmesg for an "Out of memory:
> Kill process ..." message, you can safely assume 137 is from Linux's OOM
> killer having sent your process a SIGKILL (128 + 9).
>
> I usually run an ad hoc script to pick up the "stragglers" after a larger
> run, by parsing that file for any non-zero ExitVals and re-invoking the
> full command line associated with them. Of course, if the exit code was
> due to something deterministic, you'll just get non-zeros again and again
> unless you first fix the problem with the data/args of those specific
> invocations.

I would reckon that is a good approach.
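For the "pick up the stragglers" step, a minimal sketch could be to pull the
failed entries out of the joblog. This assumes the default tab-separated
joblog columns (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval,
Signal, Command); check the header line of my.log if in doubt:

  # list joblog entries that exited non-zero or were killed by a signal
  awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0)' my.log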
If your jobs have very varying memory usage, then first run a lot of them in
parallel:

  parallel --joblog my.log -j100% [...]

When that is done, run all the failed jobs again, but run only a single job
at a time, to give each job as much memory as possible:

  parallel --retry-failed --joblog my.log -j1

or:

  parallel --resume-failed --joblog my.log -j1 [...]

This last part is basically Douglas' ad hoc script. The difference between
--retry-failed and --resume-failed is described in the man page.

/Ole
