We're finding that a lot of our jobs get stuck in a state where Thermos repeatedly retries failed processes.
I ran through one of these with Brian Wickman, who noted that in that particular case the process in question was exiting with -6 (SIGABRT), which Thermos doesn't consider a fatal enough signal to be concerned with, so it retries. I'm now seeing another process that seems to be getting killed with -9 (SIGKILL), and Thermos is still restarting it; the following is pulled from thermos_runner.DEBUG: https://gist.github.com/helgridly/e4413fd01d45b8c6d1c0

So: what's going on here? All our tasks set max_failures = 1, but that doesn't seem to be preventing Thermos from retrying processes. It looks like Thermos is interpreting various kill signals as "lost" rather than failed, and retrying without incrementing the failure count. What I can't find is what calls on_killed in runner.py, nor can I figure out what to do about any of this.

Brian and I talked about wrapping all commands in a shell script that exits 0 if its child command did and 1 in all other cases (rough sketch in the P.S. below). While this might work, I don't really understand why it's necessary; can someone explain the reasoning behind Thermos ever deciding a process got "lost" rather than simply failed?

Thanks,
Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
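
P.S. For concreteness, the wrapper we discussed would look roughly like the sketch below. It's untested and just illustrates the idea: the shell reports a child killed by signal N as exit status 128+N, so collapsing every non-zero status to 1 should make a killed child look like an ordinary process failure to Thermos. (It obviously wouldn't help if the wrapper process itself is the one that gets signalled.)

    #!/bin/sh
    # Hypothetical wrapper: run the real command exactly as given.
    "$@"
    status=$?

    # Exit 0 only if the child exited 0; map everything else
    # (non-zero exits and deaths by signal, which the shell
    # reports as 128+N) to a plain exit 1.
    if [ "$status" -eq 0 ]; then
        exit 0
    fi
    exit 1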
