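I have implemented --halt 10% in git to mean: at least 10% of the jobs run so far must have failed, and at least 3 jobs must have failed. The minimum of 3 was necessary to avoid too many false positives if the percentage is 50% or higher (otherwise a failure of the very first job would already trigger the halt).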
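To make the rule concrete, here is a rough Python sketch of the check as described above. It is not the code in git (GNU Parallel is written in Perl). The function name `should_halt`, its parameters, the inclusive comparison, and reading "jobs run so far" as "jobs finished so far" are all assumptions made for illustration.

```python
def should_halt(failed, finished, percent, min_failed=3):
    """Return True when the failure rate among finished jobs reaches the limit.

    Assumptions (not taken from the actual implementation): the comparison
    is inclusive, and "jobs run so far" means jobs that have finished.
    """
    if finished == 0:
        return False
    # Integer arithmetic avoids floating-point rounding at the boundary.
    rate_reached = failed * 100 >= percent * finished
    return failed >= min_failed and rate_reached


print(should_halt(failed=1, finished=1, percent=50))    # False: fewer than 3 failures
print(should_halt(failed=3, finished=30, percent=10))   # True: 3 failures, 10% of 30
print(should_halt(failed=3, finished=100, percent=10))  # False: only 3% failed
```

The first call shows why the minimum matters: without it, a single failing first job would already count as 100% failed.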
Feedback welcome.

/Ole

On Sat, Jul 19, 2014 at 1:17 AM, Ole Tange <[email protected]> wrote:
> On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme <[email protected]> wrote:
>
>> There are currently three options to "--halt" - ignore (0), stop new jobs
>> (1), or kill everything (2).
>>
>> I propose an additional option; to set the number of job failures before
>> doing anything. This would then allow some tolerance of failure but would
>> catch global problems.
>>
>> Consider this example - running a 1000 jobs each of around 1hr, where a
>> random handful will fail due to unexpected bad data or other unforeseen bug,
>> but the overwhelming majority will complete successfully.
>>
>> Setting --halt 0 all jobs will run, and I can check for the failures
>> afterwards. Great! However, say I forget to create the results directory, so
>> every "good" job runs for full time then fails right at the end…if I wasn’t
>> monitoring I just wasted 1000hrs of processing time.
>
> This I do not understand. GNU Parallel 20140622 creates the dirs
> before running, so your version is broken:
>
>   $ parallel --results /tmp/this/does/not/exist echo ::: 1
>   1
>   $ ls /tmp/this/does/not/exist/1/1/
>   stderr stdout
>
>> Setting halt > 0 the job will stop at or just after the first problem. I
>> have to check the logs, figure out and fix if possible, rerun with previous
>> success excluded etc.
>
> Using --resume-failed.
>
>> What I would like is to say set the number of tolerable failures to the
>> number of workers. Then a serious bug would be caught after the first
>> iteration, but the entire job would run and handle some measure of bad input
>> data.
>
> You need to give a reproducible example where you cannot just use
> --halt 0 and then later --resume-failed when you have fixed the
> bug/the input data.
>
>> Does this make sense? Unfortunately it would require changing the current
>> flags, either adding another or changing the current halt options.
>
> One possibility for syntax is --halt 10% to allow 10% to fail.
>
>
> /Ole
