Hi!

> On 15 Oct 2015, at 4:38, Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, please try the attached patch. It will cause two messages to be output
> for each job: one indicating the job has been marked terminated, and the
> other reporting that the completion message was sent to the requestor. Let's
> see what that tells us.

In this run of 42, 6 did not return, therefore 36 completed successfully.

$ grep TERMINATED dvm_output-patched.txt |wc -l
72
$ grep NOTIFYING dvm_output-patched.txt |wc -l
36
$ grep "Releasing job data" dvm_output-patched.txt |wc -l
77
$ grep "sess_dir_finalize" dvm_output-patched.txt |wc -l
36
$ grep "Releasing job data for.*," dvm_output-patched.txt|sort -k4 -t"," -n|wc -l
35

So interestingly this is 35, and not 36.

$ grep "Releasing job data for.*," dvm_output-patched.txt|sort -k4 -t"," -n|head
[netbook:06716] [[9528,0],0] Releasing job data for [9528,2]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,8]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,9]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,10]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,12]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,13]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,14]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,15]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,16]
[netbook:06716] [[9528,0],0] Releasing job data for [9528,17]

Which means tasks 1, 3, 4, 5, 6, 7 and 11 didn't return.
That shows a clear bias towards the "early" tasks.

Hopefully this gives you more insight.

Thanks!

Mark
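
P.S. Reading the gaps off the sorted output by eye is error-prone, so here is a rough sketch of a helper that prints the missing job numbers directly. It assumes the jobs in a run are numbered 1..N within the job family (as the [9528,N] entries above suggest) and that the log lines keep the "Releasing job data for [family,job]" shape. The function name missing_jobs is just something I made up for illustration.

```shell
# Print every job number in 1..N that has no "Releasing job data for [x,N]" line.
# Hypothetical usage: missing_jobs 42 < dvm_output-patched.txt
missing_jobs() {
    awk -v total="$1" '
        # Only consider lines such as: ... Releasing job data for [9528,2]
        /Releasing job data for \[[0-9]+,[0-9]+\]/ {
            id = $0
            sub(/.*Releasing job data for \[[0-9]+,/, "", id)  # strip up to the comma
            sub(/\].*/, "", id)                                 # strip the closing bracket
            seen[id + 0] = 1                                    # remember the numeric job id
        }
        END {
            # Assumption: jobs are numbered consecutively from 1 to total.
            for (i = 1; i <= total; i++)
                if (!(i in seen)) print i
        }'
}
```

It reads the log on stdin and takes the number of submitted jobs as its only argument, so it can be pointed at any later run without changing the sort/head pipeline.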