Okay, so each orte-submit is reporting that its job has launched, which means the hang occurs while it waits to hear that the job completed. Are you sure that orte-dvm believes the job has completed? In other words, when you say that you observe the job as completing, are you basing that on some output from orte-dvm, or on the procs having exited, or...?
I can send you a patch tonight that would cause orte-dvm to emit a "job completed" message when it determines each job has terminated - might help us take the next step. I'm wondering if orte-dvm thinks the job is still running, and the race condition is in that area (as opposed to being in orte-submit itself).

On Wed, Oct 14, 2015 at 1:01 PM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:

> Hi Ralph,
>
> On 14 Oct 2015, at 21:50 , Ralph Castain <r...@open-mpi.org> wrote:
> > I wonder if they might be getting duplicate process names if started
> > quickly enough. Do you get the "job has launched" message (orte-submit
> > outputs a message after orte-dvm responds that the job launched)?
>
> Based on the output below I would say that both columns with IDs are
> unique.
>
> Thanks
>
> Mark
>
> $ head orte-log.txt
> [netbook:90327] Job [24532,1] has launched
> [netbook:90326] Job [24532,2] has launched
> [netbook:90331] Job [24532,3] has launched
> [netbook:90330] Job [24532,4] has launched
> [netbook:90332] Job [24532,5] has launched
> [netbook:90328] Job [24532,6] has launched
> [netbook:90329] Job [24532,7] has launched
> [netbook:90325] Job [24532,8] has launched
> [netbook:90335] Job [24532,9] has launched
> [netbook:90333] Job [24532,10] has launched
>
> $ cat orte-log.txt | cut -f1 -d" "| sort | uniq -c | wc -l
> 42
> $ cat orte-log.txt | cut -f3 -d" "| sort | uniq -c | wc -l
> 42
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18167.php
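
A side note on the uniqueness check quoted above: `sort | uniq -c | wc -l` counts the distinct values in a column, so it only demonstrates uniqueness when that count equals the total number of "has launched" lines in the log. A rough sketch of a check that lists the duplicates directly is below. It assumes the log lines look exactly like the ones in the quoted `head` output; the function name `find_duplicates` and the regular expression are illustrative only and are not part of any Open MPI tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted log excerpt:
#   [netbook:90327] Job [24532,1] has launched
LAUNCH_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] Job \[(?P<job>\d+,\d+)\] has launched"
)

def find_duplicates(lines):
    """Return (duplicate PIDs, duplicate job IDs) among the launch messages."""
    pids, jobs = Counter(), Counter()
    for line in lines:
        m = LAUNCH_RE.match(line.strip())
        if m:  # lines that are not launch messages are ignored
            pids[m.group("pid")] += 1
            jobs[m.group("job")] += 1
    dup_pids = sorted(p for p, n in pids.items() if n > 1)
    dup_jobs = sorted(j for j, n in jobs.items() if n > 1)
    return dup_pids, dup_jobs

# Sample input copied from the first lines of the quoted log.
sample = [
    "[netbook:90327] Job [24532,1] has launched",
    "[netbook:90326] Job [24532,2] has launched",
    "[netbook:90331] Job [24532,3] has launched",
]

dup_pids, dup_jobs = find_duplicates(sample)
print("duplicate PIDs:", dup_pids)
print("duplicate job IDs:", dup_jobs)
```

To run it against the real log, the `sample` list could be replaced with the lines read from `orte-log.txt`. Two empty lists would mean that no submitter PID and no job ID appears more than once.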