Okay, so each orte-submit is reporting that its job has launched, which means
the hang occurs while waiting to hear that the job completed. Are you sure that
orte-dvm believes the job has completed? In other words, when you say that
you observe the job as completing, are you basing that on some output from
orte-dvm, or because the procs have exited, or...?

I can send you a patch tonight that would cause orte-dvm to emit a "job
completed" message when it determines each job has terminated, which might
help us take the next step. I'm wondering if orte-dvm thinks the job is still
running, and the race condition is in that area (as opposed to being in
orte-submit itself).
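
As an aside on the uniqueness check in the quoted output below: a slightly more direct variant is to list only the values that repeat, rather than counting the distinct ones. This is just a minimal sketch, assuming the log lines keep the "[host:pid] Job [jobfam,id] has launched" form shown in the quoted orte-log.txt; the helper name find_dups is hypothetical.

```shell
# Print every value that occurs more than once in one space-separated
# field of a log file. find_dups is a hypothetical helper name.
# Usage: find_dups FIELD FILE
find_dups() {
    # cut extracts the field, sort groups equal values together,
    # and uniq -d prints one copy of each value that is repeated.
    cut -d' ' -f"$1" "$2" | sort | uniq -d
}
```

Running "find_dups 1 orte-log.txt" checks the [host:pid] column and "find_dups 3 orte-log.txt" checks the job ID column. No output from either call means every value in that column is unique.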



On Wed, Oct 14, 2015 at 1:01 PM, Mark Santcroos <mark.santcr...@rutgers.edu>
wrote:

> Hi Ralph,
>
> > On 14 Oct 2015, at 21:50, Ralph Castain <r...@open-mpi.org> wrote:
> >
> > I wonder if they might be getting duplicate process names if started
> > quickly enough. Do you get the "job has launched" message (orte-submit
> > outputs a message after orte-dvm responds that the job launched)?
>
> Based on the output below I would say that both columns with IDs are
> unique.
>
> Thanks
>
> Mark
>
> $ head orte-log.txt
> [netbook:90327] Job [24532,1] has launched
> [netbook:90326] Job [24532,2] has launched
> [netbook:90331] Job [24532,3] has launched
> [netbook:90330] Job [24532,4] has launched
> [netbook:90332] Job [24532,5] has launched
> [netbook:90328] Job [24532,6] has launched
> [netbook:90329] Job [24532,7] has launched
> [netbook:90325] Job [24532,8] has launched
> [netbook:90335] Job [24532,9] has launched
> [netbook:90333] Job [24532,10] has launched
>
> $ cat orte-log.txt | cut -f1 -d" "| sort | uniq -c | wc -l
>       42
> $ cat orte-log.txt | cut -f3 -d" "| sort | uniq -c | wc -l
>       42
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18167.php
>
