Okay, fixed and cmr'd to you

On Mar 18, 2014, at 11:00 AM, Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Mar 18, 2014, at 10:54 AM, Dave Goodell (dgoodell) <dgood...@cisco.com> wrote:
> 
>> Ralph,
>> 
>> I'm seeing problems with MPIEXEC_TIMEOUT in v1.7 @ r31103 (fairly close to 
>> HEAD):
>> 
>> ----8<----
>> MPIEXEC_TIMEOUT=8 mpirun --mca btl usnic,sm,self -np 4 ./sleeper
>> --------------------------------------------------------------------------
>> The user-provided time limit for job execution has been
>> reached:
>> 
>> MPIEXEC_TIMEOUT: 8 seconds
>> 
>> The job will now be aborted. Please check your code and/or
>> adjust/remove the job execution time limit (as specified
>> by MPIEXEC_TIMEOUT in your environment).
>> 
>> --------------------------------------------------------------------------
>> srun: error: mpi015: task 0: Killed
>> srun: Terminating job step 689585.2
>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>> ^C[savbu-usnic-a:26668] [[14634,0],0]->[[14634,0],1] mca_oob_tcp_msg_send_bytes: write failed: Connection reset by peer (104) [sd = 16]
>> [savbu-usnic-a:26668] [[14634,0],0]-[[14634,0],1] mca_oob_tcp_peer_send_handler: unable to send header
>> 
>> ^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
>> 
>> ^C
>> ----8<----
>> 
>> Where each "^C" is a ctrl-c pressed after an arbitrary amount of time was 
>> allowed to pass beforehand (several minutes for the first two, <5s for the third).
>> 
>> Where "sleeper" is just an MPI program that does:
>> 
>> ----8<----
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>>   MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>> 
>>   while (1) {
>>       sleep(60);
>>   }
>> 
>>   MPI_Finalize();
>> ----8<----
>> 
>> It happens under slurm and SSH.  If I launch on localhost (no 
>> --host/--hostfile option, no slurm, etc.) then it exits just fine.  The 
>> example output I gave above used the "usnic" BTL, but "tcp" has identical 
>> behavior.
>> 
>> This worked fine in v1.7.4.  I've bisected the change in behavior down to 
>> r30981: https://svn.open-mpi.org/trac/ompi/changeset/30981
>> 
>> Should I file a ticket?
>> 
> 
> Crud - no, I'll take a look in a little bit
> 
> 
>> -Dave
>> 
> 
