Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede?  I've had 
jobs die while checkpointing, and also hang for no apparent reason.  These 
might be separate problems.  The checkpointing issue occurred when I submitted 
several jobs and they all started checkpointing at the same time, after 3 
hours.  The hang happened after a few hours of evolution, with GDB reporting

> MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>   at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
> 296       for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)

Unfortunately I didn't ask for a backtrace.  I'm using mvapich2.  I've been in 
touch with support, and they said the deaths during checkpointing coincided 
with my jobs hitting the filesystems hard, which makes sense, but they didn't 
see any problems in their logs, and they have no idea about the mysterious 
hang.  I resubmitted the hanging job and it ran fine.
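In case it's useful to anyone hitting the same hang: next time, full backtraces 
can be grabbed from all ranks on a node without killing the job, e.g. with 
something like the sketch below.  This assumes gdb is available on the compute 
nodes and that the executable is called "cactus_sim"; both the binary name and 
how you reach each node of the job are of course site-specific.

```shell
#!/bin/sh
# Sketch: dump a backtrace from every local process whose name matches
# "cactus_sim" (an assumed binary name -- adjust for your executable).
# Run this on each compute node of the hung job.
for pid in $(pgrep cactus_sim); do
    echo "=== PID $pid ==="
    # Attach non-interactively, print all thread backtraces, and detach.
    gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null
done
```

With mvapich2 spinning in a polling loop like the one above, "thread apply all 
bt" from a few ranks is usually enough to see which rank everyone else is 
waiting on.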

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
