Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede? I've had jobs die while checkpointing, and also hang mysteriously for no apparent reason. These might be separate problems.

The checkpointing issue occurred when I submitted several jobs and they all started checkpointing at the same time, after 3 hours. The hang happened after a few hours of evolution, with GDB reporting:
    MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
        at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
    296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)

Unfortunately I didn't ask for a full backtrace. I'm using MVAPICH2.

I've been in touch with support. They said the deaths during checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no idea about the mysterious hang. I repeated the hanging job and it ran fine.
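As a workaround for the simultaneous checkpoints, I'm considering staggering the walltime-based checkpoint intervals across jobs so they don't all hit the filesystem at once. A rough sketch of what I mean, assuming the usual IOUtil/CarpetIOHDF5 checkpoint parameters (values are illustrative and would differ per job):

    # Stagger walltime-triggered checkpoints so concurrent jobs don't all write at once.
    # Parameter names assumed from IOUtil/CarpetIOHDF5; per-job values are illustrative.
    IO::checkpoint_dir                  = "checkpoints"
    IO::checkpoint_keep                 = 2
    IO::checkpoint_on_terminate         = "yes"
    IOHDF5::checkpoint                  = "yes"
    IO::checkpoint_every_walltime_hours = 3.0    # job 1
    # IO::checkpoint_every_walltime_hours = 3.5  # job 2
    # IO::checkpoint_every_walltime_hours = 4.0  # job 3

I don't know whether that would have avoided the failures, but it should at least spread out the I/O load.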
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder