
I have been having problems running on Stampede for a long time. I couldn't get the latest stable ET to run because it would die during checkpointing. I had to backtrack to the Orsted version (unfortunately, that has a bug in the way the grid is set up, causing some of the intermediate levels to span both black holes, wasting a lot of memory). Even with Orsted, stalling is a real issue. Currently, my "solution" is to run for 4 hours at a time. This would have been OK on Lonestar or Ranger, because when I chained a bunch of runs, the next in line would start almost right away, but on Stampede the delay is quite substantial. I believe Jim Healy opened a ticket concerning the RIT issues with running ET on Stampede.

On 05/02/2014 05:55 AM, Ian Hinder wrote:
Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede? I've had jobs die when checkpointing, and also hang mysteriously for no apparent reason. These might be separate problems. The checkpointing issue occurred when I submitted several jobs and they all started checkpointing at the same time after 3 hours. The hang happened after a few hours of evolution, with GDB reporting

MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
  at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
296    for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size;

Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've been in touch with support, and they said the deaths during checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no idea about the mysterious hang. I repeated the hanging job and it ran fine.

Ian Hinder

Users mailing list
