Hi,
I have been having problems running on Stampede for a long time. I
couldn't get the latest stable ET to run because it would die during
checkpointing. I had to fall back to the Orsted version (unfortunately,
that has a bug in the way the grid is set up, causing some of the
intermediate levels to span both black holes, wasting a lot of memory).
Even with Orsted, stalling is a real issue. Currently, my "solution" is
to run for 4 hours at a time. This would have been OK on Lonestar or
Ranger, because when I chained a bunch of runs, the next in line would
start almost right away, but on Stampede the delay is quite substantial.
I believe Jim Healy opened a ticket concerning the RIT issues with
running ET on Stampede.
On 05/02/2014 05:55 AM, Ian Hinder wrote:
Hi all,
Has anyone run into problems recently with Cactus jobs on Stampede?
I've had jobs die when checkpointing, and also hang
for no apparent reason. These might be separate problems. The
checkpointing issue occurred when I submitted several jobs and they
all started checkpointing at the same time after 3 hours. The hang
happened after a few hours of evolution, with GDB reporting
MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13) at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
296 for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've
been in touch with support and they said the dying while checkpointing
coincided with the filesystems being hit hard by my jobs, which makes
sense, but they didn't see any problems in their logs, and they have
no idea about the mysterious hang. I repeated the hanging job and it
ran fine.
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users