Re: [OMPI users] Job fails after hours of running on a specific node

2009-12-07 Thread Sangamesh B
Hello Pasha, As the error was not recurring frequently, I had not looked into the issue for a long time. But now I have started to diagnose it: Initially I tested with ibv_rc_pingpong (master node to all compute nodes, using a for loop). It works for each of the nodes. The files generated o
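
The per-node ibv_rc_pingpong test described above could be scripted roughly as follows. This is only a sketch, not the loop that was actually run: it assumes passwordless ssh from the master to each compute node, that ibv_rc_pingpong is on the PATH on both sides, and that a fixed sleep is enough for the server side to start listening. The function names and the delay and timeout values are hypothetical.

```python
import subprocess
import time


def pingpong_commands(node):
    """Return (server_argv, client_argv) for one master-to-node test."""
    # Server side: started on the compute node over ssh, waits for a client.
    server = ["ssh", node, "ibv_rc_pingpong"]
    # Client side: run on the master, connects to the compute node.
    client = ["ibv_rc_pingpong", node]
    return server, client


def run_pair(node, startup_delay=2.0, timeout=60):
    """Run one test; return True if the client exits with status 0."""
    server_argv, client_argv = pingpong_commands(node)
    server = subprocess.Popen(server_argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    try:
        # Assumption: a short fixed wait is enough for the server to listen.
        time.sleep(startup_delay)
        client = subprocess.run(client_argv, capture_output=True,
                                text=True, timeout=timeout)
        return client.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        # Stop the server if it is still waiting, then reap it.
        if server.poll() is None:
            server.terminate()
        server.wait()


def check_nodes(nodes, runner=run_pair):
    """Map each node name to True (test passed) or False (test failed)."""
    return {node: runner(node) for node in nodes}
```

Calling check_nodes() with the cluster's real host names returns a dictionary of pass/fail results per node. A passing short pingpong only shows that the link works at that moment, so it may not reproduce a failure that appears after hours of running.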

Re: [OMPI users] Job fails after hours of running on a specific node

2009-09-21 Thread Pavel Shamis (Pasha)
Sangamesh, The IB tunings that you added to your command line only delay the problem; they don't resolve it. node-0-2.local gets the asynchronous event "IBV_EVENT_PORT_ERROR", and as a result the processes fail to deliver packets to some remote hosts, which is why you see a bunch of IB errors. T
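
One rough way to watch for a port that has left the active state on a node such as node-0-2.local is to poll the port state that ibv_devinfo reports. The sketch below does not catch the asynchronous event itself (that would be done in C through ibv_get_async_event), and it says nothing about the cause of the error. It assumes the usual ibv_devinfo text layout with "hca_id:", "port:" and "state:" lines; the function names are hypothetical.

```python
import re
import subprocess


def parse_port_states(devinfo_text):
    """Return {(hca_id, port_number): state} from ibv_devinfo-style text."""
    states = {}
    hca = None
    port = None
    for line in devinfo_text.splitlines():
        m = re.match(r"\s*hca_id:\s*(\S+)", line)
        if m:
            # A new adapter section starts; forget the previous port.
            hca, port = m.group(1), None
            continue
        m = re.match(r"\s*port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"\s*state:\s*(\w+)", line)
        if m and hca is not None and port is not None:
            states[(hca, port)] = m.group(1)
    return states


def inactive_ports(states):
    """Return the sorted (hca_id, port) keys whose state is not PORT_ACTIVE."""
    return sorted(key for key, state in states.items()
                  if state != "PORT_ACTIVE")


def local_port_states():
    """Run ibv_devinfo on this host and parse its output."""
    out = subprocess.run(["ibv_devinfo"], capture_output=True,
                         text=True, check=True).stdout
    return parse_port_states(out)
```

Running inactive_ports(local_port_states()) periodically on the suspect node, and logging the timestamp whenever the list is non-empty, would give a record to compare against the times at which the job fails.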