I think it's an sm bug again. I tested with the latest revision, which I believe was r19588 (the last one before Jeff shut the svn down). I ran the mpi_p test (bandwidth between pairs of nodes) on many nodes and it hung on the second iteration. The same test works without sm. I am sorry I couldn't test it earlier.

# i=1 ; while [ 1 ] ; do echo " ****************** i=$i ******** "; /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 84 -hostfile hostfile /home/USERS/lenny/TESTS/TRUNK/mpi_p1_4_TRUNK -t bw ; let i=i+1; sleep 1 ; done
 ****************** i=1 ********
BW (84) (size min max avg) 1048576 660.152249 2075.115025 1325.838953
 ****************** i=2 ********
[stuck]
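The loop above blocks forever once mpirun hangs. A variant that reports the hang and stops could look like the sketch below. It assumes GNU coreutils `timeout` is installed on the launch node; the function name `run_or_flag_hang` and the 300-second limit are made up for illustration and are not part of the original test.

```shell
# Sketch only: run a command under GNU coreutils `timeout` so that a hung
# run is reported instead of blocking the loop forever.
# The function name and the time limit are illustrative assumptions.
run_or_flag_hang() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the exit status timeout uses when the command timed out
        echo "HANG: no completion within ${limit}s"
    fi
    return "$rc"
}

# Possible use with the command from the loop above (300 s is arbitrary):
#   i=1
#   while run_or_flag_hang 300 /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun \
#           -np 84 -hostfile hostfile \
#           /home/USERS/lenny/TESTS/TRUNK/mpi_p1_4_TRUNK -t bw ; do
#       i=$((i+1)); sleep 1
#   done
#   echo "stopped at iteration $i"
```

Note that `timeout` terminates mpirun when the limit expires, so stack traces from the hung processes would have to be collected before that point.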
P.S. I will be on vacation until 5-Oct, but I hope to follow the mails and run a few tests.

Best Regards
Lenny.

On Thu, Sep 25, 2008 at 6:44 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Note that there *are* other changes to the openib BTL in that branch
> besides just the CPC (meaning: changing the CPC meant changing other things
> as well).
>
> So if you can run with the trunk and you can't run with this branch, then
> there may be something different wrong with the hg tree other than just the
> RDMA CM stuff...
>
> Let me know what you find.
>
>
> On Sep 25, 2008, at 9:21 AM, Lenny Verkhovsky wrote:
>
>> after few more tests is seems like -mca btl_openib_cpc_include oob hangs
>> too.
>>
>> so, maybe it's something environmental.
>>
>> let me recheck it.
>>
>>
>> On 9/25/08, Jeff Squyres <jsquy...@cisco.com> wrote:
>> On Sep 25, 2008, at 7:25 AM, Lenny Verkhovsky wrote:
>>
>> I have RDMACM got hanged on np=16 ( dual core dual cpu).
>>
>>
>> Yuck. I've run all of the intel tests at 32 procs (4ppn). What exactly
>> did you run and where exactly did it hang? Can you get stack traces?
>>
>> it seems like it got hanged on the last machine (
>> witch1,witch2,witch3,witch4)
>>
>> when I ctrl-c the mpirun, I got defunct procs on the last machine.
>>
>> #ps -ef |grep mpi
>> root 5321 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5322 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5323 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5324 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>>
>>
>> Are you seeing ORTE problems?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>
> --
> Jeff Squyres
> Cisco Systems