RDMACM creates the same QPs with the same tunings as OOB, so I don't see how the CPC could affect performance.
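If it helps to rule the CPC in or out directly, one option (a sketch, assuming the btl_openib_cpc_include parameter that selects the openib connect module in 1.4.x/1.5 -- worth double-checking the name on your build) is to force each connect module explicitly and rerun the exact same case, so that the CPC is the only thing that changes between runs:

# force the oob CPC only
mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib \
       -mca btl_openib_cpc_include oob imb/src/IMB-MPI1 gather -npmin 64

# force the rdmacm CPC only
mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib \
       -mca btl_openib_cpc_include rdmacm imb/src/IMB-MPI1 gather -npmin 64

If both runs show the same numbers and the same hang behavior, the connection manager is not the variable.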
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory


On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:

> +1 on what Pasha said -- if using rdmacm fixes the problem, then there's
> something else nefarious going on...
>
> You might want to check padb with your hangs to see where all the processes
> are hung, to see if anything obvious jumps out. I'd be surprised if there's a
> bug in the oob cpc; it's been around for a long, long time and should be
> pretty stable.
>
> Do we create QPs differently between oob and rdmacm, such that perhaps they
> are "better" (maybe better routed, or using a different SL, or ...) when
> created via rdmacm?
>
>
> On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:
>
>> RDMACM or OOB cannot affect the performance of this benchmark, since they
>> are not involved in communication. So I'm not sure that the performance
>> changes you see are related to the connection manager change.
>> About oob - I'm not aware of any hang issues there; the code is very, very
>> old and we have not touched it in a long time.
>>
>> Regards,
>>
>> Pavel (Pasha) Shamis
>> ---
>> Application Performance Tools Group
>> Computer Science and Math Division
>> Oak Ridge National Laboratory
>> Email: sham...@ornl.gov
>>
>>
>> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>>
>>> Hi,
>>>
>>> For the first problem, I can see that when using rdmacm instead of oob as
>>> the openib connection manager I get much better performance results (and
>>> no hangs!):
>>>
>>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>>> imb/src/IMB-MPI1 gather -npmin 64
>>>
>>>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>         0         1000         0.04         0.05         0.05
>>>         1         1000        19.64        19.69        19.67
>>>         2         1000        19.97        20.02        19.99
>>>         4         1000        21.86        21.96        21.89
>>>         8         1000        22.87        22.94        22.90
>>>        16         1000        24.71        24.80        24.76
>>>        32         1000        27.23        27.32        27.27
>>>        64         1000        30.96        31.06        31.01
>>>       128         1000        36.96        37.08        37.02
>>>       256         1000        42.64        42.79        42.72
>>>       512         1000        60.32        60.59        60.46
>>>      1024         1000        82.44        82.74        82.59
>>>      2048         1000       497.66       499.62       498.70
>>>      4096         1000       684.15       686.47       685.33
>>>      8192          519       544.07       546.68       545.85
>>>     16384          519       653.20       656.23       655.27
>>>     32768          519       704.48       707.55       706.60
>>>     65536          519       918.00       922.12       920.86
>>>    131072          320      2414.08      2422.17      2418.20
>>>    262144          160      4198.25      4227.58      4213.19
>>>    524288           80      7333.04      7503.99      7438.18
>>>   1048576           40     13692.60     14150.20     13948.75
>>>   2097152           20     30377.34     32679.15     31779.86
>>>   4194304           10     61416.70     71012.50     68380.04
>>>
>>> How can the oob cause the hang? Isn't it only used to bring up the
>>> connection?
>>> Does the oob play any part in how the connections are made?
>>>
>>> Thanks,
>>> Doron
>>>
>>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham <doron.o...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> All machines in the setup are iDataPlex with Nehalem, 12 cores per node
>>>> and 24GB memory.
>>>>
>>>> * Problem 1 - OMPI 1.4.3 hangs in gather:
>>>>
>>>> I'm trying to run the IMB gather benchmark with OMPI 1.4.3 (vanilla).
>>>> It hangs when np >= 64 and the message size exceeds 4k:
>>>>
>>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib
>>>> imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>>>>
>>>> voltairenodes consists of 64 machines.
>>>>
>>>> #----------------------------------------------------------------
>>>> # Benchmarking Gather
>>>> # #processes = 64
>>>> #----------------------------------------------------------------
>>>>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>         0         1000         0.02         0.02         0.02
>>>>         1          331        14.02        14.16        14.09
>>>>         2          331        12.87        13.08        12.93
>>>>         4          331        14.29        14.43        14.34
>>>>         8          331        16.03        16.20        16.11
>>>>        16          331        17.54        17.74        17.64
>>>>        32          331        20.49        20.62        20.53
>>>>        64          331        23.57        23.84        23.70
>>>>       128          331        28.02        28.35        28.18
>>>>       256          331        34.78        34.88        34.80
>>>>       512          331        46.34        46.91        46.60
>>>>      1024          331        63.96        64.71        64.33
>>>>      2048          331       460.67       465.74       463.18
>>>>      4096          331       637.33       643.99       640.75
>>>>
>>>> This is the padb output:
>>>>
>>>> padb -A -x -Ormgr=mpirun -tree:
>>>>
>>>> Warning, remote process state differs across ranks
>>>> state : ranks
>>>> R (running)  : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
>>>> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
>>>> Stack trace(s) for thread: 1
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> main() at ?:?
>>>>   IMB_init_buffers_iter() at ?:?
>>>>     IMB_gather() at ?:?
>>>>       PMPI_Gather() at pgather.c:175
>>>>         mca_coll_sync_gather() at coll_sync_gather.c:46
>>>>           ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>>           -----------------
>>>>           [0,3-63] (62 processes)
>>>>           -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>>               mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>>                 ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>           -----------------
>>>>           [1] (1 processes)
>>>>           -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>>               mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>>                 ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>           -----------------
>>>>           [2] (1 processes)
>>>>           -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>>               ompi_request_default_wait() at request/req_wait.c:37
>>>>                 ompi_request_wait_completion() at ../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../opal/threads/condition.h:99
>>>> Stack trace(s) for thread: 2
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> start_thread() at ?:?
>>>>   btl_openib_async_thread() at btl_openib_async.c:344
>>>>     poll() at ?:?
>>>> Stack trace(s) for thread: 3
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> start_thread() at ?:?
>>>>   service_thread_start() at btl_openib_fd.c:427
>>>>     select() at ?:?
>>>>
>>>> When running padb again after a couple of minutes, the number of processes
>>>> at each stack position stays the same, but different ranks now occupy those
>>>> positions.
>>>>
>>>> For example, this is the diff between two padb outputs:
>>>>
>>>> Warning, remote process state differs across ranks
>>>> state : ranks
>>>> -R (running)  : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
>>>> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
>>>> +R (running)  : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
>>>> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
>>>> Stack trace(s) for thread: 1
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> @@ -13,21 +13,21 @@
>>>>   mca_coll_sync_gather() at coll_sync_gather.c:46
>>>>   ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>>   -----------------
>>>> - [0,3-63] (62 processes)
>>>> + [0-5,8-63] (62 processes)
>>>>   -----------------
>>>>   ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>>   mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>>   ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>   -----------------
>>>> - [1] (1 processes)
>>>> + [6] (1 processes)
>>>>   -----------------
>>>>   ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>>   mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>>   ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>   -----------------
>>>> - [2] (1 processes)
>>>> + [7] (1 processes)
>>>>   -----------------
>>>>   ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>>   ompi_request_default_wait() at request/req_wait.c:37
>>>>
>>>> Choosing a different gather algorithm seems to bypass the hang.
>>>> I've used the following MCA parameters:
>>>>
>>>> --mca coll_tuned_use_dynamic_rules 1
>>>> --mca coll_tuned_gather_algorithm 1
>>>>
>>>> Actually, both dec_fixed and basic_linear work, while binomial and
>>>> linear_sync don't.
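(As an aside on the algorithm selection above: if I remember right, once dynamic rules are enabled the available tuned gather algorithms and their numbering can be listed with ompi_info; the exact parameter names below are from stock 1.4.x and worth double-checking against your build:

# list the tuned gather algorithm parameters and the meaning of each value
ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll tuned | grep gather_algorithm

# then force a specific algorithm for a run, e.g. a different one than basic linear
mpirun ... -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_gather_algorithm 2 ...

That makes it easy to confirm exactly which tuned algorithm reproduces the hang.)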
>>>>
>>>> With OMPI 1.5 it doesn't hang (with any of the gather algorithms) and it
>>>> is much faster (the number of repetitions is much higher):
>>>>
>>>> #----------------------------------------------------------------
>>>> # Benchmarking Gather
>>>> # #processes = 64
>>>> #----------------------------------------------------------------
>>>>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>         0         1000         0.02         0.03         0.02
>>>>         1         1000        18.50        18.55        18.53
>>>>         2         1000        18.17        18.25        18.22
>>>>         4         1000        19.04        19.10        19.07
>>>>         8         1000        19.60        19.67        19.64
>>>>        16         1000        21.39        21.47        21.43
>>>>        32         1000        24.83        24.91        24.87
>>>>        64         1000        27.35        27.45        27.40
>>>>       128         1000        33.23        33.34        33.29
>>>>       256         1000        41.24        41.39        41.32
>>>>       512         1000        52.62        52.81        52.71
>>>>      1024         1000        73.20        73.46        73.32
>>>>      2048         1000       416.36       418.04       417.22
>>>>      4096         1000       638.54       640.70       639.65
>>>>      8192         1000       506.26       506.97       506.63
>>>>     16384         1000       600.63       601.40       601.02
>>>>     32768         1000       639.52       640.34       639.93
>>>>     65536          640       914.22       916.02       915.13
>>>>    131072          320      2287.37      2295.18      2291.35
>>>>    262144          160      4041.36      4070.58      4056.27
>>>>    524288           80      7292.35      7463.27      7397.14
>>>>   1048576           40     13647.15     14107.15     13905.29
>>>>   2097152           20     30625.00     32635.45     31815.36
>>>>   4194304           10     63543.01     70987.49     68680.48
>>>>
>>>> * Problem 2 - segmentation fault with OMPI 1.4.3/1.5 and IMB gather, np=768:
>>>>
>>>> When trying to run the same command but with np=768, I get a segmentation
>>>> fault:
>>>>
>>>> openmpi-1.4.3/bin/mpirun -np 768 -machinefile voltairenodes -mca btl
>>>> sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca
>>>> coll_tuned_gather_algorithm 1 imb/src/IMB-MPI1 gather -npmin 768 -mem 1.6
>>>>
>>>> This happens with both OMPI 1.4.3 and 1.5.
>>>>
>>>> [compa163:20249] *** Process received signal ***
>>>> [compa163:20249] Signal: Segmentation fault (11)
>>>> [compa163:20249] Signal code: Address not mapped (1)
>>>> [compa163:20249] Failing at address: 0x2aab4a204000
>>>> [compa163:20249] [ 0] /lib64/libpthread.so.0 [0x366aa0e7c0]
>>>> [compa163:20249] [ 1] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(ompi_convertor_unpack+0x15f) [0x2b077882282e]
>>>> [compa163:20249] [ 2] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9e1672]
>>>> [compa163:20249] [ 3] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9dd0b6]
>>>> [compa163:20249] [ 4] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2b077c459d87]
>>>> [compa163:20249] [ 5] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0xbe) [0x2b0778d845b8]
>>>> [compa163:20249] [ 6] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6d62]
>>>> [compa163:20249] [ 7] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6ba7]
>>>> [compa163:20249] [ 8] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6a90]
>>>> [compa163:20249] [ 9] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d298dc5]
>>>> [compa163:20249] [10] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d2990d3]
>>>> [compa163:20249] [11] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d286e9b]
>>>> [compa163:20249] [12] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_sync.so [0x2b077d07e96c]
>>>> [compa163:20249] [13] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(PMPI_Gather+0x55e) [0x2b077883ec9a]
>>>> [compa163:20249] [14] imb/src/IMB-MPI1(IMB_gather+0xe8) [0x40a088]
>>>> [compa163:20249] [15] imb/src/IMB-MPI1(IMB_init_buffers_iter+0x28a) [0x405baa]
>>>> [compa163:20249] [16] imb/src/IMB-MPI1(main+0x30f) [0x40362f]
>>>> [compa163:20249] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3669e1d994]
>>>> [compa163:20249] [18] imb/src/IMB-MPI1 [0x403269]
>>>> [compa163:20249] *** End of error message ***
>>>>
>>>> Any ideas? More debugging tips?
>>>>
>>>> Thanks,
>>>> Doron
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel