RDMACM creates the same QPs with the same tunings as OOB, so I don't see how 
the CPC could affect performance.

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:

> +1 on what Pasha said -- if using rdmacm fixes the problem, then there's 
> something else nefarious going on...
>
> You might want to run padb on your hangs to see where all the processes are 
> stuck and whether anything obvious jumps out.  I'd be surprised if there's a 
> bug in the oob cpc; it's been around for a long, long time and should be 
> pretty stable.
>
> Do we create QPs differently between oob and rdmacm, such that perhaps they 
> are "better" (maybe better routed, or using a different SL, or ...) when 
> created via rdmacm?
>
>
> On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:
>
>> Neither RDMACM nor OOB can affect the performance of this benchmark, since 
>> they are not involved in communication, so I'm not sure that the performance 
>> changes you see are related to the connection manager change.
>> About oob - I'm not aware of any hang issues there; the code is very, very 
>> old and we have not touched it for a long time.
>>
>> Regards,
>>
>> Pavel (Pasha) Shamis
>> ---
>> Application Performance Tools Group
>> Computer Science and Math Division
>> Oak Ridge National Laboratory
>> Email: sham...@ornl.gov
>>
>>
>>
>>
>>
>> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>>
>>> Hi,
>>>
>>> For the first problem, I can see that when using rdmacm instead of oob as 
>>> the openib connection manager, I get much better performance results (and 
>>> no hangs!).
>>>
>>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>>> imb/src/IMB-MPI1 gather -npmin 64
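>>>
>>> To make the CPC comparison explicit, I believe the connection manager can
>>> also be pinned directly; btl_openib_cpc_include should be the right
>>> parameter, but treat that as an assumption on my side and confirm it with
>>> "ompi_info --param btl openib". For example:
>>>
>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib \
>>>     -mca btl_openib_cpc_include rdmacm imb/src/IMB-MPI1 gather -npmin 64
>>>
>>> and the same command with "-mca btl_openib_cpc_include oob" for the oob case.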
>>>
>>>
>>>     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>          0         1000         0.04         0.05         0.05
>>>          1         1000        19.64        19.69        19.67
>>>          2         1000        19.97        20.02        19.99
>>>          4         1000        21.86        21.96        21.89
>>>          8         1000        22.87        22.94        22.90
>>>         16         1000        24.71        24.80        24.76
>>>         32         1000        27.23        27.32        27.27
>>>         64         1000        30.96        31.06        31.01
>>>        128         1000        36.96        37.08        37.02
>>>        256         1000        42.64        42.79        42.72
>>>        512         1000        60.32        60.59        60.46
>>>       1024         1000        82.44        82.74        82.59
>>>       2048         1000       497.66       499.62       498.70
>>>       4096         1000       684.15       686.47       685.33
>>>       8192          519       544.07       546.68       545.85
>>>      16384          519       653.20       656.23       655.27
>>>      32768          519       704.48       707.55       706.60
>>>      65536          519       918.00       922.12       920.86
>>>     131072          320      2414.08      2422.17      2418.20
>>>     262144          160      4198.25      4227.58      4213.19
>>>     524288           80      7333.04      7503.99      7438.18
>>>    1048576           40     13692.60     14150.20     13948.75
>>>    2097152           20     30377.34     32679.15     31779.86
>>>    4194304           10     61416.70     71012.50     68380.04
>>>
>>> How can the oob CPC cause the hang? Isn't it only used to bring up the 
>>> connections?
>>> Does the oob play any part after the connections have been made?
>>>
>>> Thanks,
>>> Doron
>>>
>>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham <doron.o...@gmail.com> wrote:
>>>>
>>>> Hi
>>>>
>>>> All machines in the setup are IDataPlex nodes with Nehalem CPUs, 12 cores 
>>>> and 24GB of memory per node.
>>>>
>>>>
>>>>
>>>> * Problem 1 - OMPI 1.4.3 hangs in gather:
>>>>
>>>>
>>>>
>>>> I'm trying to run the IMB gather benchmark with vanilla OMPI 1.4.3.
>>>>
>>>> The hang happens when np >= 64 and the message size exceeds 4k:
>>>>
>>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>>>>
>>>>
>>>>
>>>> voltairenodes consists of 64 machines.
>>>>
>>>>
>>>>
>>>> #----------------------------------------------------------------
>>>> # Benchmarking Gather
>>>> # #processes = 64
>>>> #----------------------------------------------------------------
>>>>     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>          0         1000         0.02         0.02         0.02
>>>>          1          331        14.02        14.16        14.09
>>>>          2          331        12.87        13.08        12.93
>>>>          4          331        14.29        14.43        14.34
>>>>          8          331        16.03        16.20        16.11
>>>>         16          331        17.54        17.74        17.64
>>>>         32          331        20.49        20.62        20.53
>>>>         64          331        23.57        23.84        23.70
>>>>        128          331        28.02        28.35        28.18
>>>>        256          331        34.78        34.88        34.80
>>>>        512          331        46.34        46.91        46.60
>>>>       1024          331        63.96        64.71        64.33
>>>>       2048          331       460.67       465.74       463.18
>>>>       4096          331       637.33       643.99       640.75
>>>>
>>>>
>>>>
>>>> This is the padb output (padb -A -x -Ormgr=mpirun -tree):
>>>>
>>>>
>>>>
>>>> Warning, remote process state differs across ranks
>>>> state : ranks
>>>> R (running) : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
>>>> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
>>>>
>>>> Stack trace(s) for thread: 1
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> main() at ?:?
>>>>   IMB_init_buffers_iter() at ?:?
>>>>     IMB_gather() at ?:?
>>>>       PMPI_Gather() at pgather.c:175
>>>>         mca_coll_sync_gather() at coll_sync_gather.c:46
>>>>           ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>>             -----------------
>>>>             [0,3-63] (62 processes)
>>>>             -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>>               mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>>                 ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>             -----------------
>>>>             [1] (1 processes)
>>>>             -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>>               mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>>                 ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>             -----------------
>>>>             [2] (1 processes)
>>>>             -----------------
>>>>             ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>>               ompi_request_default_wait() at request/req_wait.c:37
>>>>                 ompi_request_wait_completion() at ../ompi/request/request.h:375
>>>>                   opal_condition_wait() at ../opal/threads/condition.h:99
>>>>
>>>> Stack trace(s) for thread: 2
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> start_thread() at ?:?
>>>>   btl_openib_async_thread() at btl_openib_async.c:344
>>>>     poll() at ?:?
>>>>
>>>> Stack trace(s) for thread: 3
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> -----------------
>>>> start_thread() at ?:?
>>>>   service_thread_start() at btl_openib_fd.c:427
>>>>     select() at ?:?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> When running padb again after a couple of minutes, I can see that the same 
>>>> number of processes remain at each position, but different ranks occupy 
>>>> those positions.
>>>>
>>>> For example, this is the diff between the two padb outputs:
>>>>
>>>>
>>>>
>>>> Warning, remote process state differs across ranks
>>>> state : ranks
>>>> -R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
>>>> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
>>>> +R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
>>>> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
>>>> Stack trace(s) for thread: 1
>>>> -----------------
>>>> [0-63] (64 processes)
>>>> @@ -13,21 +13,21 @@
>>>> mca_coll_sync_gather() at coll_sync_gather.c:46
>>>> ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>> -----------------
>>>> - [0,3-63] (62 processes)
>>>> + [0-5,8-63] (62 processes)
>>>> -----------------
>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>> mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>> -----------------
>>>> - [1] (1 processes)
>>>> + [6] (1 processes)
>>>> -----------------
>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>> mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>> -----------------
>>>> - [2] (1 processes)
>>>> + [7] (1 processes)
>>>> -----------------
>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>> ompi_request_default_wait() at request/req_wait.c:37
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Choosing a different gather algorithm seems to bypass the hang. I've used 
>>>> the following mca parameters:
>>>>
>>>> --mca coll_tuned_use_dynamic_rules 1
>>>> --mca coll_tuned_gather_algorithm 1
>>>>
>>>>
>>>> Actually, both dec_fixed and basic_linear work, while binomial and 
>>>> linear_sync don't.
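>>>>
>>>> For reference, a full invocation forcing one algorithm looks like this 
>>>> (I'm assuming the usual numbering where 1 is basic_linear, 2 binomial and 
>>>> 3 linear_sync; "ompi_info --param coll tuned" should list the exact 
>>>> values, so treat the mapping as an assumption):
>>>>
>>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib \
>>>>     -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_gather_algorithm 1 \
>>>>     imb/src-1.4.2/IMB-MPI1 gather -npmin 64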
>>>>
>>>>
>>>>
>>>> With OMPI 1.5 it doesn't hang (with any of the gather algorithms) and it 
>>>> is much faster (the number of repetitions is much higher):
>>>>
>>>> #----------------------------------------------------------------
>>>> # Benchmarking Gather
>>>> # #processes = 64
>>>> #----------------------------------------------------------------
>>>>     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>          0         1000         0.02         0.03         0.02
>>>>          1         1000        18.50        18.55        18.53
>>>>          2         1000        18.17        18.25        18.22
>>>>          4         1000        19.04        19.10        19.07
>>>>          8         1000        19.60        19.67        19.64
>>>>         16         1000        21.39        21.47        21.43
>>>>         32         1000        24.83        24.91        24.87
>>>>         64         1000        27.35        27.45        27.40
>>>>        128         1000        33.23        33.34        33.29
>>>>        256         1000        41.24        41.39        41.32
>>>>        512         1000        52.62        52.81        52.71
>>>>       1024         1000        73.20        73.46        73.32
>>>>       2048         1000       416.36       418.04       417.22
>>>>       4096         1000       638.54       640.70       639.65
>>>>       8192         1000       506.26       506.97       506.63
>>>>      16384         1000       600.63       601.40       601.02
>>>>      32768         1000       639.52       640.34       639.93
>>>>      65536          640       914.22       916.02       915.13
>>>>     131072          320      2287.37      2295.18      2291.35
>>>>     262144          160      4041.36      4070.58      4056.27
>>>>     524288           80      7292.35      7463.27      7397.14
>>>>    1048576           40     13647.15     14107.15     13905.29
>>>>    2097152           20     30625.00     32635.45     31815.36
>>>>    4194304           10     63543.01     70987.49     68680.48
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> * Problem 2 - segmentation fault with OMPI 1.4.3/1.5 and IMB gather, 
>>>> np=768:
>>>>
>>>> When trying to run the same command, but with np=768, I get a segmentation 
>>>> fault:
>>>>
>>>> openmpi-1.4.3/bin/mpirun -np 768 -machinefile voltairenodes -mca btl 
>>>> sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca 
>>>> coll_tuned_gather_algorithm 1 imb/src/IMB-MPI1 gather -npmin 768 -mem 1.6
>>>>
>>>>
>>>>
>>>> This happens with both OMPI 1.4.3 and 1.5.
>>>>
>>>>
>>>>
>>>> [compa163:20249] *** Process received signal ***
>>>> [compa163:20249] Signal: Segmentation fault (11)
>>>> [compa163:20249] Signal code: Address not mapped (1)
>>>> [compa163:20249] Failing at address: 0x2aab4a204000
>>>> [compa163:20249] [ 0] /lib64/libpthread.so.0 [0x366aa0e7c0]
>>>> [compa163:20249] [ 1] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(ompi_convertor_unpack+0x15f) [0x2b077882282e]
>>>> [compa163:20249] [ 2] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9e1672]
>>>> [compa163:20249] [ 3] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9dd0b6]
>>>> [compa163:20249] [ 4] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2b077c459d87]
>>>> [compa163:20249] [ 5] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0xbe) [0x2b0778d845b8]
>>>> [compa163:20249] [ 6] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6d62]
>>>> [compa163:20249] [ 7] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6ba7]
>>>> [compa163:20249] [ 8] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6a90]
>>>> [compa163:20249] [ 9] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d298dc5]
>>>> [compa163:20249] [10] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d2990d3]
>>>> [compa163:20249] [11] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d286e9b]
>>>> [compa163:20249] [12] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_sync.so [0x2b077d07e96c]
>>>> [compa163:20249] [13] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(PMPI_Gather+0x55e) [0x2b077883ec9a]
>>>> [compa163:20249] [14] imb/src/IMB-MPI1(IMB_gather+0xe8) [0x40a088]
>>>> [compa163:20249] [15] imb/src/IMB-MPI1(IMB_init_buffers_iter+0x28a) [0x405baa]
>>>> [compa163:20249] [16] imb/src/IMB-MPI1(main+0x30f) [0x40362f]
>>>> [compa163:20249] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3669e1d994]
>>>> [compa163:20249] [18] imb/src/IMB-MPI1 [0x403269]
>>>> [compa163:20249] *** End of error message ***
>>>>
>>>>
>>>> Any ideas? More debugging tips?
>>>>
>>>> Thanks,
>>>> Doron
>>>
>>
>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

