Thanks, YK and Pavel! It works.

On Tue, Aug 2, 2011 at 4:52 PM, Yevgeny Kliteynik <klit...@dev.mellanox.co.il> wrote:
See this FAQ entry:
http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc

-- YK

On 02-Aug-11 12:27 AM, Shamis, Pavel wrote:

You may find some initial XRC tuning documentation here:
https://svn.open-mpi.org/trac/ompi/ticket/1260

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:

Hi,

Please try running OMPI with XRC:

    mpirun --mca btl openib... --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...

XRC (eXtended Reliable Connection) decreases Open MPI's memory consumption by reducing the number of QPs per machine.

I'm not entirely sure that XRC is supported in OMPI 1.4, but I am sure it is in later versions of the 1.4 series (1.4.3).

BTW, I do know that the command line is extremely user-friendly and completely intuitive... :-)
I'll have an XRC entry on the OMPI FAQ web page in a day or two, where you can find more details about this issue.

OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics

-- YK
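For readers of the archive, here is how that receive-queue string decomposes. The field breakdown below is my reading of the openib BTL queue syntax described in the OMPI FAQ, so treat it as a sketch rather than authoritative documentation; the openib,self,sm BTL list is the usual choice and is an assumption here, since the thread elides it:

    # One colon-separated spec per queue; within a spec, the fields are:
    # X,<buffer size in bytes>,<number of buffers>,<low watermark>,<max pending sends>
    #
    #   X,128,256,192,128     XRC queue for tiny messages (128-byte buffers)
    #   X,2048,256,128,32     2 KB buffers
    #   X,12288,256,128,32    12 KB buffers
    #   X,65536,256,128,32    64 KB buffers
    mpirun --mca btl openib,self,sm \
           --mca btl_openib_receive_queues \
           X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 \
           -np 512 -machinefile machinefile.512 ./my_app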
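To make the memory argument concrete, here is a back-of-the-envelope model of QP consumption per HCA. The ranks-per-node value and the four-QPs-per-connection figure are illustrative assumptions (one QP per entry in the receive-queue spec), not numbers taken from this thread, and the real accounting inside the openib BTL is more involved:

    # Rough QP demand per HCA -- illustrative only.
    np=512           # total MPI ranks
    ppn=128          # ranks per node: an assumed value for a large SMP node
    nodes=$(( np / ppn ))
    qp=4             # assumed QPs per connection (one per receive queue)

    # Per-peer RC: every local rank opens QPs to every remote rank.
    echo "RC : $(( ppn * (np - ppn) * qp )) QPs"   # 196608 with these numbers
    # XRC: QPs scale with the number of remote nodes, not remote ranks.
    echo "XRC: $(( ppn * (nodes - 1) * qp )) QPs"  # 1536 with these numbers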
On 28-Jul-11 7:53 AM, 吕慧伟 (Huiwei Lv) wrote:

Dear all,

I have encountered a problem concerning running large jobs on an SMP cluster with Open MPI 1.4.
The application needs all-to-all communication: each process sends messages to all other processes via MPI_Isend. It runs fine with 256 processes; the problem occurs when the number of processes is >= 512.

The error message is:

    mpirun -np 512 -machinefile machinefile.512 ./my_app
    [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
    ...
    [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] error in endpoint reply start connect
    [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory
    ...
    mpirun has exited due to process rank 424 with PID 26841 on
    node gh31 exiting without calling "finalize".

A related post (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests it may have run out of HCA QP resources. So I checked my system configuration with 'ibv_devinfo -v' and got 'max_qp: 261056'. In my case, running with 256 processes would be under the limit: 256^2 = 65536 < 261056, but 512^2 = 262144 > 261056.
My question is: how can I increase the max_qp number of the InfiniBand HCA, or how can I get around this problem in MPI?

Thanks in advance for any help you may give!

Huiwei Lv
PhD Student at Institute of Computing Technology

-------------------------
p.s. The system information is provided below:

    $ ompi_info -v ompi full --parsable
    ompi:version:full:1.4
    ompi:version:svn:r22285
    ompi:version:release_date:Dec 08, 2009

    $ uname -a
    Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

    $ ulimit -l
    unlimited

    $ ibv_devinfo -v
    hca_id: mlx4_0
            transport:                 InfiniBand (0)
            fw_ver:                    2.7.000
            node_guid:                 00d2:c910:0001:b6c0
            sys_image_guid:            00d2:c910:0001:b6c3
            vendor_id:                 0x02c9
            vendor_part_id:            26428
            hw_ver:                    0xB0
            board_id:                  MT_0D20110009
            phys_port_cnt:             1
            max_mr_size:               0xffffffffffffffff
            page_size_cap:             0xfffffe00
            max_qp:                    261056
            max_qp_wr:                 16351
            device_cap_flags:          0x00fc9c76
            max_sge:                   32
            max_sge_rd:                0
            max_cq:                    65408
            max_cqe:                   4194303
            max_mr:                    524272
            max_pd:                    32764
            max_qp_rd_atom:            16
            max_ee_rd_atom:            0
            max_res_rd_atom:           4176896
            max_qp_init_rd_atom:       128
            max_ee_init_rd_atom:       0
            atomic_cap:                ATOMIC_HCA (1)
            max_ee:                    0
            max_rdd:                   0
            max_mw:                    0
            max_raw_ipv6_qp:           0
            max_raw_ethy_qp:           1
            max_mcast_grp:             8192
            max_mcast_qp_attach:       56
            max_total_mcast_qp_attach: 458752
            max_ah:                    0
            max_fmr:                   0
            max_srq:                   65472
            max_srq_wr:                16383
            max_srq_sge:               31
            max_pkeys:                 128
            local_ca_ack_delay:        15
                    port:   1
                            state:           PORT_ACTIVE (4)
                            max_mtu:         2048 (4)
                            active_mtu:      2048 (4)
                            sm_lid:          86
                            port_lid:        73
                            port_lmc:        0x00
                            link_layer:      IB
                            max_msg_sz:      0x40000000
                            port_cap_flags:  0x02510868
                            max_vl_num:      8 (4)
                            bad_pkey_cntr:   0x0
                            qkey_viol_cntr:  0x0
                            sm_sl:           0
                            pkey_tbl_len:    128
                            gid_tbl_len:     128
                            subnet_timeout:  18
                            init_type_reply: 0
                            active_width:    4X (2)
                            active_speed:    10.0 Gbps (4)
                            phys_state:      LINK_UP (5)
                            GID[ 0]:         fe80:0000:0000:0000:00d2:c910:0001:b6c1

Related threads in the list:
http://www.open-mpi.org/community/lists/users/2009/07/9786.php
http://www.open-mpi.org/community/lists/users/2009/08/10456.php

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
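Since the command line above is (as Yevgeny jokes) not exactly ergonomic, one standard alternative is to put the setting in an MCA parameters file, which Open MPI reads at startup. The file location is standard OMPI behavior; the value shown is simply the queue spec suggested earlier in this thread, not a separately validated tuning:

    # $HOME/.openmpi/mca-params.conf
    btl_openib_receive_queues = X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32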