Glad you resolved it! The following MCA param has changed its name:

> rmaps_base_bycore=1

should now be

rmaps_base_mapping_policy=core

HTH
Ralph
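For reference, with that rename applied, the three-line "openmpi-mca-params.conf" quoted below would presumably read as follows (a sketch; the first two parameter names are unchanged in 1.10.1, and only the last line is the one that triggers the deprecation warning quoted below):

    orte_hetero_nodes=1
    hwloc_base_binding_policy=core
    rmaps_base_mapping_policy=core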
> On Dec 17, 2015, at 5:01 PM, Jingchao Zhang <zh...@unl.edu> wrote:
>
> The "mpirun --hetero-nodes -bind-to core -map-by core" resolves the
> performance issue!
>
> I reran my test in the *same* job.
> SLURM resource request:
> #!/bin/sh
> #SBATCH -N 4
> #SBATCH -n 64
> #SBATCH --mem=2g
> #SBATCH --time=02:00:00
> #SBATCH --error=job.%J.err
> #SBATCH --output=job.%J.out
>
> env | grep SLURM:
> SLURM_CHECKPOINT_IMAGE_DIR=/lustre/work/swanson/jingchao/mpitest/examples/1.10.1/3
> SLURM_NODELIST=c[3005,3011,3019,3105]
> SLURM_JOB_NAME=submit
> SLURMD_NODENAME=c3005
> SLURM_TOPOLOGY_ADDR=s0.s5.c3005
> SLURM_PRIO_PROCESS=0
> SLURM_NODE_ALIASES=(null)
> SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
> SLURM_NNODES=4
> SLURM_JOBID=5462202
> SLURM_NTASKS=64
> SLURM_TASKS_PER_NODE=34,26,2(x2)
> SLURM_JOB_ID=5462202
> SLURM_JOB_USER=jingchao
> SLURM_JOB_UID=3663
> SLURM_NODEID=0
> SLURM_SUBMIT_DIR=/lustre/work/swanson/jingchao/mpitest/examples/1.10.1/3
> SLURM_TASK_PID=53822
> SLURM_NPROCS=64
> SLURM_CPUS_ON_NODE=36
> SLURM_PROCID=0
> SLURM_JOB_NODELIST=c[3005,3011,3019,3105]
> SLURM_LOCALID=0
> SLURM_JOB_CPUS_PER_NODE=36,26,2(x2)
> SLURM_CLUSTER_NAME=tusker
> SLURM_GTIDS=0
> SLURM_SUBMIT_HOST=login.tusker.hcc.unl.edu
> SLURM_JOB_PARTITION=batch
> SLURM_JOB_NUM_NODES=4
> SLURM_MEM_PER_NODE=2048
>
> v-1.8.4 "mpirun" and v-1.10.1 "mpirun --hetero-nodes -bind-to core -map-by core"
> now give comparable results.
> v-1.10.1 "mpirun" still has unstable performance.
>
> I tried adding the following three lines to the "openmpi-mca-params.conf" file:
> "
> orte_hetero_nodes=1
> hwloc_base_binding_policy=core
> rmaps_base_bycore=1
> "
> and ran "mpirun lmp_ompi_g++ < in.wall.2d" with v-1.10.1.
>
> This works for most tests, but some jobs are hanging with this message:
> --------------------------------------------------------------------------
> The following command line options and corresponding MCA parameter have
> been deprecated and replaced as follows:
>
> Command line options:
> Deprecated: --bycore, -bycore
> Replacement: --map-by core
>
> Equivalent MCA parameter:
> Deprecated: rmaps_base_bycore
> Replacement: rmaps_base_mapping_policy=core
>
> The deprecated forms *will* disappear in a future version of Open MPI.
> Please update to the new syntax.
> --------------------------------------------------------------------------
>
> Did I miss something in the "openmpi-mca-params.conf" file?
>
> Thanks,
>
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
>
>
> From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet <gil...@rist.or.jp>
> Sent: Wednesday, December 16, 2015 6:11 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1
>
> Binding is somehow involved in this, and I do not believe vader nor openib
> are involved here.
>
> Could you please run again with the two ompi versions, but in the *same* job?
> And before invoking mpirun, could you do
> env | grep SLURM
>
> Per your slurm request, you are running 64 tasks on 4 nodes.
> With 1.8.4, you end up running 14+14+14+22 tasks (not ideal, but quite balanced).
> With 1.10.1, you end up running 2+2+12+48 tasks (very unbalanced),
> so it is quite unfair to compare these two runs.
>
> Also, still in the same job, can you add a third run with 1.10.1 and the
> following options
> mpirun --hetero-nodes -bind-to core -map-by core ...
> and see if it helps.
>
> Cheers,
>
> Gilles
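As a concrete sketch of that third run (assuming the same LAMMPS binary and example input used elsewhere in this thread; substitute whatever input the job actually uses), the launch line would look something like:

    mpirun --hetero-nodes -bind-to core -map-by core lmp_ompi_g++ < in.wall.2d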
>
> On 12/17/2015 6:47 AM, Jingchao Zhang wrote:
>> Those jobs were launched with mpirun. Please see the attached files for the
>> binding report with OMPI_MCA_hwloc_base_report_bindings=1.
>>
>> Here is a snapshot for v-1.10.1:
>> [c2613.tusker.hcc.unl.edu:12049] MCW rank 0 is not bound (or bound to all available processors)
>> [c2613.tusker.hcc.unl.edu:12049] MCW rank 1 is not bound (or bound to all available processors)
>> [c2615.tusker.hcc.unl.edu:11136] MCW rank 2 is not bound (or bound to all available processors)
>> [c2615.tusker.hcc.unl.edu:11136] MCW rank 3 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 9 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 10 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 11 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 12 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 13 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 14 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 15 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 4 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 5 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 6 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 7 is not bound (or bound to all available processors)
>> [c2907.tusker.hcc.unl.edu:64131] MCW rank 8 is not bound (or bound to all available processors)
>>
>> The report for 1.8.4 doesn't have this issue. Any suggestions to resolve it?
>>
>> Thanks,
>> Jingchao
>>
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
>>
>>
>> From: users <users-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
>> Sent: Wednesday, December 16, 2015 1:52 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1
>>
>> When I see such issues, I immediately start to think about binding patterns.
>> How are these jobs being launched - with mpirun or srun? What do you see if
>> you set OMPI_MCA_hwloc_base_report_bindings=1 in your environment?
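For completeness, that check can be done without editing the submit script by exporting the variable just before the launch; a minimal sketch, again assuming the LAMMPS command used in this thread:

    export OMPI_MCA_hwloc_base_report_bindings=1
    mpirun lmp_ompi_g++ < in.wall.2d

Each rank then prints a line describing its binding, like the "MCW rank ... is not bound" messages shown above.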
>>
>>> On Dec 16, 2015, at 11:15 AM, Jingchao Zhang <zh...@unl.edu> wrote:
>>>
>>> Hi Gilles,
>>>
>>> The LAMMPS jobs for both versions are pure MPI. In the SLURM script, 64
>>> cores are requested from 4 nodes. So it's 64 MPI tasks and not necessarily
>>> evenly distributed across all the nodes. (Each node is equipped with 64
>>> cores.)
>>>
>>> I can reproduce the performance issue using the LAMMPS example
>>> "VISCOSITY/in.wall.2d". The run time difference is a jaw-dropping 20
>>> seconds (v-1.8.4) vs. 45 minutes (v-1.10.1). Among the multiple tests, I do
>>> have one job using v-1.10.1 that finished in 20 seconds. Again, unstable
>>> performance. We also tested other software packages such as cp2k, VASP and
>>> Quantum Espresso, and they all have similar issues.
>>>
>>> Here is the decomposed MPI time in the LAMMPS job outputs.
>>> v-1.8.4 (Job execution time: 00:00:20)
>>> Loop time of 8.94962 on 64 procs for 50000 steps with 1020 atoms
>>> Pair  time (%) = 0.270092 (3.01791)
>>> Neigh time (%) = 0.0842548 (0.941435)
>>> Comm  time (%) = 3.3474 (37.4027)
>>> Outpt time (%) = 0.00901061 (0.100682)
>>> Other time (%) = 5.23886 (58.5373)
>>>
>>> v-1.10.1 (Job execution time: 00:45:50)
>>> Loop time of 2003.07 on 64 procs for 50000 steps with 1020 atoms
>>> Pair  time (%) = 0.346776 (0.0173122)
>>> Neigh time (%) = 0.18047 (0.00900966)
>>> Comm  time (%) = 535.836 (26.7508)
>>> Outpt time (%) = 1.68608 (0.0841748)
>>> Other time (%) = 1465.02 (73.1387)
>>>
>>> I wonder if you can share your config.log and ompi_info with your v-1.10.1
>>> compilation. Hopefully we can find a solution by comparing the
>>> configuration differences. We have been playing with the cma and vader
>>> parameters but with no luck.
>>>
>>> Thanks,
>>> Jingchao
>>>
>>> Dr. Jingchao Zhang
>>> Holland Computing Center
>>> University of Nebraska-Lincoln
>>> 402-472-6400
>>>
>>>
>>> From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet <gil...@rist.or.jp>
>>> Sent: Tuesday, December 15, 2015 12:11 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1
>>>
>>> Hi,
>>>
>>> First, can you check how many MPI tasks and OpenMP threads are used with
>>> both ompi versions?
>>> /* it should be 16 MPI tasks x no OpenMP threads */
>>>
>>> Can you also post both MPI task timing breakdowns (from the output)?
>>>
>>> I tried a simple test with the VISCOSITY/in.wall.2d and I did not observe
>>> any performance difference.
>>>
>>> Can you reproduce the performance drop with an input file from the examples
>>> directory? If not, can you post your in.snr input file?
>>>
>>> Cheers,
>>>
>>> Gilles
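A generic way to answer the question about how the tasks are actually distributed (not something from this thread, just a standard check) is to launch hostname under the same mpirun and count the output:

    mpirun hostname | sort | uniq -c

which prints, for each node, the number of ranks it received.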
>>>
>>> On 12/15/2015 7:18 AM, Jingchao Zhang wrote:
>>>> Hi all,
>>>>
>>>> We installed the latest release of OpenMPI 1.10.1 on our Linux cluster and
>>>> find it having some performance issues. We tested the OpenMPI performance
>>>> against the MD simulation package LAMMPS (http://lammps.sandia.gov/).
>>>> Compared to our previous installation of version 1.8.4, the 1.10.1 is
>>>> nearly three times slower when running on multiple nodes. Run times across
>>>> four computing nodes gave the following results:
>>>>
>>>>         1.10.1     1.8.4
>>>>   1     0:09:39    0:09:21
>>>>   2     0:50:29    0:09:23
>>>>   3     0:50:29    0:09:28
>>>>   4     0:13:38    0:09:27
>>>>   5     0:10:43    0:09:34
>>>>   Ave   0:27:00    0:09:27
>>>>
>>>> Unit is hour:minute:second. Five tests were done for each case and the
>>>> averaged run time is listed in the last row. Tests on a single node give
>>>> the same run time results for both 1.10.1 and 1.8.4.
>>>>
>>>> We use SLURM as our job scheduler and the submit script for the LAMMPS job
>>>> is as below:
>>>> "#!/bin/sh
>>>> #SBATCH -N 4
>>>> #SBATCH -n 64
>>>> #SBATCH --mem=2g
>>>> #SBATCH --time=00:50:00
>>>> #SBATCH --error=job.%J.err
>>>> #SBATCH --output=job.%J.out
>>>>
>>>> module load compiler/gcc/4.7
>>>> export PATH=$PATH:/util/opt/openmpi/1.10.1/gcc/4.7/bin
>>>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/util/opt/openmpi/1.10.1/gcc/4.7/lib
>>>> export INCLUDE=$INCLUDE:/util/opt/openmpi/1.10.1/gcc/4.7/include
>>>>
>>>> mpirun lmp_ompi_g++ < in.snr"
>>>>
>>>> The "lmp_ompi_g++" binary is compiled against gcc/4.7 and openmpi/1.10.1.
>>>> The compiler flags and MPI information can be found in the attachments.
>>>> The problem here, as you can see, is the unstable performance of v-1.10.1.
>>>> I wonder if this is a configuration issue at the compilation stage.
>>>>
>>>> Below is some information I gathered according to the "Getting Help" page.
>>>>
>>>> Version of Open MPI that we are using:
>>>> Open MPI version: 1.10.1
>>>> Open MPI repo revision: v1.10.0-178-gb80f802
>>>> Open MPI release date: Nov 03, 2015
>>>>
>>>> "config.log" and "ompi_info --all" information are enclosed in the
>>>> attachment.
>>>>
>>>> Network information:
>>>> 1. OpenFabrics version
>>>> Mellanox/vendor 2.4-1.0.4 Download:
>>>> http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.4-1.0.4&mname=MLNX_OFED_LINUX-2.4-1.0.4-rhel6.6-x86_64.tgz
>>>>
>>>> 2. Linux version
>>>> Scientific Linux release 6.6
>>>> 2.6.32-504.23.4.el6.x86_64
>>>>
>>>> 3. subnet manager
>>>> OpenSM
>>>>
>>>> 4. ibv_devinfo
>>>> hca_id: mlx4_0
>>>>     transport:       InfiniBand (0)
>>>>     fw_ver:          2.9.1000
>>>>     node_guid:       0002:c903:0050:6190
>>>>     sys_image_guid:  0002:c903:0050:6193
>>>>     vendor_id:       0x02c9
>>>>     vendor_part_id:  26428
>>>>     hw_ver:          0xB0
>>>>     board_id:        MT_0D90110009
>>>>     phys_port_cnt:   1
>>>>     port: 1
>>>>         state:       PORT_ACTIVE (4)
>>>>         max_mtu:     4096 (5)
>>>>         active_mtu:  4096 (5)
>>>>         sm_lid:      1
>>>>         port_lid:    34
>>>>         port_lmc:    0x00
>>>>         link_layer:  InfiniBand
>>>>
>>>> 5. ifconfig
>>>> em1   Link encap:Ethernet  HWaddr D0:67:E5:F9:20:76
>>>>       inet addr:10.138.25.3  Bcast:10.138.255.255  Mask:255.255.0.0
>>>>       inet6 addr: fe80::d267:e5ff:fef9:2076/64 Scope:Link
>>>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>       RX packets:28977969 errors:0 dropped:0 overruns:0 frame:0
>>>>       TX packets:67069501 errors:0 dropped:0 overruns:0 carrier:0
>>>>       collisions:0 txqueuelen:1000
>>>>       RX bytes:3588666680 (3.3 GiB)  TX bytes:8145183622 (7.5 GiB)
>>>>
>>>> Ifconfig uses the ioctl access method to get the full address information,
>>>> which limits hardware addresses to 8 bytes.
>>>> Because Infiniband address has 20 bytes, only the first 8 bytes are
>>>> displayed correctly.
>>>> Ifconfig is obsolete! For replacement check ip.
>>>> ib0   Link encap:InfiniBand
>>>>       HWaddr A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>>>       inet addr:10.137.25.3  Bcast:10.137.255.255  Mask:255.255.0.0
>>>>       inet6 addr: fe80::202:c903:50:6191/64 Scope:Link
>>>>       UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>>>>       RX packets:1776 errors:0 dropped:0 overruns:0 frame:0
>>>>       TX packets:418 errors:0 dropped:0 overruns:0 carrier:0
>>>>       collisions:0 txqueuelen:1024
>>>>       RX bytes:131571 (128.4 KiB)  TX bytes:81418 (79.5 KiB)
>>>>
>>>> lo    Link encap:Local Loopback
>>>>       inet addr:127.0.0.1  Mask:255.0.0.0
>>>>       inet6 addr: ::1/128 Scope:Host
>>>>       UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>>>       RX packets:40310687 errors:0 dropped:0 overruns:0 frame:0
>>>>       TX packets:40310687 errors:0 dropped:0 overruns:0 carrier:0
>>>>       collisions:0 txqueuelen:0
>>>>       RX bytes:45601859442 (42.4 GiB)  TX bytes:45601859442 (42.4 GiB)
>>>>
>>>> 6. ulimit -l
>>>> unlimited
>>>>
>>>> Please kindly let me know if more information is needed.
>>>>
>>>> Thanks,
>>>> Jingchao
>>>>
>>>> Dr. Jingchao Zhang
>>>> Holland Computing Center
>>>> University of Nebraska-Lincoln
>>>> 402-472-6400