Sorry, typo, try:

mpirun -np 128 --debug-daemons -mca plm rsh hostname

Josh
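(A quick note on the syntax for anyone following along: the only change from the previous mail is how the launcher is selected. The "-plm rsh" form was the typo; the launcher is picked through the MCA parameter interface, so the framework name ("plm") and the component ("rsh") are passed to "-mca". A minimal sketch, reusing the process count and test program from the rest of the thread:

    # force the rsh/ssh launcher instead of the one mpirun would otherwise select
    mpirun -np 128 --debug-daemons -mca plm rsh hostname

--debug-daemons keeps the orted daemons' debug output attached to the terminal, which is what produces the "orted_cmd: ..." lines quoted below.)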
On Tue, Jan 28, 2020 at 12:45 PM Joshua Ladd <jladd.m...@gmail.com> wrote:

> And if you try:
>
> mpirun -np 128 --debug-daemons -plm rsh hostname
>
> Josh
>
> On Tue, Jan 28, 2020 at 12:34 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>
>> Input: mpirun -np 128 --debug-daemons hostname
>>
>> Output:
>>
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
>> [Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - exiting
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered an
>> error:
>>
>> Error code: 63
>> Error name: (null)
>> Node: Gen2Node3
>>
>> when attempting to start process rank 0.
>> --------------------------------------------------------------------------
>>
>> Collin
>>
>> *From:* Joshua Ladd <jladd.m...@gmail.com>
>> *Sent:* Tuesday, January 28, 2020 12:31 PM
>> *To:* Collin Strassburger <cstrassbur...@bihrle.com>
>> *Cc:* Open MPI Users <users@lists.open-mpi.org>; Ralph Castain <r...@open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Interesting. Can you try:
>>
>> mpirun -np 128 --debug-daemons hostname
>>
>> Josh
>>
>> On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>>
>> In relation to the multi-node attempt, I haven't set that up yet, as the per-node configuration doesn't pass its tests (full node utilization, etc.).
>>
>> Here are the results for the hostname test:
>>
>> Input: mpirun -np 128 hostname
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered an
>> error:
>>
>> Error code: 63
>> Error name: (null)
>> Node: Gen2Node3
>>
>> when attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> 128 total processes failed to start
>>
>> Collin
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph Castain via users
>> *Sent:* Tuesday, January 28, 2020 12:06 PM
>> *To:* Joshua Ladd <jladd.m...@gmail.com>
>> *Cc:* Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Josh - if you read through the thread, you will see that disabling the Mellanox/IB drivers allows the program to run. It only fails when they are enabled.
>>
>> On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>> I don't see how this can be diagnosed as a "problem with the Mellanox software". This is on a single node. What happens when you try to launch on more than one node?
>>
>> Josh
>>
>> On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>>
>> Here's the I/O for these high local core count runs. ("xhpcg" is the standard HPCG benchmark.)
>>
>> Run command: mpirun -np 128 bin/xhpcg
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered an
>> error:
>>
>> Error code: 63
>> Error name: (null)
>> Node: Gen2Node4
>>
>> when attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> 128 total processes failed to start
>>
>> Collin
>>
>> *From:* Joshua Ladd <jladd.m...@gmail.com>
>> *Sent:* Tuesday, January 28, 2020 11:39 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>; Ralph Castain <r...@open-mpi.org>; Artem Polyakov <art...@mellanox.com>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Can you send the output of a failed run, including your command line?
>>
>> Josh
>>
>> On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:
>>
>> Okay, so this is a problem with the Mellanox software - copying Artem.
>>
>> On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote:
>>
>> I just tried that and it does indeed work with PBS and without Mellanox (until a reboot makes it complain about Mellanox/IB-related defaults, as no drivers were installed, etc.).
>>
>> After installing the Mellanox drivers, I used
>>
>> ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx --with-platform=contrib/platform/mellanox/optimized
>>
>> With the new compile it fails on the higher core counts.
>>
>> Collin
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ralph Castain via users
>> *Sent:* Tuesday, January 28, 2020 11:02 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Ralph Castain <r...@open-mpi.org>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Does it work with PBS but not Mellanox? Just trying to isolate the problem.
>>
>> On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>>
>> Hello,
>>
>> I have done some additional testing and I can say that it works correctly with gcc 8 and no Mellanox or PBS installed.
>>
>> I have done two runs with Mellanox and PBS installed. One run includes the actual run options I will be using, while the other includes a truncated set which still compiles but fails to execute correctly. As the option with the actual run options results in a smaller config log, I am including it here.
>>
>> Version: 4.0.2
>>
>> The config log is available at https://gist.github.com/BTemp1282020/fedca1aeed3b57296b8f21688ccae31c and the ompi dump is available at https://pastebin.com/md3HwTUR.
>>
>> The IB network information (which is not being explicitly operated across):
>>
>> Packages: MLNX_OFED and Mellanox HPC-X, both are current versions (MLNX_OFED_LINUX-4.7-3.2.9.0-rhel8.1-x86_64 and hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat8.1-x86_64)
>>
>> ulimit -l = unlimited
>>
>> ibv_devinfo:
>>
>> hca_id: mlx4_0
>>         transport:              InfiniBand (0)
>>         fw_ver:                 2.42.5000
>>         …
>>         vendor_id:              0x02c9
>>         vendor_part_id:         4099
>>         hw_ver:                 0x1
>>         board_id:               MT_1100120019
>>         phys_port_cnt:          1
>>         Device ports:
>>                 port:   1
>>                         state:          PORT_ACTIVE (4)
>>                         max_mtu:        4096 (5)
>>                         active_mtu:     4096 (5)
>>                         sm_lid:         1
>>                         port_lid:       12
>>                         port_lmc:       0x00
>>                         link_layer:     InfiniBand
>>
>> It looks like the rest of the IB information is in the config file.
>>
>> I hope this helps,
>>
>> Collin
>>
>> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> *Sent:* Monday, January 27, 2020 3:40 PM
>> *To:* Open MPI User's List <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Can you please send all the information listed here:
>>
>> https://www.open-mpi.org/community/help/
>>
>> Thanks!
>>
>> On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>>
>> Hello,
>>
>> I had initially thought the same thing about the streams, but I have 2 sockets with 64 cores each. Additionally, I have not yet turned multithreading off, so lscpu reports a total of 256 logical cores and 128 physical cores. As such, I don't see how it could be running out of streams unless something is being passed incorrectly.
>>
>> Collin
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Ray Sheppard via users
>> *Sent:* Monday, January 27, 2020 11:53 AM
>> *To:* users@lists.open-mpi.org
>> *Cc:* Ray Sheppard <rshep...@iu.edu>
>> *Subject:* Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Hi All,
>>
>> Just my two cents: I think error code 63 is saying it is running out of streams to use. I think you have only 64 cores, so at 100 you are overloading most of them. It feels like you are running out of resources trying to swap ranks in and out on the physical cores.
>>
>> Ray
>>
>> On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
>>
>> Hello Howard,
>>
>> To remove potential interactions, I have found that the issue persists without UCX and hcoll support.
>>
>> Run command: mpirun -np 128 bin/xhpcg
>>
>> Output:
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered an
>> error:
>>
>> Error code: 63
>> Error name: (null)
>> Node: Gen2Node4
>>
>> when attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> 128 total processes failed to start
>>
>> It returns this error for any process I initialize with >100 processes per node. I get the same error message for multiple different codes, so the error code is MPI-related rather than being program specific.
>>
>> Collin
>>
>> *From:* Howard Pritchard <hpprit...@gmail.com>
>> *Sent:* Monday, January 27, 2020 11:20 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
>> *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>>
>> Hello Collin,
>>
>> Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?
>>
>> Howard
>>
>> On Mon, Jan 27, 2020 at 8:38 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote:
>>
>> Hello,
>>
>> I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both of these versions cause the same error (error code 63) when utilizing more than 100 cores on a single node. The processors I am utilizing are AMD Epyc "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with both the default gcc 8 and a locally compiled gcc 9. I have already tried modifying the maximum name field values with no success.
>>
>> My compile options are:
>>
>> ./configure --prefix=${HPCX_HOME}/ompi --with-platform=contrib/platform/mellanox/optimized
>>
>> Any assistance would be appreciated,
>>
>> Collin
>>
>> Collin Strassburger
>> Bihrle Applied Research Inc.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
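(A side-by-side recap of the two configure lines quoted in the thread, for reference. The flag comments are general Open MPI configure semantics, not something asserted by anyone above, so read them as annotation rather than a diagnosis:

    # build from the original report: install into the HPC-X tree and use the
    # Mellanox-provided platform file (a preset bundle of configure options)
    ./configure --prefix=${HPCX_HOME}/ompi \
        --with-platform=contrib/platform/mellanox/optimized

    # build after the Mellanox drivers were installed: adds PBS launch support
    # via the tm interface, explicitly disables SLURM, and enables UCX support
    ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no --with-ucx \
        --with-platform=contrib/platform/mellanox/optimized

Both of these fail above 100 local ranks once the Mellanox stack is present, while the plain gcc build without Mellanox or PBS runs correctly, which is the isolation Ralph was asking about.)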