Re: [OMPI users] Latencies of atomic operations on high-performance networks
Ok, then it sounds like a regression. I will try to track it down today or tomorrow.

-Nathan

On Nov 08, 2018, at 01:41 PM, Joseph Schuchart wrote:

Sorry for the delay, I wanted to make sure that I test the same version on both Aries and IB: git master bbe5da4. I realized that I had previously tested with 3.1.3 on the IB cluster, which ran fine. If I use the same version I run into the same problem on both systems (with --mca btl_openib_allow_ib true --mca osc_rdma_acc_single_intrinsic true). I have not tried using UCX for this.

Joseph

On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:

Quick scan of the program and it looks ok to me. I will dig deeper and see if I can determine the underlying cause. What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart wrote:

While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are no longer atomic. I am attaching a test program that shows the problem: all processes but one continuously increment a counter while rank 0 continuously subtracts a large value and adds it again, eventually checking for the correct number of increments. Without the MCA parameter the test at the end succeeds, as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the MCA parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster. Is this an issue in Open MPI, or is there some problem in the test case that I am missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply; setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note that the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.

-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use case is fetch_op, but I am including the other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 4.323384us, compare_exchange: 2.035905us, accumulate: 4.326358us, get_accumulate: 4.334831us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.438080us, compare_exchange: 2.398836us, accumulate: 2.435378us, get_accumulate: 2.448347us
Shared lock, MPI_UINT32_T: fetch_op: 6.819977us, compare_exchange: 4.551417us, accumulate: 6.807766us, get_accumulate: 6.817602us
Shared lock, MPI_UINT64_T: fetch_op: 4.954860us, compare_exchange: 2.399373us, accumulate: 4.965702us, get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit
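A minimal sketch of the kind of test Joseph describes above is shown below. It is not the original mpi_fetch_op_local_remote.c; the window layout, iteration count, and the choice of a 64-bit counter are assumptions made for illustration. Every rank except 0 adds 1 to a counter hosted on rank 0, while rank 0 concurrently subtracts and re-adds a large constant; if the fetch-and-op is atomic, the subtract/add pairs cancel and the final value is NUM_INCS*(comm_size-1).

```
/*
 * Minimal sketch of the test described above -- NOT the original
 * mpi_fetch_op_local_remote.c.  Window layout, iteration count, and the
 * 64-bit counter are assumptions.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <mpi.h>

#define NUM_INCS 1000
#define BIG      ((uint64_t)1 << 32)

int main(int argc, char **argv)
{
    int rank, comm_size;
    uint64_t *counter;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    /* a single 64-bit counter hosted on rank 0 */
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) *counter = 0;

    MPI_Win_lock_all(0, win);
    if (rank == 0) MPI_Win_sync(win);  /* make the initialization visible */
    MPI_Barrier(MPI_COMM_WORLD);

    uint64_t res;
    if (rank == 0) {
        /* subtract BIG and add it back, expressed as MPI_SUM with the
         * wrapped negative value so every update uses the same MPI_Op */
        for (int i = 0; i < NUM_INCS; ++i) {
            uint64_t val = (uint64_t)0 - BIG;
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
            val = BIG;
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    } else {
        /* all other ranks increment the counter */
        for (int i = 0; i < NUM_INCS; ++i) {
            uint64_t one = 1;
            MPI_Fetch_and_op(&one, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        /* read the counter atomically by adding zero */
        uint64_t zero = 0, sum;
        MPI_Fetch_and_op(&zero, &sum, MPI_UINT64_T, 0, 0, MPI_SUM, win);
        MPI_Win_flush(0, win);
        printf("result:%llu\n", (unsigned long long)sum);
        assert(sum == (uint64_t)NUM_INCS * (uint64_t)(comm_size - 1));
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Because the subtraction is expressed as adding a wrapped negative value with MPI_SUM, all concurrent updates use the same operation and predefined datatype, so the standard requires them to be applied atomically per element.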
Re: [OMPI users] Latencies of atomic operations on high-performance networks
Sorry for the delay, I wanted to make sure that I test the same version on both Aries and IB: git master bbe5da4. I realized that I had previously tested with 3.1.3 on the IB cluster, which ran fine. If I use the same version I run into the same problem on both systems (with --mca btl_openib_allow_ib true --mca osc_rdma_acc_single_intrinsic true). I have not tried using UCX for this.

Joseph

On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:

Quick scan of the program and it looks ok to me. I will dig deeper and see if I can determine the underlying cause. What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart wrote:

While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are no longer atomic. I am attaching a test program that shows the problem: all processes but one continuously increment a counter while rank 0 continuously subtracts a large value and adds it again, eventually checking for the correct number of increments. Without the MCA parameter the test at the end succeeds, as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the MCA parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster. Is this an issue in Open MPI, or is there some problem in the test case that I am missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply; setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note that the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.

-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use case is fetch_op, but I am including the other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 4.323384us, compare_exchange: 2.035905us, accumulate: 4.326358us, get_accumulate: 4.334831us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.438080us, compare_exchange: 2.398836us, accumulate: 2.435378us, get_accumulate: 2.448347us
Shared lock, MPI_UINT32_T: fetch_op: 6.819977us, compare_exchange: 4.551417us, accumulate: 6.807766us, get_accumulate: 6.817602us
Shared lock, MPI_UINT64_T: fetch_op: 4.954860us, compare_exchange: 2.399373us, accumulate: 4.965702us, get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit
b) using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).
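The quoted snippet omits window creation, the lock epoch, and the timer. A rough sketch of how per-operation latencies of this kind could be collected is given below; the window setup, lock type, and repetition count are assumptions, not Joseph's actual benchmark code.

```
/*
 * Hypothetical harness around the loop quoted above: measures the average
 * latency of MPI_Fetch_and_op from rank 1 to a window hosted on rank 0.
 * Window setup, lock type, and NUM_REPS are assumptions.
 */
#include <stdint.h>
#include <stdio.h>
#include <mpi.h>

#define NUM_REPS 1000

int main(int argc, char **argv)
{
    int rank;
    uint64_t *base;
    MPI_Win win;
    const int target = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        uint64_t res, val = 1;
        /* use MPI_LOCK_SHARED instead to get the "shared lock" numbers */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);

        double t0 = MPI_Wtime();
        for (size_t i = 0; i < NUM_REPS; ++i) {
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
            MPI_Win_flush(target, win);
        }
        double t1 = MPI_Wtime();

        MPI_Win_unlock(target, win);
        printf("fetch_op: %fus\n", (t1 - t0) / NUM_REPS * 1e6);
    }

    MPI_Barrier(MPI_COMM_WORLD); /* rank 0 waits here while rank 1 measures */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```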
Re: [OMPI users] Latencies of atomic operations on high-performance networks
Quick scan of the program and it looks ok to me. I will dig deeper and see if I can determine the underlying cause. What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart wrote:

While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are no longer atomic. I am attaching a test program that shows the problem: all processes but one continuously increment a counter while rank 0 continuously subtracts a large value and adds it again, eventually checking for the correct number of increments. Without the MCA parameter the test at the end succeeds, as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the MCA parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster. Is this an issue in Open MPI, or is there some problem in the test case that I am missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply; setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note that the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.

-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use case is fetch_op, but I am including the other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 4.323384us, compare_exchange: 2.035905us, accumulate: 4.326358us, get_accumulate: 4.334831us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.438080us, compare_exchange: 2.398836us, accumulate: 2.435378us, get_accumulate: 2.448347us
Shared lock, MPI_UINT32_T: fetch_op: 6.819977us, compare_exchange: 4.551417us, accumulate: 6.807766us, get_accumulate: 6.817602us
Shared lock, MPI_UINT64_T: fetch_op: 4.954860us, compare_exchange: 2.399373us, accumulate: 4.965702us, get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit
b) using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 2.011794us, compare_exchange: 1.740825us, accumulate: 1.795500us, get_accumulate: 1.985409us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.017172us, compare_exchange: 1.846202us, accumulate: 1.812578us, get_accumulate: 2.005541us
Shared lock, MPI_UINT32_T: fetch_op: 5.380455us, compare_exchange: 5.164458us, accumulate: 5.230184us, get_accumulate: 5.399722us
Shared lock, MPI_UINT64_T: fetch_op: 5.415230us, compare_exchange: 1.855840us, accumulate: 5.212632us
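The info key mentioned above was still under discussion in the RMA working group at the time of this thread. The sketch below only illustrates the general mechanism of passing accumulate-related hints at window creation, using the accumulate_ops key that already exists in MPI-3; it is not the new key Nathan refers to, and whether an implementation exploits the hint is implementation-defined.

```
/*
 * Sketch: passing an info hint at window creation.  "accumulate_ops" =
 * "same_op" is an existing MPI-3 window info key (the user asserts that
 * concurrent accumulates to the same location all use the same MPI_Op).
 * It is shown only to illustrate the mechanism, not the new key the RMA
 * working group is discussing above.
 */
#include <stdint.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    uint64_t *base;
    MPI_Win win;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ops", "same_op");

    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), info,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Info_free(&info);

    /* ... RMA epochs and MPI_Fetch_and_op calls as in the benchmarks ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```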
Re: [OMPI users] Latencies of atomic operations on high-performance networks
While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are no longer atomic. I am attaching a test program that shows the problem: all processes but one continuously increment a counter while rank 0 continuously subtracts a large value and adds it again, eventually checking for the correct number of increments. Without the MCA parameter the test at the end succeeds, as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the MCA parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster. Is this an issue in Open MPI, or is there some problem in the test case that I am missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply; setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

A shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count, and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note that the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.

-Nathan

On Nov 06, 2018, at 09:38 AM, Joseph Schuchart wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet:

```
if (rank == 1) {
  uint64_t res, val;
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations; rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and ensuring that communication progresses). Of particular interest for my use case is fetch_op, but I am including the other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 4.323384us, compare_exchange: 2.035905us, accumulate: 4.326358us, get_accumulate: 4.334831us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.438080us, compare_exchange: 2.398836us, accumulate: 2.435378us, get_accumulate: 2.448347us
Shared lock, MPI_UINT32_T: fetch_op: 6.819977us, compare_exchange: 4.551417us, accumulate: 6.807766us, get_accumulate: 6.817602us
Shared lock, MPI_UINT64_T: fetch_op: 4.954860us, compare_exchange: 2.399373us, accumulate: 4.965702us, get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit
b) using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T: fetch_op: 2.011794us, compare_exchange: 1.740825us, accumulate: 1.795500us, get_accumulate: 1.985409us
Exclusive lock, MPI_UINT64_T: fetch_op: 2.017172us, compare_exchange: 1.846202us, accumulate: 1.812578us, get_accumulate: 2.005541us
Shared lock, MPI_UINT32_T: fetch_op: 5.380455us, compare_exchange: 5.164458us, accumulate: 5.230184us, get_accumulate: 5.399722us
Shared lock, MPI_UINT64_T: fetch_op: 5.415230us, compare_exchange: 1.855840us, accumulate: 5.212632us, get_ac
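For completeness, the two locking styles behind the "Exclusive lock" and "Shared lock" rows above look roughly like the sketch below; the target rank and the single operation per epoch are placeholders.

```
/*
 * Sketch contrasting the two passive-target locking styles measured above.
 * Target rank and the single operation per epoch are placeholders.
 */
#include <stdint.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    uint64_t *base, val = 1, res;
    MPI_Win win;
    const int target = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        /* exclusive lock: no other origin can access the target window
         * during the epoch, so the accumulate needs no extra lock step */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
        MPI_Win_unlock(target, win);

        /* shared lock: several origins may hold the lock at once, which is
         * what adds the extra per-accumulate locking cost described above */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
        MPI_Win_unlock(target, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```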
Re: [OMPI users] Bug with Open-MPI Processor Count
Hello Ralph,

Is there any update on this?

Thanks,
Adam LeBlanc

On Fri, Nov 2, 2018 at 11:06 AM Adam LeBlanc wrote:

> Hello Ralph,
>
> When I do the -np 7 it still fails with "There are not enough slots available in the system to satisfy the 7 slots that were requested by the application", but when I do -np 2 it will actually run from a machine that was failing, though it will only run on one other machine; in this case it ran from a machine with 2 processors to a machine with only 1 processor. If I try to make -np higher than 2 it will also fail.
>
> -Adam LeBlanc
>
> On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain wrote:
>
>> Hmmm - try adding a value for nprocs instead of leaving it blank. Say "-np 7"
>>
>> Sent from my iPhone
>>
>> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc wrote:
>>
>> Hello Ralph,
>>
>> Here is the output for a failing machine:
>>
>> [130_02:44:13_aleblanc@farbauti]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1
>>
>> == ALLOCATED NODES ==
>> farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>> hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =
>> --
>> There are not enough slots available in the system to satisfy the 7 slots that were requested by the application:
>> 10
>>
>> Either request fewer slots for your application, or make more slots available for use.
>> --
>>
>> Here is an output of a passing machine:
>>
>> [1_02:54:26_aleblanc@hyperion]{~}$ > mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1
>>
>> == ALLOCATED NODES ==
>> hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
>> farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>> =
>>
>> Yes, the hostfile is available on all nodes through an NFS mount for all of our home directories.
>>
>> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc wrote:
>>
>>> -- Forwarded message -
>>> From: Ralph H Castain
>>> Date: Thu, Nov 1, 2018 at 2:34 PM
>>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>>> To: Open MPI Users
>>>
>>> I'm a little under the weather and so will only be able to help a bit at a time. However, a couple of things to check:
>>>
>>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought the allocation was
>>>
>>> * is the hostfile available on every node?
>>>
>>> Ralph
>>>
>>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc wrote:
>>>
>>> Hello Ralph,
>>>
>>> Attached below is the verbose output for a failing machine and a passing machine.
>>>
>>> Thanks,
>>> Adam LeBlanc
>>>
>>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc wrote:

-- Forwarded message -
From: Ralph H Castain
Date: Thu, Nov 1, 2018 at 1:07 PM
Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
To: Open MPI Users

Set rmaps_base_verbose=10 for debugging output

Sent from my iPhone

On Nov 1, 2018, at 9:31 AM, Adam LeBlanc wrote:

The version, by the way, for Open-MPI is 3.1.2.

-Adam LeBlanc

On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc wrote:

> Hello, I am an employee of the UNH InterOperability Lab, and we are in the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have purchased some new hardware that has one processor, and noticed an issue when running MPI jobs on nodes that do not have similar processor counts.
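Since the discussion revolves around slot counts, a hedged illustration of an Open MPI hostfile with explicit slots follows; the hostnames and counts are placeholders, not the contents of /home/soesterreich/ce-mpi-hosts.

```
# one line per node; "slots" should match the number of processes
# you want mpirun to be allowed to place on that node
farbauti-ce  slots=2
hyperion-ce  slots=1
io-ce        slots=1
```

With explicit slot counts, a request such as "mpirun -np 4 -hostfile ce-mpi-hosts ..." has enough slots to map. Depending on the Open MPI version, nodes listed without a slots value may be counted as a single slot each, which is one common way to end up with the "not enough slots" message shown above once -np exceeds the number of slots mpirun believes it has.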