Hello Howard,

To rule out potential interactions, I tested without UCX and hcoll support and 
found that the issue persists.

Run command: mpirun -np 128 bin/xhpcg
Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: 63
Error name: (null)
Node: Gen2Node4

when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start

It returns this error for any job I launch with more than 100 processes per 
node.  I get the same error message for multiple different codes, so the error 
is MPI-related rather than program-specific.

Collin

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Collin Strassburger <cstrassbur...@bihrle.com>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ 
processors per node

Hello Collin,

Could you provide more information about the error?  Is there any output from 
either Open MPI or, maybe, UCX, that could provide more information about the 
problem you are hitting?

Howard


On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users 
<users@lists.open-mpi.org> wrote:
Hello,

I am having difficulty with Open MPI versions 4.0.2 and 3.1.5.  Both of these 
versions cause the same error (error code 63) when utilizing more than 100 
cores on a single node.  The processors I am utilizing are AMD EPYC “Rome” 
7742s.  The OS is CentOS 8.1.  I have tried compiling with both the default gcc 
8 and locally compiled gcc 9.  I have already tried modifying the maximum name 
field values with no success.

My configure options are:
./configure \
     --prefix=${HPCX_HOME}/ompi \
     --with-platform=contrib/platform/mellanox/optimized

Any assistance would be appreciated,
Collin

Collin Strassburger
Bihrle Applied Research Inc.
