As a follow-up: the problem was with host name resolution. The error was
introduced by a change to the Rocks environment that broke reverse
lookups for host names.
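
For anyone who runs into the same symptom, here is a minimal sketch of a
forward/reverse lookup check. It is not the procedure used to diagnose this
cluster, only an illustration of the kind of check that exposes a broken
reverse lookup. The function name check_reverse_lookup and the host name
compute-0-0 are hypothetical. The resolver functions are passed as parameters
so the check can be exercised without a live DNS.

```python
import socket


def check_reverse_lookup(hostname,
                         forward=socket.gethostbyname,
                         reverse=socket.gethostbyaddr):
    """Return (ok, detail): does hostname resolve to an address and back?"""
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return False, f"forward lookup failed: {exc}"

    try:
        canonical, aliases, _addresses = reverse(address)
    except OSError as exc:
        return False, f"reverse lookup of {address} failed: {exc}"

    # Accept a match on the fully qualified name or on the short name.
    names = {canonical.lower(), *(alias.lower() for alias in aliases)}
    short_names = {name.split(".")[0] for name in names}
    wanted = hostname.lower()
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {address} -> {canonical}"
    return False, f"{hostname} -> {address} -> {canonical} (name mismatch)"


if __name__ == "__main__":
    # Hypothetical node name; substitute the hosts listed in your scheduler.
    ok, detail = check_reverse_lookup("compute-0-0")
    print("OK  " if ok else "FAIL", detail)
```

Running this on the head node and on a compute node, for each execution host,
should show whether the two directions agree everywhere.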
--
Ray Muno
Rolf Vandevaart wrote:
>>
>> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
>> environment variable: MPIRUN_RANK
>> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
>> Missing required environment variable: MPIRUN_RANK
>>
> I do not recognize these errors as pa
Ray Muno wrote:
Rolf Vandevaart wrote:
Ray Muno wrote:
Ray Muno wrote:
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE. MPI communication is over InfiniBand.
We also have OpenMPI 1.3 installed and receive s
Ray Muno wrote:
> Tha give me
How about "That gives me"
>
> PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
> environment variable: MPIRUN_RANK
> PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
> Missing required environment variable: MPIRUN_RANK
>
>
--
Rolf Vandevaart wrote:
> Ray Muno wrote:
>> Ray Muno wrote:
>>
>>> We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
>>> Scheduling is done through SGE. MPI communication is over InfiniBand.
>>>
>>>
>>
>> We also have OpenMPI 1.3 installed and receive similar errors.
>
Ray Muno wrote:
Ray Muno wrote:
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE. MPI communication is over InfiniBand.
We also have OpenMPI 1.3 installed and receive similar errors.
This does sound like a problem with SGE. By
Ray Muno wrote:
> We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
> Scheduling is done through SGE. MPI communication is over InfiniBand.
>
We also have OpenMPI 1.3 installed and receive similar errors.
--
Ray Muno
University of Minnesota
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE. MPI communication is over InfiniBand.
We have been running with this setup for over 9 months. Last week, all
user jobs stopped executing (cluster load dropped to zero). Users can
schedule jobs b