I'm setting up a cloud partition and running into communication problems 
between my nodes. This looks like something I have misconfigured, or 
information I haven't correctly supplied to Slurm, but the low-level nature of 
the error has made it hard for me to figure out what I've done wrong.

I have a batch script which is essentially:

        #!/bin/sh
        #SBATCH --time=2
        #SBATCH --partition=cloud
        #SBATCH --ntasks=8
        #SBATCH --cpus-per-task=1
        srun -vvvvv --slurmd-debug=verbose singularity exec my-image.sif some args

When I submit this with `sbatch`, two 4-core VM nodes are started up as 
expected, the batch script is sent to one of them, and it begins executing the 
`srun`. That seems to create the job step, but it then fails when trying to 
communicate with the nodes in the allocation to start the tasks:

        srun: jobid 320: nodes(2):`ec[0-1]', cpu counts: 4(x2)
        srun: debug2: creating job with 8 tasks
        srun: debug:  requesting job 320, user 1000, nodes 2 including ((null))
        srun: debug:  cpus 8, tasks 8, name singularity, relative 65534
        srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
        srun: debug:  Entering slurm_step_launch
        srun: debug:  mpi type = (null)
        srun: debug:  mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
        srun: debug:  Entering _msg_thr_create()
        srun: debug4: eio: handling events for 2 objects
        srun: debug3: eio_message_socket_readable: shutdown 0 fd 9
        srun: debug3: eio_message_socket_readable: shutdown 0 fd 5
        srun: debug:  initialized stdio listening socket, port 43793
        srun: debug:  Started IO server thread (139796182816512)
        srun: debug:  Entering _launch_tasks
        srun: debug3: IO thread pid = 1507
        srun: debug4: eio: handling events for 4 objects
        srun: debug2: Called _file_readable
        srun: debug3:   false, all ioservers not yet initialized
        srun: launching StepId=320.0 on host ec0, 4 tasks: [0-3]
        srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 0
        srun: launching StepId=320.0 on host ec1, 4 tasks: [4-7]
        srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 1
        srun: debug2: Called _file_writable
        srun: debug3:   false
        srun: debug3:   eof is false
        srun: debug2: Called _file_writable
        srun: debug3:   false
        srun: debug3:   eof is false
        srun: debug3: Called _listening_socket_readable
        srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
        srun: route/default: init: route default plugin loaded
        srun: debug3: Success.
        srun: debug3: Tree sending to ec0
        srun: debug2: Tree head got back 0 looking for 2
        srun: debug3: Tree sending to ec1
        srun: error: slurm_get_port: Address family '0' not supported
        srun: error: Error connecting, bad data: family = 0, port = 0
        srun: debug3: problems with ec1
        srun: error: slurm_get_port: Address family '0' not supported
        srun: error: Error connecting, bad data: family = 0, port = 0
        srun: debug3: problems with ec0
        srun: debug2: Tree head got back 2
        srun: debug:  launch returned msg_rc=1001 err=1001 type=9001
        srun: error: Task launch for StepId=320.0 failed on node ec1: Communication connection failure
        srun: debug:  launch returned msg_rc=1001 err=1001 type=9001
        srun: error: Task launch for StepId=320.0 failed on node ec0: Communication connection failure
        srun: error: Application launch failed: Communication connection failure
        srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
        srun: error: Timed out waiting for job step to complete

It looks like the problem is that `srun` cannot get correct addresses for the 
nodes in order to send data to them. This does not appear to be a failure to 
translate the hostnames to addresses with DNS (which should work on these 
nodes). Instead, the Slurm code in `srun` seems to think it already has 
addresses, and attempts to use them even though they are in some uninitialized 
or partially initialized state (`ss_family` == 0, matching the "family = 0, 
port = 0" errors above).
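
To back up the claim that DNS is not the issue, a check along the following 
lines could be run on the node where `srun` executes. This is only a minimal 
sketch: `check_hosts` is a hypothetical helper name of my own, it assumes 
`getent` is available on the nodes, and the node names `ec0` and `ec1` are the 
ones from the log above.

```shell
#!/bin/sh
# Hypothetical helper: report whether each given hostname resolves
# through the system resolver (the same NSS path getaddrinfo() uses).
check_hosts() {
    rc=0
    for host in "$@"; do
        if addr=$(getent hosts "$host"); then
            # Keep only the first field (the address) of the first result.
            printf '%s resolves to %s\n' "$host" "${addr%% *}"
        else
            printf '%s does NOT resolve\n' "$host"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_hosts ec0 ec1` from inside the batch script, just before the 
`srun` line, would show whether both names resolve at the moment the step is 
launched. If they do, that would support the idea that the bad address comes 
from somewhere other than name resolution.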
