I'm working on setting up a cloud partition and running into some communication problems between my nodes. This looks like something I have misconfigured, or information I haven't correctly supplied to slurm, but the low-level nature of the error has made it hard for me to figure out what I've done wrong.
I have a batch script which is essentially:

```sh
#!/bin/sh
#SBATCH --time=2
#SBATCH --partition=cloud
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1

srun -vvvvv --slurmd-debug=verbose singularity exec my-image.sif some args
```

Submitting this with `sbatch`, two 4-core VM nodes are started up as expected, the batch script is sent to one of them, and it begins executing the `srun`. That seems to allocate the necessary job steps, but then fails when trying to communicate with the nodes in the allocation to start the tasks:

```
srun: jobid 320: nodes(2):`ec[0-1]', cpu counts: 4(x2)
srun: debug2: creating job with 8 tasks
srun: debug: requesting job 320, user 1000, nodes 2 including ((null))
srun: debug: cpus 8, tasks 8, name singularity, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 9
srun: debug3: eio_message_socket_readable: shutdown 0 fd 5
srun: debug: initialized stdio listening socket, port 43793
srun: debug: Started IO server thread (139796182816512)
srun: debug: Entering _launch_tasks
srun: debug3: IO thread pid = 1507
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: launching StepId=320.0 on host ec0, 4 tasks: [0-3]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 0
srun: launching StepId=320.0 on host ec1, 4 tasks: [4-7]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 1
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: route/default: init: route default plugin loaded
srun: debug3: Success.
srun: debug3: Tree sending to ec0
srun: debug2: Tree head got back 0 looking for 2
srun: debug3: Tree sending to ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec0
srun: debug2: Tree head got back 2
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec1: Communication connection failure
srun: debug: launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec0: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
```

It looks like the problem is an inability to get correct addresses for the nodes in order to send data to them. But rather than a failure to translate the hostnames to addresses with DNS (which should work on these nodes), it appears that the slurm code in `srun` thinks it already has addresses, and attempts to use them even though they are in some uninitialized or partially initialized state (`ss_family` == 0).