I haven't tried a configless setup yet, but the problem you're hitting looks
like it could be a DNS issue. Can you do a DNS lookup of n26 from the login
node? The way non-interactive batch jobs are launched may not require that
resolution from the submit host, but I believe it is required for interactive
jobs.
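For example, something like this from the login node (host needs bind-utils
installed; getent goes through NSS and /etc/hosts the same way Slurm's own
lookups do, so it is the closer test):

$ host n26
$ getent hosts n26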

Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research 
Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobb...@mines.edu  

Our values: Trust | Integrity | Respect | Responsibility

On 3/22/21, 11:24, "slurm-users on behalf of Josef Dvoracek" 
<slurm-users-boun...@lists.schedmd.com on behalf of j...@fzu.cz> wrote:

    Hi @list;

    I was able to configure a "configless" Slurm cluster with a quite
    minimalistic slurm.conf everywhere, of course except on the slurmctld
    server. All nodes are running slurmd, including the front-end/login
    nodes, so that they pull the config.

    Submitting jobs using sbatch scripts works fine, but interactive jobs 
    using srun are failing with

    $ srun --verbose -w n26 --pty /bin/bash
    ...
    srun: error: fwd_tree_thread: can't find address for host n26, check 
    slurm.conf
    srun: error: Task launch for 200137.0 failed on node n26: Can't find an 
    address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check 
    slurm.conf
    ...


    Does that mean that on submit hosts one has to manually specify all the
    relevant NodeNames?
    I thought that running slurmd there would pull the configuration from
    the slurm server. (I can see the file is actually successfully pulled
    into /run/slurm/conf/slurm.conf.)
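    For reference, the client commands read the file pointed to by
    $SLURM_CONF if it is set, otherwise the default path (here
    /etc/slurm/slurm.conf), so a quick sanity check of what srun is
    actually picking up:

    $ echo ${SLURM_CONF:-unset}
    $ ls -l /etc/slurm/slurm.conf /run/slurm/conf/slurm.conf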


    So far I found two workarounds:

    workaround 1:

    specify the NodeNames in slurm.conf on the login/front-end nodes:

    NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 
    State=UNKNOWN

    then, srun works as expected.
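    In other words, the otherwise minimal slurm.conf on the login node
    (shown in full below) just gains the node definitions, roughly:

    ClusterName=CLUSTERNAME
    ControlMachine=slurmserver2.DOMAIN
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurmserver2.DOMAIN
    AccountingStoragePort=7031
    SlurmctldParameters=enable_configless
    NodeName=n[(...)n26(...)] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2
    State=UNKNOWN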


    workaround 2:

    point the SLURM_CONF environment variable at the slurm.conf pulled by
    slurmd:

    export SLURM_CONF=/run/slurm/conf/slurm.conf

    then again, srun works as expected.
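    To make that persistent for every user on the login node, one option
    (just a sketch; the profile.d filename is arbitrary) would be:

    # /etc/profile.d/slurm_conf.sh -- example filename, not from the original setup
    export SLURM_CONF=/run/slurm/conf/slurm.conf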


    Is this expected behavior? I actually expected that srun on a configless
    login/front-end node with a running slurmd would pick up the pulled
    configuration, but apparently that's not the case.

    cheers

    josef


    setup at front-end and compute nodes:

    [root@FRONTEND ~]# slurmd --version
    slurm 20.02.5
    [root@FRONTEND ~]#

    [root@FRONTEND ~]# cat /etc/sysconfig/slurmd
    SLURMD_OPTIONS="--conf-server slurmserver2.DOMAIN"
    [root@FRONTEND ~]#

    [root@FRONTEND ~]# cat /etc/slurm/slurm.conf
    ClusterName=CLUSTERNAME
    ControlMachine=slurmserver2.DOMAIN
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurmserver2.DOMAIN
    AccountingStoragePort=7031
    SlurmctldParameters=enable_configless
    [root@FRONTEND ~]#

    -- 
    Josef Dvoracek
    Institute of Physics | Czech Academy of Sciences
    cell: +420 608 563 558 | https://telegram.me/jose_d | FZU phone nr. : 2669

