Hi Lyn,

Unfortunately, rebooting the node makes no difference: the job just gets
requeued and the node goes back to 'mix~'. What baffles me is that there
is obviously some sort of communication problem between the slurmctld on
the admin node and the slurmd on the compute node, yet I can't find
anything in the log files to indicate what's going wrong.
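In case it is useful to anyone following along, downing/resuming the node
and turning up the logging look roughly like this (node001 is our affected
node; the debug level is just an example):

    # Down the node to requeue/cancel the stuck job, then bring it back:
    scontrol update NodeName=node001 State=DOWN Reason="job stuck in CONFIGURING"
    scontrol update NodeName=node001 State=RESUME

    # Raise slurmctld's log verbosity on the admin node:
    scontrol setdebug debug3

    # On node001, run slurmd in the foreground with extra verbosity:
    slurmd -D -vvvv

Note that 'scontrol setdebug' changes the slurmctld log level on the fly,
so the controller does not need to be restarted.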
Cheers,

Loris

Lyn Gerner <schedulerqu...@gmail.com> writes:

> Re: [slurm-dev] Job stuck in CONFIGURING, node is 'mix~'
>
> Hi Loris,
>
> At least with earlier releases, I've not found a way to act directly upon
> the job. However, if it's possible to down the node, that should requeue
> (or cancel) the job.
>
> Best,
> Lyn
>
> On Tue, Sep 12, 2017 at 3:40 AM, Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> Hi,
>
> I have a node which is powered on and to which I have sent a job. The
> output of sinfo is
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> test         up 7-00:00:00      1   mix~ node001
>
> The output of squeue is
>
>   JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
> 1795993      test 7_single loris CF 24:29     1 node001
>
> I don't understand the node state 'mix~'. If it occurs at all, I would
> expect it to exist only very briefly between 'idle~' and 'mix#'. The '~'
> is certainly incorrect, as the node is not in a power-saving state, which
> in our case means powered off.
>
> This problem may have existed in 16.05.10-2, but we are currently using
> 17.02.7. All nodes in the cluster apart from this one are functioning
> normally.
>
> Does anyone have any idea what we might be doing wrong?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin       Email loris.benn...@fu-berlin.de

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin       Email loris.benn...@fu-berlin.de