Re: [slurm-users] Intel MPI startup

2019-04-29 Thread Chris Samuel
On Monday, 29 April 2019 8:47:49 AM PDT Michael Robbert wrote: > Intel has supposedly supported PMI-2 since their 2017 release and that > is what SchedMD suggested we use in a recent bug report to them, but I > found that it no longer works in Intel MPI 2019. I opened a bug report > with Intel on

Re: [slurm-users] Job dispatching policy

2019-04-29 Thread Chris Samuel
On Monday, 29 April 2019 5:18:56 AM PDT Mahmood Naderan wrote: > [mahmood@rocks7 ~]$ rocks run host compute-0-1 "file > /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2" Given that file says it's a shell script, try and run it with this to see what doesn't work: rocks run host

[slurm-users] Intel MPI startup

2019-04-29 Thread Michael Robbert
I am curious which startup method other sites are using with Intel MPI. According to the documentation, srun with Slurm's PMI is the recommended way (https://slurm.schedmd.com/mpi_guide.html#intel_srun). Intel has supposedly supported PMI-2 since their 2017 release and that is what SchedMD

Re: [slurm-users] Job dispatching policy

2019-04-29 Thread Prentice Bisbal
I see two separate, unrelated problems here: Problem 1: Warning: untrusted X11 forwarding setup failed: xauth key data not generated What have you done to investigate this xauth problem further? I know there have been discussions about this problem in the past on this mailing list. Did you

Re: [slurm-users] Job dispatching policy

2019-04-29 Thread Mahmood Naderan
[mahmood@rocks7 ~]$ rocks run host compute-0-1 "file /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2" Warning: untrusted X11 forwarding setup failed: xauth key data not generated /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2: POSIX shell script, ASCII text executable

[slurm-users] scontrol reboot issue

2019-04-29 Thread Marcus Wagner
Dear all, we use "scontrol reboot asap reason= nextstate=resume" to reboot nodes, e.g. after a kernel update. I must say, though, that this only works SOMETIMES. Often Slurm forgets that maintenance is pending for a node and therefore does not reboot it: ncg01    DRAINING