Re: [slurm-users] SLURM_TMPDIR

2019-12-09 Thread Angelines
Hi Roger, thanks for your answer, but it doesn't work in our case and I don't understand why. Angelines Alberto Morillas Unidad de Arquitectura Informática Despacho: 22.1.32 Telf.: +34 91 346 6119 Fax: +34 91 346 6537 skype: angelines.alberto

[slurm-users] nss_slurm and sudo

2019-12-09 Thread Brian Andrus
So it seems nss_slurm does not play well with sudo. If I connect to a box that uses it and try to use sudo, I get: *sudo: PAM account management error: Authentication service cannot retrieve authentication info* Has anyone else seen this? Is there a workaround? Brian Andrus

Re: [slurm-users] question about partition definition

2019-12-09 Thread Brian W. Johanson
Jeff, Create a qos with maxjobs defined. https://slurm.schedmd.com/qos.html https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#quality-of-service-qos If you haven't used slurm qos before, you may want to check out the other limits possible; they are more flexible than maxjobs. Add the qos to
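
As a rough illustration of the QOS suggestion above, the following Python sketch only builds the command lines and the partition line; it does not run anything. The QOS name `limitjobs`, the partition name `limited`, the node list, and the helper function names are hypothetical, and `GrpJobs` is picked here as one possible QOS limit on running jobs (an assumption; the reply does not say which limit to use). Check the sacctmgr and slurm.conf man pages for your Slurm version before applying any of it.

```python
import shlex


def qos_limit_commands(qos_name, max_running_jobs):
    """Build sacctmgr argument lists that create a QOS and cap its running jobs.

    GrpJobs is an assumed choice of limit; other QOS limits may fit better.
    """
    if max_running_jobs < 1:
        raise ValueError("max_running_jobs must be at least 1")
    return [
        # -i applies the change without the interactive confirmation prompt.
        ["sacctmgr", "-i", "add", "qos", qos_name],
        ["sacctmgr", "-i", "modify", "qos", qos_name,
         "set", f"GrpJobs={max_running_jobs}"],
    ]


def partition_line(partition, nodes, qos_name):
    """Build a slurm.conf partition line that attaches the QOS to the partition."""
    return f"PartitionName={partition} Nodes={nodes} QOS={qos_name}"


if __name__ == "__main__":
    # Hypothetical names and values, for illustration only.
    for cmd in qos_limit_commands("limitjobs", 10):
        print(shlex.join(cmd))
    print(partition_line("limited", "node[01-04]", "limitjobs"))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without Slurm; an administrator would review the output and run it by hand.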

Re: [slurm-users] question about partition definition

2019-12-09 Thread Brian Haymore
Have you looked at the limits you can set at the QOS or Account level in slurmdbd? There seems to be better granularity at those levels from what I've seen. -- Brian D. Haymore University of Utah Center for High Performance Computing 155 South 1452 East RM 405 Salt Lake City, Ut 84112 Phone: 8

[slurm-users] question about partition definition

2019-12-09 Thread Jeffrey R. Lang
I need to set up a partition that limits the number of jobs allowed to run at one time. Looking at the slurm.conf page for partition definitions I don't see a MaxJobs option. Is there a way to limit the number of jobs in a partition? Thanks, Jeff

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Alex Chekholko
I had found some inconsistent behavior with the epilog that I didn't understand, but we worked around it at our site and didn't follow up. https://bugs.schedmd.com/show_bug.cgi?id=6911 On Mon, Dec 9, 2019 at 11:58 AM Brian Andrus wrote: > Absolutely, which we do, however it is difficult to simul

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Brian Andrus
Absolutely, which we do; however, it is difficult to simulate all the possible job failures/endings/cancellations in 2 minutes, or at all for some things. So we post on the forums both to see if this has been found out and to draw attention to the fact that the documentation could be improved by

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Riebs, Andy
At the risk of stating the obvious… these seem like the sort of questions that could be answered with a 2-minute test. Better yet, not just answered, but with answers specific to your configuration ☺ From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Alex Chekholko Se

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Alex Chekholko
Hi, I had asked a similar question recently (maybe a year ago) and also got crickets. I think in our case we were not able to ensure that the epilog always ran for different types of job failures, so we just had the users add some more cleanup code to the end of their jobs _and_ also run separate

[slurm-users] NoInAddrAny and NodeAddr

2019-12-09 Thread Daniel Ahlin
Hi, We would like to bind slurm to a specific address and thought NodeAddr=1.2.3.4 CommunicationParameters=NoInAddrAny would be a good idea. However, the manpage says: "NoInAddrAny - Used to directly bind to the address of what the node resolves to instead of binding messages to any addre

Re: [slurm-users] Question about networks and connectivity

2019-12-09 Thread Jeffrey T Frey
Open MPI matches available hardware in node(s) against its compiled-in capabilities. Those capabilities are expressed as modular shared libraries (see e.g. $PREFIX/lib64/openmpi). You can use environment variables or command-line flags to influence which modules get used for specific purposes.
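
To make the environment-variable and command-line routes mentioned above concrete, here is a small Python sketch, not taken from the message itself. It relies on the Open MPI convention that an MCA parameter can be set either as an `OMPI_MCA_<name>` environment variable or as `--mca <name> <value>` on the `mpirun` command line. The helper names and the example value `self,tcp` are illustrative assumptions; which components are actually available depends on how Open MPI was built.

```python
def mca_env(params):
    """Map MCA parameter names to OMPI_MCA_-prefixed environment variables."""
    return {f"OMPI_MCA_{name}": value for name, value in params.items()}


def mca_flags(params):
    """Express the same MCA parameters as mpirun command-line arguments."""
    flags = []
    for name, value in params.items():
        flags += ["--mca", name, value]
    return flags


if __name__ == "__main__":
    # Illustrative selection only: restrict the byte-transfer layer to
    # the loopback ("self") and TCP components.
    selection = {"btl": "self,tcp"}
    print(mca_env(selection))
    print(["mpirun"] + mca_flags(selection) + ["./my_app"])  # ./my_app is hypothetical
```

Either form expresses the same selection; the environment route is convenient inside a batch script, while the flags keep the choice visible on the launch line.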

Re: [slurm-users] Question about networks and connectivity

2019-12-09 Thread Sysadmin CAOS
Hi mercan, OK, I forgot to compile OpenMPI with Infiniband support... But I still have a doubt: SLURM scheduler assigns (offers) some nodes called "node0x" to my sbatch job because in my SLURM cluster nodes have been added with "node0x" name. My OpenMPI application has been (now) compiled wit