[slurm-dev] Re: building slurm with rpmbuild and hwloc support

2014-10-27 Thread Chrysovalantis Paschoulas
Hi! You need two packages installed on the system where you build Slurm (in my case a RHEL-based distro): hwloc and hwloc-devel. You also don't need the .rpmmacros for hwloc if you have installed these packages; by default this option is enabled. ;) Btw, I have never used a custom hwloc
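A quick way to confirm that both packages are present on the build host is to query the RPM database before running rpmbuild. The sketch below is only an illustration, assuming an RPM-based system with `rpm` on the PATH; the function names `query_rpm` and `missing_packages` are hypothetical and not part of Slurm or hwloc. It relies on `rpm -q` printing "package NAME is not installed" for each absent package.

```python
import re
import subprocess

# The two packages named in the message above.
REQUIRED = ("hwloc", "hwloc-devel")


def query_rpm(packages=REQUIRED):
    """Run `rpm -q` for the given packages and return its combined output.

    Assumes an RPM-based distro; rpm exits non-zero when a package is
    missing, so the return code is deliberately not checked here.
    """
    result = subprocess.run(
        ["rpm", "-q", *packages],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def missing_packages(rpm_query_output, required=REQUIRED):
    """Return the required packages that rpm reported as not installed."""
    not_installed = set(
        re.findall(
            r"^package (\S+) is not installed$",
            rpm_query_output,
            flags=re.MULTILINE,
        )
    )
    return [name for name in required if name in not_installed]
```

Calling `missing_packages(query_rpm())` on the build host should return an empty list when both packages are installed.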

[slurm-dev] Re: slurm cannot work with Infiniband after rebooting

2014-10-27 Thread Chrysovalantis Paschoulas
Hi! This is certainly not connected to Slurm; it is a problem with your InfiniBand+IMPI configuration. You should ask for help on other forums or mailing lists ;) First, I would suggest configuring the dat.conf file correctly. In my case it is "/etc/dat.conf". You have to comm

[slurm-dev] Re: slurm cannot work with Infiniband after rebooting

2014-10-27 Thread Tingyang Xu
Thank you very much, Chrysovalantis. I just created a topic in the Intel forum, though your suggestion did not fix our issue. I will also update this topic if I find the solution, in case other Slurm users run into a similar issue. Thanks, Tingyang Xu From: Chrysovalantis Paschoulas Sent: Mon

[slurm-dev] logrotate causing job authentication failure

2014-10-27 Thread E V
Had 2 jobs die yesterday morning with a slurm_load_jobs error: Protocol authentication error from inside DRMAA, and this interesting message in the log: If munged is up, restart with --num-threads=10 error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or dir

[slurm-dev] Re: logrotate causing job authentication failure

2014-10-27 Thread jette
Slurm already has connect retry logic (10 times with 0.1 sec between retries). DRMAA should need no changes unless it directly accesses munge. Has anyone else seen this problem? Quoting E V: Had 2 jobs die yesterday morning with a slurm_load_jobs error: Protocol authentication error fro
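For readers unfamiliar with the pattern, the retry behaviour described above (10 attempts, 0.1 sec apart) can be sketched roughly as follows. This is not Slurm's or munge's actual code; `connect_with_retry` is a hypothetical helper and the socket path is whatever the caller passes in. It only illustrates why a socket that disappears briefly, for example while a daemon restarts, may still be reachable within the retry window.

```python
import socket
import time


def connect_with_retry(path, attempts=10, delay=0.1):
    """Try to connect to a Unix domain socket, retrying on failure.

    Illustrative only: the defaults mirror the numbers quoted in the
    message above (10 tries, 0.1 s apart), not any real Slurm API.
    """
    last_error = None
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock  # caller is responsible for closing it
        except OSError as exc:
            sock.close()
            last_error = exc
            # Sleep between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise ConnectionError(
        "could not connect to %s after %d attempts" % (path, attempts)
    ) from last_error
```

With these defaults the total wait is under one second, so an outage longer than that would still surface as an error to the caller.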

[slurm-dev] Re: logrotate causing job authentication failure

2014-10-27 Thread E V
Looking at the DRMAA code it appears false was returned from calling slurm_load_job( &job_info, fsd_atoi(self->job_id), SHOW_ALL), which triggered the error output and stack dump. Haven't looked at the code for slurm_load_job to see if it's doing anything different. I'm using 14.03.08, FYI. On Mo

[slurm-dev] reccomended software stack for development?

2014-10-27 Thread Manuel Rodríguez Pascual
Hi all, I intend to work on Slurm, modifying it to satisfy my needs and (hopefully) to add some new functionality. However, I am something of a newbie with this kind of software development, so I am writing to ask for advice. My question is, can you recommend any tools for the develo

[slurm-dev] Re: reccomended software stack for development?

2014-10-27 Thread Andy Riebs
Hi Manuel, The first rule is "Keep it simple!" I suggest that you start by viewing this as 2 problems: 1. Learning how to work with Slurm 2. Learning how to work with clusters For learning how to work with Slurm, cloning a copy of the repo is a good start. In the "Developers" note

[slurm-dev] Re: reccomended software stack for development?

2014-10-27 Thread Trey Dockendorf
I wouldn't count what I've done as production-ready but I have a Puppet module for BLCR [1] and one for SLURM [2]. Also there's one for managing SLURM QOS and clusters using native Puppet types [3]. They likely won't aid in development as the two SLURM related modules both assume you have build R

[slurm-dev] Re: reccomended software stack for development?

2014-10-27 Thread rf
> "Manuel" == Manuel Rodríguez Pascual > writes: Hi Manuel, Manuel> Hi all, I have the intention of working on Slurm, modifying Manuel> it to satisfy my needs and (hopefully) include some new Manuel> functionalities. I am however kind of newbie with this kind Manuel> of

[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions

2014-10-27 Thread Ryan Cox
Trey, I'm not sure why your jobs aren't starting. Someone else will have to answer that question. You can model an organizational hierarchy a lot better in 14.11 due to changes in Fairshare=parent for accounts. If you only want fairshare to matter at the research group and user levels but