Re: [slurm-users] slurm_pam_adapt & configless - set-up

2020-12-02 Thread Ole Holm Nielsen
Hi Frank, You must update Slurm to a more recent version due to a configless bug that existed in early versions of 20.02. /Ole On 02-12-2020 15:50, Heckes, Frank wrote: Hello all, sorry if this has been asked and/or answered before. I couldn’t find a posting related to my problem. I’m

Re: [slurm-users] FairShare

2020-12-02 Thread Paul Edmon
Yup, our doc is for the classic fairshare not for fairtree. Thanks for the kudos on the doc by the way.  We are glad it is useful. -Paul Edmon- On 12/2/2020 12:45 PM, Ryan Cox wrote: That is not for Fair Tree, which is what Micheal asked about. Ryan On 12/2/20 10:32 AM, Renfro, Michael

Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
You seem to be saying sort the users in my account ((rank-1)/user count)=FS (no subaccounts). But that doesn't calculate the FS values I'm seeing. I still see no way to calculate ~FS. From: slurm-users on behalf of Micheal Krombopulous Sent: Wednesday,

Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
Yes, that concept of rank tripped me up. The "count of user associations that start at root" you mean? Do you mean all associations across all accounts or just the account being examined? Then you say "final fairshare factor similar to (rank-- / user_assoc_count)". Wouldn't that equal 1? I'm

Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
That is not for Fair Tree, which is what Micheal asked about. Ryan On 12/2/20 10:32 AM, Renfro, Michael wrote: Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/

Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
From https://slurm.schedmd.com/fair_tree.html: The basic idea is to set rank equal to the count of user associations then start at root: *   Calculate Level Fairshare for the subtree's children *   Sort children of the subtree *   Visit the children in descending order: -    If user, assign
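
The traversal quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Slurm's implementation: the `Assoc` class and the function names are hypothetical, the Level FS values are taken as already computed, and ties between siblings (which Slurm handles specially) are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assoc:
    """Hypothetical node of the association tree (account or user)."""
    name: str
    level_fs: float = 0.0      # Level FS among siblings, assumed precomputed
    is_user: bool = False
    children: List["Assoc"] = field(default_factory=list)


def count_users(node: Assoc) -> int:
    """Count the user associations at or below this node."""
    if node.is_user:
        return 1
    return sum(count_users(child) for child in node.children)


def fair_tree(root: Assoc) -> Dict[str, float]:
    """Assign each user a factor similar to rank / user_assoc_count."""
    total = count_users(root)
    rank = total               # rank starts at the count of user associations
    result: Dict[str, float] = {}

    def visit(account: Assoc) -> None:
        nonlocal rank
        # Visit children in descending Level FS order (ties not handled here).
        ordered = sorted(account.children, key=lambda c: c.level_fs, reverse=True)
        for child in ordered:
            if child.is_user:
                result[child.name] = rank / total
                rank -= 1      # the "rank--" step from the documentation
            else:
                visit(child)   # descend into the sub-account

    visit(root)
    return result


# Made-up example: account A outranks account B at the top level.
root = Assoc("root", children=[
    Assoc("A", level_fs=2.0, children=[
        Assoc("a1", level_fs=1.5, is_user=True),
        Assoc("a2", level_fs=0.5, is_user=True),
    ]),
    Assoc("B", level_fs=0.5, children=[
        Assoc("b1", level_fs=9.0, is_user=True),
    ]),
])
factors = fair_tree(root)
```

In this made-up tree, b1 has the highest Level FS of any user but still ends up with the lowest factor, because every user under A is ranked before any user under B. That is one way two users with similar Level FS values can show quite different FairShare values in sshare.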

Re: [slurm-users] FairShare

2020-12-02 Thread Erik Bryer
I read that link. If Fair Share is so rational (low users get high scores, and high users get low scores), then why do ajoel's and xtsao's Fair Share scores differ this much? Their Level Fair Share scores make more sense. >sray ajoel 10.05 42449

Re: [slurm-users] FairShare

2020-12-02 Thread Erik Bryer
I'm not talking about the Level Fair Share. That's easy to compute. I'm talking about Fair Share -- what sshare prints out on the rightmost side. From: slurm-users on behalf of Ryan Cox Sent: Wednesday, December 2, 2020 10:31 AM To: Slurm User Community List ;

Re: [slurm-users] FairShare

2020-12-02 Thread Renfro, Michael
Yesterday, I posted

Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
It's really similar to a binary search tree.  Within each account, Level FS is calculated as Shares / Usage.  https://slurm.schedmd.com/SUG14/fair_tree.pdf has more details, starting at page 34 or so.  It even has an "animation". Ryan On 12/2/20 10:22 AM, Micheal Krombopulous
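
As a rough illustration of the Shares / Usage ratio, the sketch below normalizes both quantities among the siblings of one account and divides them. The function name and the input dictionaries are hypothetical, and the numbers are invented. Real Slurm works from decayed usage, so treat this as an approximation of the idea rather than the exact arithmetic sshare performs.

```python
from typing import Dict


def level_fs(raw_shares: Dict[str, float], raw_usage: Dict[str, float]) -> Dict[str, float]:
    """Level FS for siblings in one account: normalized shares / normalized usage."""
    total_shares = sum(raw_shares.values())
    total_usage = sum(raw_usage.values())
    result: Dict[str, float] = {}
    for name, shares in raw_shares.items():
        share_fraction = shares / total_shares
        usage_fraction = raw_usage[name] / total_usage if total_usage > 0 else 0.0
        # Assumption: a sibling with no usage is treated as infinitely under-served.
        result[name] = float("inf") if usage_fraction == 0 else share_fraction / usage_fraction
    return result


# Invented numbers: equal shares, but alice has used three times as much as bob.
example = level_fs({"alice": 1, "bob": 1}, {"alice": 300, "bob": 100})
```

Here alice gets 0.5 / 0.75 (about 0.67) and bob gets 0.5 / 0.25 = 2.0, so bob sorts ahead of alice within the account.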

Re: [slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
I've read the manual and I re-read the other link. What they boil down to is Fair Share is calculated based on a recondite "rooted plane tree", which I do not have the background in discrete math to understand. I'm hoping someone can explain it so my little kernel can understand.

Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
Micheal, Details are at https://slurm.schedmd.com/fair_tree.html . If they have the same shares and usage as each other, they will have the same fair share value.  One thing to keep in mind is that sshare rounds or truncates the values, so 0.00

[slurm-users] FairShare

2020-12-02 Thread Micheal Krombopulous
Can someone tell me how to calculate fairshare (under fairtree)? I can't figure it out. I would have thought it would be the same score for all users in an account. E.g., here is one of my accounts: Account User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage    LevelFS  

Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco
On 12/2/20 4:18 PM, Paul Edmon wrote: You can dig through the slurmctld log and search for the JobID. That should tell you what Slurm was doing at the time. Aha, thanks a lot! Found the culprit: [2020-12-02T06:45:14.200] error: Nodes issaf-0-1 not responding [2020-12-02T06:45:28.212] requeue

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-02 Thread Robert Kudyba
> been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It seems to have started to occur when I enabled proctrack/cgroup and changed select/linear to select/cons_tres. Our slurm.conf has the same setting: SelectType=select/cons_tres SelectTypeParameters=CR_CPU

[slurm-users] slurm_pam_adapt & configless - set-up

2020-12-02 Thread Heckes, Frank
Hello all, sorry if this has been asked and/or answered before. I couldn’t find a posting related to my problem. I’m using slurm 20.02.01 and use a configless – set-up for all login and compute nodes. I set-up slurm PAM on a test node following the instructions at

Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Paul Edmon
You can dig through the slurmctld log and search for the JobID. That should tell you what Slurm was doing at the time. -Paul Edmon- On 12/2/2020 6:27 AM, Adrian Sevcenco wrote: Hi! I encountered a situation when a bunch of jobs were restarted and this is seen from Requeue=1 Restarts=1
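
A small helper for that kind of log search might look like the sketch below. The function name is hypothetical, and the pattern assumes the job is mentioned as `JobId=<id>`, `job <id>` or `job_<id>`; the exact wording of slurmctld messages varies between versions, so adjust the pattern to what your log actually contains.

```python
import re
from typing import Iterable, List


def lines_for_job(lines: Iterable[str], job_id: int) -> List[str]:
    """Return the log lines that mention the given job ID."""
    # Anchor on a "job"/"JobId" prefix so that timestamps and other
    # numbers containing the same digits are not matched by accident.
    pattern = re.compile(rf"\bjob(?:id)?[= _]{job_id}\b", re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

To use it, open the file named by SlurmctldLogFile in slurm.conf and pass the file object together with the job ID. Messages about a node going down may not mention the job at all, so it is worth also reading the lines just before the first match.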

Re: [slurm-users] Randomize Slurm Node Allocation

2020-12-02 Thread Adrian Sevcenco
On 12/2/20 1:27 PM, Fabio Moreira wrote: Hi, I would like to know if Slurm has any configuration to enable randomized node allocation, since we have 256 nodes in our cluster and the first nodes are always allocated first. Is there any way to allocate them randomly? We have already

[slurm-users] Randomize Slurm Node Allocation

2020-12-02 Thread Fabio Moreira
Hi, I would like to know if Slurm has any configuration to enable randomized node allocation, since we have 256 nodes in our cluster and the first nodes are always allocated first. Is there any way to allocate them randomly? We have already added the option "LLN=YES" to the

[slurm-users] job restart :: how to find the reason

2020-12-02 Thread Adrian Sevcenco
Hi! I encountered a situation where a bunch of jobs were restarted, as seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0. So I would like to know how I can find out why there is a Requeue (when there is only one partition defined) and why there is a restart. Thanks a

Re: [slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

2020-12-02 Thread Yair Yarom
Hi, We also noticed this. We eventually set HealthCheckInterval to its maximum value (65535) and created a systemd timer which runs the scripts outside of Slurm, with proper intervals and randomized delays. Yair. On Wed, Dec 2, 2020 at 9:03 AM wrote: > Hello, > > > > Our slurm