Re: [slurm-users] Unconfigured GPUs being allocated

2023-08-02 Thread Christopher Samuel

On 7/14/23 1:10 pm, Wilson, Steven M wrote:

It's not so much whether a job may or may not access the GPU but rather which
GPU(s) are included in $CUDA_VISIBLE_DEVICES. That is what controls what our
CUDA jobs can see and therefore use (within any cgroup constraints, of course).
In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU that is
not in the Slurm configuration because it is intended only for driving the
display and not for GPU computations.


Sorry I didn't see this before! Yeah that does sound different, I 
wouldn't expect that. :-(


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-19 Thread Wilson, Steven M
I found that this is actually a known bug in Slurm so I'll note it here in case 
anyone comes across this thread in the future:
  https://bugs.schedmd.com/show_bug.cgi?id=10598

Steve

From: slurm-users  on behalf of Wilson, 
Steven M 
Sent: Tuesday, July 18, 2023 5:32 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

Further testing and looking at the source code confirms what looks to me like a
bug in Slurm. GPUs that are not configured in gres.conf are detected by slurmd
on the system and then discarded because they aren't found in gres.conf. That's
fine, except they should also be hidden through cgroup control so that they
aren't visible alongside the allocated GPUs when a job runs. Slurm assumes that
the job can only see the GPUs that were allocated to it and sets
$CUDA_VISIBLE_DEVICES accordingly. Unfortunately, the job actually sees the
allocated GPUs plus any unconfigured GPUs, so $CUDA_VISIBLE_DEVICES may or may
not happen to correspond to the GPU(s) allocated by Slurm.

I was hoping that I could write a Prolog script that would adjust
$CUDA_VISIBLE_DEVICES to remove any unconfigured GPUs, but changes made with
"export CUDA_VISIBLE_DEVICES=..." don't seem to have any effect on the actual
environment of the job.

Steve


From: Wilson, Steven M 
Sent: Friday, July 14, 2023 4:10 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

It's not so much whether a job may or may not access the GPU but rather which
GPU(s) are included in $CUDA_VISIBLE_DEVICES. That is what controls what our
CUDA jobs can see and therefore use (within any cgroup constraints, of course).
In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU that is
not in the Slurm configuration because it is intended only for driving the
display and not for GPU computations.

Thanks for your thoughts!

Steve

From: slurm-users  on behalf of 
Christopher Samuel 
Sent: Friday, July 14, 2023 1:57 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated


On 7/14/23 10:20 am, Wilson, Steven M wrote:

> I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
> Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
> being made available to jobs so we end up with compute jobs being run on
> GPUs which should only be used for driving the display.

I think this is expected - it's not that Slurm is making them available,
it's that it's unaware of them and so doesn't control them in the way it
does for the GPUs it does know about. So you get the default behaviour
(any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a
way to prevent that access via cgroups games or similar.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-18 Thread Wilson, Steven M
Further testing and looking at the source code confirms what looks to me like a
bug in Slurm. GPUs that are not configured in gres.conf are detected by slurmd
on the system and then discarded because they aren't found in gres.conf. That's
fine, except they should also be hidden through cgroup control so that they
aren't visible alongside the allocated GPUs when a job runs. Slurm assumes that
the job can only see the GPUs that were allocated to it and sets
$CUDA_VISIBLE_DEVICES accordingly. Unfortunately, the job actually sees the
allocated GPUs plus any unconfigured GPUs, so $CUDA_VISIBLE_DEVICES may or may
not happen to correspond to the GPU(s) allocated by Slurm.

I was hoping that I could write a Prolog script that would adjust
$CUDA_VISIBLE_DEVICES to remove any unconfigured GPUs, but changes made with
"export CUDA_VISIBLE_DEVICES=..." don't seem to have any effect on the actual
environment of the job.
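
One avenue that might still be worth trying is a TaskProlog rather than a
Prolog: Slurm reads the TaskProlog's standard output and applies any lines of
the form "export NAME=value" to the task's environment. A rough sketch, assuming
the display-only GPU always enumerates as CUDA index 0 on the affected nodes
(that index is an assumption, and this works around the symptom rather than the
underlying bug):

    #!/bin/bash
    # Referenced from slurm.conf, e.g. TaskProlog=/etc/slurm/taskprolog.sh
    # Lines printed as "export NAME=value" are injected into the task environment.
    if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
        # Drop device index 0 (the assumed display-only GPU) from the list.
        filtered=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -vx 0 | paste -sd, -)
        echo "export CUDA_VISIBLE_DEVICES=$filtered"
    fi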

Steve


From: Wilson, Steven M 
Sent: Friday, July 14, 2023 4:10 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

It's not so much whether a job may or may not access the GPU but rather which
GPU(s) are included in $CUDA_VISIBLE_DEVICES. That is what controls what our
CUDA jobs can see and therefore use (within any cgroup constraints, of course).
In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU that is
not in the Slurm configuration because it is intended only for driving the
display and not for GPU computations.

Thanks for your thoughts!

Steve

From: slurm-users  on behalf of 
Christopher Samuel 
Sent: Friday, July 14, 2023 1:57 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated


On 7/14/23 10:20 am, Wilson, Steven M wrote:

> I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
> Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
> being made available to jobs so we end up with compute jobs being run on
> GPUs which should only be used for driving the display.

I think this is expected - it's not that Slurm is making them available,
it's that it's unaware of them and so doesn't control them in the way it
does for the GPUs it does know about. So you get the default behaviour
(any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a
way to prevent that access via cgroups games or similar.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Wilson, Steven M
I haven't seen anything that allows for disabling a defined Gres device. It
does seem to work if I define the GPUs that I don't want used and then
specifically request the other GPUs by type, e.g. "--gres=gpu:rtx_2080_ti:1".
I suppose if I set the GPU Type to "COMPUTE" for the GPUs I want to use for
computing and "UNUSED" for those that I don't, this scheme might work (e.g.,
--gres=gpu:COMPUTE:3). But then every job submission would be required to set
this option, which is not a very workable solution.
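
Roughly, that scheme would look something like the following (a sketch only;
the device paths and counts for oryx are assumptions):

    # gres.conf (sketch)
    Nodename=oryx Name=gpu Type=COMPUTE File=/dev/nvidia1
    Nodename=oryx Name=gpu Type=UNUSED  File=/dev/nvidia0

    # slurm.conf (sketch)
    NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:COMPUTE:1,gpu:UNUSED:1

    # every job then has to request the compute type explicitly:
    sbatch --gres=gpu:COMPUTE:1 my_job.sh

One possible side benefit: once the display GPU is defined as its own Gres type,
the cgroup device constraint should cover it, so jobs that don't request it
shouldn't be able to see it.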

Thanks!
Steve

From: slurm-users  on behalf of Feng 
Zhang 
Sent: Friday, July 14, 2023 3:09 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated


Very interesting issue.

I am guessing there might be a workaround: since oryx has 2 GPUs, could you
define both of them but disable the GT 710? Does Slurm support this?

Best,

Feng


On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M  wrote:
>
> Hi,
>
> I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
> gres.conf) and everything works fine when all the GPUs in a node are 
> configured in gres.conf and available to Slurm.  But we have some nodes where 
> a GPU is reserved for running the display and is specifically not configured 
> in gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes 
> it available to Slurm jobs.  A simple Slurm job that executes "nvidia-smi -L"
> will display the unconfigured GPU along with as many configured GPUs as were
> requested by the job.
>
> For example, in a node configured with this line in slurm.conf:
> NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
> and this line in gres.conf:
> Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
> I will get the following results from a job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA GeForce GT 710 (UUID: 
> GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
> GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
> GPU-0dc4da58-5026-6173-1156-c4559a268bf5)
>
> But in another node that has all GPUs configured in Slurm like this in 
> slurm.conf:
> NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 
> Gres=gpu:TITANX:2
> and this line in gres.conf:
> Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
> I get the expected results from the job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)
>
> I'm running Slurm 22.05.5.
>
> Thanks in advance for any suggestions to help correct this problem!
>
> Steve



Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Wilson, Steven M
It's not so much whether a job may or may not access the GPU but rather which
GPU(s) are included in $CUDA_VISIBLE_DEVICES. That is what controls what our
CUDA jobs can see and therefore use (within any cgroup constraints, of course).
In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU that is
not in the Slurm configuration because it is intended only for driving the
display and not for GPU computations.
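
It may also be worth keeping in mind that nvidia-smi itself ignores
$CUDA_VISIBLE_DEVICES (it queries the driver through NVML), so "nvidia-smi -L"
always lists every GPU the process can reach; only CUDA applications honour the
variable. A quick way to see both views from inside a job (deviceQuery stands
in for any CUDA program here and is just an example name):

    # inside an allocated job (sketch)
    echo "$CUDA_VISIBLE_DEVICES"       # what Slurm set for the job
    nvidia-smi -L                      # every GPU the job can reach, regardless
                                       # of CUDA_VISIBLE_DEVICES
    ./deviceQuery                      # a CUDA program enumerates only the
                                       # device(s) listed in the variable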

Thanks for your thoughts!

Steve

From: slurm-users  on behalf of 
Christopher Samuel 
Sent: Friday, July 14, 2023 1:57 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated


On 7/14/23 10:20 am, Wilson, Steven M wrote:

> I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
> Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
> being made available to jobs so we end up with compute jobs being run on
> GPUs which should only be used for driving the display.

I think this is expected - it's not that Slurm is making them available,
it's that it's unaware of them and so doesn't control them in the way it
does for the GPUs it does know about. So you get the default behaviour
(any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a
way to prevent that access via cgroups games or similar.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Feng Zhang
Very interesting issue.

I am guessing there might be a workaround: since oryx has 2 GPUs, could you
define both of them but disable the GT 710? Does Slurm support this?

Best,

Feng


On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M  wrote:
>
> Hi,
>
> I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
> gres.conf) and everything works fine when all the GPUs in a node are 
> configured in gres.conf and available to Slurm.  But we have some nodes where 
> a GPU is reserved for running the display and is specifically not configured 
> in gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes 
> it available to Slurm jobs.  A simple Slurm job that executes "nvidia-smi -L"
> will display the unconfigured GPU along with as many configured GPUs as were
> requested by the job.
>
> For example, in a node configured with this line in slurm.conf:
> NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
> and this line in gres.conf:
> Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
> I will get the following results from a job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA GeForce GT 710 (UUID: 
> GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
> GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
> GPU-0dc4da58-5026-6173-1156-c4559a268bf5)
>
> But in another node that has all GPUs configured in Slurm like this in 
> slurm.conf:
> NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 
> Gres=gpu:TITANX:2
> and this line in gres.conf:
> Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
> I get the expected results from the job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)
>
> I'm running Slurm 22.05.5.
>
> Thanks in advance for any suggestions to help correct this problem!
>
> Steve



Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Christopher Samuel

On 7/14/23 10:20 am, Wilson, Steven M wrote:

I upgraded Slurm to 23.02.3 but I'm still running into the same problem. 
Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still 
being made available to jobs so we end up with compute jobs being run on 
GPUs which should only be used for driving the display.


I think this is expected - it's not that Slurm is making them available, 
it's that it's unaware of them and so doesn't control them in the way it 
does for the GPUs it does know about. So you get the default behaviour 
(any process can access them).


If you want to stop them being accessed from Slurm you'd need to find a 
way to prevent that access via cgroups games or similar.
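
For reference, the Slurm side of that is the cgroup device constraint; a sketch
of the relevant settings (the cluster may already have these, and as this thread
suggests they only cover devices that appear in gres.conf):

    # cgroup.conf (sketch)
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainRAMSpace=yes

    # slurm.conf (sketch)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

Hiding a GPU that Slurm doesn't know about at all would likely come down to
something outside Slurm, such as device-file permissions or a udev rule.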


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Wilson, Steven M
I upgraded Slurm to 23.02.3 but I'm still running into the same problem. 
Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still being 
made available to jobs so we end up with compute jobs being run on GPUs which 
should only be used for driving the display.

Any ideas?

Thanks,
Steve

From: Wilson, Steven M
Sent: Tuesday, June 27, 2023 9:50 AM
To: slurm-users@lists.schedmd.com 
Subject: Unconfigured GPUs being allocated

Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
gres.conf) and everything works fine when all the GPUs in a node are configured 
in gres.conf and available to Slurm.  But we have some nodes where a GPU is 
reserved for running the display and is specifically not configured in 
gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes it 
available to Slurm jobs.  A simple Slurm job that executes "nvidia-smi -L" will
display the unconfigured GPU along with as many configured GPUs as were
requested by the job.

For example, in a node configured with this line in slurm.conf:
NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that 
requested a single GPU:
GPU 0: NVIDIA GeForce GT 710 (UUID: 
GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But in another node that has all GPUs configured in Slurm like this in 
slurm.conf:
NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested 
a single GPU:
GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.

Thanks in advance for any suggestions to help correct this problem!

Steve


[slurm-users] Unconfigured GPUs being allocated

2023-06-27 Thread Wilson, Steven M
Hi,

I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
gres.conf) and everything works fine when all the GPUs in a node are configured 
in gres.conf and available to Slurm.  But we have some nodes where a GPU is 
reserved for running the display and is specifically not configured in 
gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes it 
available to Slurm jobs.  A simple Slurm job that executes "nvidia-smi -L" will
display the unconfigured GPU along with as many configured GPUs as were
requested by the job.
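
The test job itself is nothing more than something along these lines (a sketch;
site-specific options such as partition or account are omitted):

    #!/bin/bash
    #SBATCH --job-name=gpu-visibility-test
    #SBATCH --gres=gpu:1               # ask Slurm for a single GPU
    #SBATCH --time=00:05:00
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    nvidia-smi -L                      # lists every GPU the job can actually see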

For example, in a node configured with this line in slurm.conf:
NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
and this line in gres.conf:
Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
I will get the following results from a job running "nvidia-smi -L" that 
requested a single GPU:
GPU 0: NVIDIA GeForce GT 710 (UUID: 
GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
GPU-0dc4da58-5026-6173-1156-c4559a268bf5)

But in another node that has all GPUs configured in Slurm like this in 
slurm.conf:
NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 Gres=gpu:TITANX:2
and this line in gres.conf:
Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
I get the expected results from the job running "nvidia-smi -L" that requested 
a single GPU:
GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)

I'm running Slurm 22.05.5.

Thanks in advance for any suggestions to help correct this problem!

Steve