Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-29 Rahmanpour Koushki, Maysam
Thank you for the responses.


In response to some of the suggestions, I would like to provide further details 
on my specific use case. I am currently exploring the concept of malleable 
jobs, which can adapt their computing resources at runtime.

To tackle the MPI incompatibility issue associated with malleable jobs, there 
are solutions such as Flex-MPI, which extends MPI to support resource 
adaptivity for malleable jobs at runtime. Furthermore, there are scheduling 
algorithms tailored to malleable jobs, which aim to allocate resources 
efficiently and optimize job scheduling around the dynamic nature of these 
jobs.
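To make the idea of a malleable-job scheduling policy concrete, here is a toy Python sketch of one simple policy, equipartition: every malleable job keeps at least its minimum node count, and spare nodes are handed out one at a time to whichever job currently holds the fewest, up to its maximum. This is an illustration only, not Flex-MPI's method or any specific published algorithm; MalleableJob and equipartition are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MalleableJob:
    # Hypothetical job description: a malleable job can run on any node
    # count between min_nodes and max_nodes (inclusive).
    name: str
    min_nodes: int
    max_nodes: int


def equipartition(jobs, total_nodes):
    """Return {job name: node count} under a simple equipartition policy."""
    needed = sum(j.min_nodes for j in jobs)
    if needed > total_nodes:
        raise ValueError("not enough nodes to give every job its minimum")

    # Every job starts at its minimum.
    alloc = {j.name: j.min_nodes for j in jobs}
    spare = total_nodes - needed

    # Hand out spare nodes one at a time to the smallest job that can still grow.
    while spare > 0:
        growable = [j for j in jobs if alloc[j.name] < j.max_nodes]
        if not growable:
            break  # every job is at its maximum; remaining nodes stay idle
        smallest = min(growable, key=lambda j: alloc[j.name])
        alloc[smallest.name] += 1
        spare -= 1
    return alloc
```

For example, with jobs A (1 to 8 nodes), B (2 to 3) and C (1 to 16) on 12 nodes, the sketch returns {'A': 5, 'B': 3, 'C': 4}. Re-running it when a job arrives or finishes gives new per-job targets; the difference from the current allocation is exactly what the resource manager would need to shrink or expand at runtime.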

My primary objective is to understand how Slurm can effectively support 
malleable jobs, so I am investigating how Slurm can expand and shrink a 
running job's node allocation at runtime.


Best Regards


Maysam



From: slurm-users  on behalf of Diego 
Zuccato 
Sent: Wednesday, June 28, 2023 4:15:44 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in 
Slurm

IIUC it's not possible to increase resource usage once the job is
started: it would mess up the scheduler and MPI comms (probably).

But I also think you're trying to find a problem for a "solution". Just
state the problem you're facing instead of proposing a solution :)
What software are you running? How does it detect that a resize is
needed? How would it handle the expansion?

Diego


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Chris Samuel

On 28/6/23 04:02, Rahmanpour Koushki, Maysam wrote:

> Upon reviewing the current FAQ, I found that it states node shrinking is 
> only possible for pending jobs. Unfortunately, it does not provide 
> additional information or examples to clarify if this functionality can 
> be extended to running jobs.


You can definitely release nodes from a running job; what I believe the 
FAQ is saying is that you cannot, for example, change the number of cores 
per node or the amount of memory you requested once a job is running.
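As I understand the Slurm FAQ, releasing nodes from a running job is done with scontrol update, giving either a new node count (NumNodes=) or the specific nodes the job should keep (NodeList=). Job steps on the released nodes are killed unless they were started with --no-kill, and scontrol generates a script that resets the job's Slurm environment variables. Below is a minimal Python sketch that builds and runs that command; shrink_command and shrink_job are hypothetical helper names, and running it assumes scontrol is on PATH and that you are allowed to modify the job.

```python
import subprocess


def shrink_command(job_id, num_nodes=None, node_list=None):
    """Build the argv for shrinking a running job with scontrol.

    Pass either num_nodes (the new node count) or node_list (the nodes to keep).
    """
    if (num_nodes is None) == (node_list is None):
        raise ValueError("pass exactly one of num_nodes or node_list")
    cmd = ["scontrol", "update", f"JobId={job_id}"]
    if num_nodes is not None:
        if num_nodes < 1:
            raise ValueError("num_nodes must be at least 1")
        cmd.append(f"NumNodes={num_nodes}")
    else:
        cmd.append(f"NodeList={node_list}")
    return cmd


def shrink_job(job_id, **kwargs):
    """Run scontrol; raises CalledProcessError if Slurm rejects the update."""
    cmd = shrink_command(job_id, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Return everything scontrol printed, e.g. its note about the resize script.
    return result.stdout + result.stderr
```

Inside an allocation this would be called as shrink_job(os.environ["SLURM_JOB_ID"], num_nodes=2). Afterwards, per the FAQ, variables such as SLURM_JOB_NUM_NODES, SLURM_JOB_NODELIST and SLURM_NTASKS are stale until the generated script is sourced, so do that before the next srun.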


As for why you'd do that, we've had people who (before we set up a 
mechanism to automatically reboot nodes to address this) would request 
more nodes than they needed, check how fragmented kernel hugepages were 
on each node, and then exclude the nodes that were too fragmented for 
their needs.
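For the hugepage case, a per-node check could look like the Python sketch below. It parses the kernel's /proc/buddyinfo (free blocks per allocation order, per memory zone) and counts how many hugepage-sized contiguous blocks are still free. It assumes 4 KiB base pages and 2 MiB hugepages (order 9, as on x86_64) and ignores hugepages already reserved in the kernel's hugetlb pool; free_hugepage_blocks and node_is_usable are hypothetical names, and the threshold is up to the job.

```python
def free_hugepage_blocks(buddyinfo_text, hugepage_order=9):
    """Count hugepage-sized free blocks listed in /proc/buddyinfo text."""
    total = 0
    for line in buddyinfo_text.splitlines():
        # Expected layout: "Node 0, zone Normal c0 c1 c2 ..." where ck is the
        # number of free blocks of order k (2**k contiguous base pages).
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(c) for c in fields[4:]]
        for order, count in enumerate(counts):
            if order >= hugepage_order:
                # An order-k block contains 2**(k - hugepage_order) hugepages.
                total += count << (order - hugepage_order)
    return total


def node_is_usable(buddyinfo_text, needed_hugepages, hugepage_order=9):
    """True if enough contiguous memory remains for the requested hugepages."""
    return free_hugepage_blocks(buddyinfo_text, hugepage_order) >= needed_hugepages
```

Each node would run this on the contents of /proc/buddyinfo (for example one task per node), report whether it passes, and the job would then release the nodes that fail.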


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Diego Zuccato
IIUC it's not possible to increase resource usage once the job is 
started: it would mess up the scheduler and MPI comms (probably).


But I also think you're trying to find a problem for a "solution". Just 
state the problem you're facing instead of proposing a solution :)
What software are you running? How does it detect that a resize is 
needed? How would it handle the expansion?


Diego


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



[slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Rahmanpour Koushki, Maysam
Dear Slurm Mailing List,


I hope this email finds you well. I am currently working on a project that 
requires the ability to dynamically shrink or expand the set of nodes allocated 
to running jobs in Slurm. However, I am facing some challenges and would greatly 
appreciate your assistance and expertise in finding a solution.

In my research, I came across the following resources:

  1.  Slurm Advanced Usage Tutorial: I found a tutorial 
(https://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf) that 
discusses advanced features of Slurm. It mentions the possibility of assigning 
and deassigning nodes to a job, which is exactly what I need. However, the 
tutorial refers to the FAQ for more detailed information.

  2.  Stack Overflow question: I also came across a related question on Stack 
Overflow 
(https://stackoverflow.com/questions/49398201/how-to-update-job-node-number-in-slurm) 
that discusses updating the number of nodes allocated to a job in Slurm. The 
answer suggests that it is indeed possible, but again, it refers to the FAQ for 
further details.

Upon reviewing the current FAQ, I found that it states node shrinking is only 
possible for pending jobs. Unfortunately, it does not provide additional 
information or examples to clarify if this functionality can be extended to 
running jobs.

I would be grateful if anyone could provide insight into the following:

  1.  Is it possible to dynamically shrink or expand the set of nodes allocated 
to a running job in Slurm? If so, how can it be achieved?

  2.  Are there any alternative methods or workarounds to accomplish dynamic 
node scaling for running jobs in Slurm?

I kindly request your guidance, personal experiences, or any relevant resources 
that could shed light on this topic. Your expertise and assistance would 
greatly help me in successfully completing my project.

Thank you in advance for your time and support.

Best regards,


Maysam


Johannes Gutenberg University of Mainz