[slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Thread Rahmanpour Koushki, Maysam
Dear Slurm Mailing List,


I hope this email finds you well. I am currently working on a project that requires the ability to dynamically shrink or expand the set of nodes allocated to running jobs in Slurm. However, I am facing some challenges and would greatly appreciate your assistance and expertise in finding a solution.

In my research, I came across the following resources:

  1.  Slurm Advanced Usage Tutorial: I found a tutorial (https://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf) that discusses advanced features of Slurm. It mentions the possibility of assigning nodes to, and releasing nodes from, a job, which is exactly what I need. However, the tutorial refers to the FAQ for more detailed information.

  2.  Stack Overflow Question: I also came across a related question on Stack 
Overflow 
(https://stackoverflow.com/questions/49398201/how-to-update-job-node-number-in-slurm)
 that discusses updating the node number for a job in Slurm. The answer 
suggests that it is indeed possible, but again, it refers to the FAQ for 
further details.

Upon reviewing the current FAQ, I found that it states node shrinking is only possible for pending jobs. Unfortunately, it does not provide additional information or examples to clarify whether this functionality extends to running jobs.

I would be grateful if anyone could provide insight into the following:

  1.  Is it possible to dynamically shrink or expand nodes for running jobs in 
Slurm? If so, how can it be achieved?

  2.  Are there any alternative methods or workarounds to accomplish dynamic 
node scaling for running jobs in Slurm?

I kindly request your guidance, personal experiences, or any relevant resources 
that could shed light on this topic. Your expertise and assistance would 
greatly help me in successfully completing my project.

Thank you in advance for your time and support.

Best regards,


Maysam


Johannes Gutenberg University of Mainz



Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Thread Diego Zuccato
IIUC it's not possible to increase resource usage once the job has started: it would mess up the scheduler and the MPI communications (probably).


But I also think you're trying to find a problem for a "solution". Just 
state the problem you're facing instead of proposing a solution :)
What software are you running? How does it detect that a resize is 
needed? How would it handle the expansion?


Diego

On 28/06/2023 13:02, Rahmanpour Koushki, Maysam wrote:


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] [ext] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Thread Hagdorn, Magnus Karl Moritz
Hi Maysam,
you need to describe your job a little more. In the past I have used a
taskfarm approach [1] with worker jobs submitted to the cluster as a
job array. This way the system could grow and shrink depending on
available tasks/compute nodes.
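
For illustration, the general shape of such a worker job array might look like the sketch below (the script contents, array bounds, tasks.txt file and program name are placeholders, not the actual taskfarm from [1]):

#!/bin/bash
#SBATCH --job-name=worker
#SBATCH --array=1-100%20   # 100 worker tasks, at most 20 running at once
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
# Each array element picks up one unit of work from a shared task list
# (here simply line N of a plain text file).
task=$(sed -n "${SLURM_ARRAY_TASK_ID}p" tasks.txt)
srun ./process_one_task "$task"
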
Regards
Magnus

[1] http://doi.org/10.5334/jors.393


On Wed, 2023-06-28 at 11:02 +0000, Rahmanpour Koushki, Maysam wrote:

-- 
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
 
magnus.hagd...@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-helpd...@charite.de




[slurm-users] Slurm Rest API error

2023-06-28 Thread Ozeryan, Vladimir
Hello everyone,

I am trying to get the Slurm REST API working.

JWT is configured and a token generated. All daemons (slurmdbd, slurmctld and slurmrestd) are configured and running. I can successfully query the Slurm API as the "slurm" user, but that's it.
bash-4.2$ echo -e "GET /slurm/v0.0.39/jobs HTTP/1.1\r\nAccept: */*\r\n" | slurmrestd - That works.

But as my user I get the following error:

[user@sched01 slurm-23.02.3]$ curl localhost:6820/slurm/v0.0.39/diag --header "X-SLURM-USER-NAME: $USER" --header "X-SLURM-USER-TOKEN: $SLURM_JWT"
HTTP/1.1 500 INTERNAL ERROR
Connection: Close
Content-Length: 833
Content-Type: application/json

{
  "meta": {
    "plugin": {
      "type": "openapi\/v0.0.39",
      "name": "Slurm OpenAPI v0.0.39",
      "data_parser": "v0.0.39"
    },
    "client": {
      "source": "[localhost]:55960"
    },
    "Slurm": {
      "version": {
        "major": 23,
        "micro": 3,
        "minor": 2
      },
      "release": "23.02.3"
    }
  },
  "errors": [
    {
      "description": "openapi_get_db_conn() failed to open slurmdb connection",
      "error_number": 7000,
      "error": "Unable to connect to database",
      "source": "init_connection"
    },
    {
      "description": "slurm_get_statistics() failed to get slurmctld statistics",
      "error_number": -1,
      "error": "Unspecified error",
      "source": "_op_handler_diag"
    }
  ],
  "warnings": [
  ],
  "statistics": null
}
Thank you,

Vlad Ozeryan
AMDS - AB1 Linux-Support
vladimir.ozer...@jhuapl.edu
Ext. 23966



Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm

2023-06-28 Thread Chris Samuel

On 28/6/23 04:02, Rahmanpour Koushki, Maysam wrote:

Upon reviewing the current FAQ, I found that it states node shrinking is 
only possible for pending jobs. Unfortunately, it does not provide 
additional information or examples to clarify if this functionality can 
be extended to running jobs.


You can definitely release nodes from a running job. What I believe the FAQ is saying is that you cannot do something like change the number of cores per node or the memory you requested once a job is running.
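
For reference, the shrink procedure described in the Slurm FAQ looks roughly like this (job ID and node count are placeholders; check the FAQ for the exact steps on your Slurm version):

scontrol update JobId=<jobid> NumNodes=2   # keep only 2 of the allocated nodes
# Slurm then writes slurm_job_<jobid>_resize.sh in the job's working directory;
# sourcing it updates SLURM_NODELIST, SLURM_JOB_NUM_NODES etc. for later job steps.
. slurm_job_<jobid>_resize.sh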


As for why you'd do that: before we set up a mechanism to automatically reboot nodes to address this, we had people who would request more nodes than they needed, check how fragmented kernel hugepages were on each node, and then release the nodes where hugepages were too fragmented for their needs.
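
A rough sketch of what that check-and-release could look like from inside a job script (the hugepage test and the node names are purely illustrative):

# Report free hugepages on every allocated node.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(grep HugePages_Free /proc/meminfo)"'
# Keep only the nodes that look usable, using the same shrink mechanism as above:
scontrol update JobId=$SLURM_JOB_ID NodeList=node[01-02]
. slurm_job_${SLURM_JOB_ID}_resize.sh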


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Slurm Rest API error

2023-06-28 Thread Brian Andrus

Vlad,

Actually, it looks like it is working. You are using the v0.0.39 plugin for the parser, which is trying to use OpenAPI calls. Unless you compiled with OpenAPI support, that won't work.


Try the v0.0.37 version and you may see a simpler result that succeeds.
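
For example, something along these lines, assuming the v0.0.37 plugin is still built into your slurmrestd:

curl -s localhost:6820/slurm/v0.0.37/diag \
  --header "X-SLURM-USER-NAME: $USER" \
  --header "X-SLURM-USER-TOKEN: $SLURM_JWT"

If the token has expired in the meantime, a fresh one can usually be obtained with scontrol token (e.g. export $(scontrol token lifespan=3600)), provided token creation is allowed for regular users on your site.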


Brian Andrus

On 6/28/2023 11:05 AM, Ozeryan, Vladimir wrote:

