Re: [slurm-users] issue with mpirun when using through slurm / pmix

2021-10-21 Thread Bas van der Vlies
I have no other solution; this was what fixed it at our site.

> On 22 Oct 2021, at 03:19, pankajd  wrote:
> 
> Thanks, but after setting PMIX_MCA_psec=native, mpirun now hangs and does
> not produce any output.




Re: [slurm-users] issue with mpirun when using through slurm / pmix

2021-10-21 Thread pankajd
Thanks, but after setting PMIX_MCA_psec=native, mpirun now hangs and does
not produce any output.

On October 21, 2021 at 9:21 PM Bas van der Vlies wrote:
> At our site we also had this problem: the pmix lib was compiled with
> munge support. We solved it by setting this environment variable:
> * export PMIX_MCA_psec=native or export PMIX_MCA_psec=none

For assimilation and dissemination of knowledge, visit cakes.cdac.in 



[ C-DAC is on Social-Media too. Kindly follow us at:
Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ]

This e-mail is for the sole use of the intended recipient(s) and may
contain confidential and privileged information. If you are not the
intended recipient, please contact the sender by reply e-mail and destroy
all copies and the original message. Any unauthorized review, use,
disclosure, dissemination, forwarding, printing or copying of this email
is strictly prohibited and appropriate legal action will be taken.




Re: [slurm-users] issue with mpirun when using through slurm / pmix

2021-10-21 Thread Bas van der Vlies
At our site we also had this problem: the pmix lib was compiled with
munge support. We solved it by setting this environment variable:

 *  export PMIX_MCA_psec=native  or  export PMIX_MCA_psec=none
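
As a rough sketch of how this workaround might be tried inside the interactive allocation: the variable can be set for a single command instead of being exported for the whole shell. The `with_psec` helper name is made up for illustration; only the `PMIX_MCA_psec` variable and its `native` / `none` values come from this thread, and whether either value works depends on how PMIx was built on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or PMIx): run one command with the
# PMIx security (psec) component selected through PMIX_MCA_psec.
# Usage: with_psec COMPONENT COMMAND [ARGS...]
with_psec() {
    component="$1"
    shift
    # The assignment prefix applies only to the command being launched.
    PMIX_MCA_psec="$component" "$@"
}

# Intended use on the compute node (assumes mpirun is on PATH there):
#   with_psec native mpirun -c 2 /bin/hostname
#   with_psec none   mpirun -c 2 /bin/hostname
```

Setting the variable per command makes it easy to compare `native` and `none` without leaving a stale export behind in the session.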

Regards,

Bas

--
Bas van der Vlies
| HPCV Supercomputing | Internal Services | SURF | https://userinfo.surfsara.nl |
| Science Park 140 | 1098 XG Amsterdam | Phone: +31208001300 |
| bas.vandervl...@surf.nl




[slurm-users] issue with mpirun when using through slurm / pmix

2021-10-21 Thread Pankaj Dorlikar
Hi,

I am using slurm-20.11.7 compiled with pmix-3.2.3. A job is submitted like
this:

srun -N 1 -c 2 --pty /bin/bash

On the allocated compute node, when I execute the command below, I get a
PMIx error with return value -46:

mpirun -c 2 /bin/hostname

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host: cnode9
Framework: psec
Component: munge
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS
--------------------------------------------------------------------------
[cnode9:2708617] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 237







Re: [slurm-users] [EXT] Re: Missing data in sreport for a time period in slurm

2021-10-21 Thread Sean Crosby
That's good to hear. I had the same issue a few years ago, also with Slurm 19, 
and that was the advice the Slurm support staff gave me.

I do recommend upgrading your Slurm though. sreport is much faster in later 
Slurm versions.

Sean

From: mshubham 
Sent: Thursday, 21 October 2021 21:54
To: slurm-users@lists.schedmd.com ; Sean Crosby 

Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period 
in slurm


Hi Sean,
After I changed those values yesterday, the data was rebuilt, and today I can
see sreport working again.
Thank you all so much for your help in solving this issue.


On October 20, 2021 at 6:38 PM mshubham  wrote:
Hi Sean,
Yes, I have changed those values to that time. After restarting slurmdbd, the
timestamps changed again to today's date, but sreport is still not showing
that year of data.

On October 19, 2021 at 10:55 AM Sean Crosby  wrote:
Did you change those values to a timestamp before May 2020 like I suggested? If 
you do that, then Slurm will regenerate the data needed for sreport to work 
properly.

Sean


From: mshubham 
Sent: Tuesday, 19 October 2021 16:20
To: slurm-users@lists.schedmd.com ; Sean Crosby 

Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period 
in slurm

Dear All,
Checking the values in the last_ran table, the hourly rollup shows today's date:
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1634617800 |   1634581800 |     1634581785 |
+---------------+--------------+----------------+
sreport shows the data for the last month, but not the data from May 2020
to Sept 2021, although the same data is visible through sacct.


On October 19, 2021 at 12:27 AM Sean Crosby  wrote:
sreport keeps track of when the last rollup calculations were done in the
database.

Open MySQL for your Slurm accounting database and run

select * from slurm_acct_db.clustername_last_ran_table;

where slurm_acct_db is your accounting database name (slurm_acct_db is 
default), and clustername is the name of your cluster in Slurm.

The numbers are UNIX timestamps, e.g. for mine:

select * from thespian_slurm_acct_db.thespian_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1634580000 |   1634562000 |     1633420706 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

1634580000 is Mon Oct 18 2021 18:00:00 GMT+0000
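
To read these values without doing the arithmetic by hand, something like the following could be used. It is only a sketch: it assumes GNU `date` (the `-d @SECONDS` form), and `epoch_to_utc` is a made-up helper name. The value shown is the `daily_rollup` from the table above.

```shell
#!/bin/sh
# Hypothetical helper: print a UNIX timestamp as a UTC date.
# Assumes GNU date; BSD/macOS date uses a different syntax.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# daily_rollup value from the table above
epoch_to_utc 1634562000
```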

If you set those values to something before 26 May 2020, you can then either 
wait for the rollups to happen again, or you can trigger them to start by 
restarting slurmdbd. Once they start, you can monitor their progress by running 
that SQL query again. Once they have caught up again, sreport should be back to 
normal.
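
The reset step described above might look roughly like this. It is a sketch under stated assumptions, not a tested procedure: it only prints the SQL rather than running it, the `make_reset_sql` helper and the 1 May 2020 cutoff are illustrative choices, GNU `date` is assumed, and the database and cluster names must be replaced with your own. Taking a backup of the accounting database first would be prudent.

```shell
#!/bin/sh
# Hypothetical helper: print an UPDATE that moves all three rollup markers
# back to a cutoff date, so that the rollups are redone from that point.
# Arguments: database name, cluster name, cutoff date (interpreted as UTC).
make_reset_sql() {
    db="$1"
    cluster="$2"
    # Convert the cutoff date to a UNIX timestamp (GNU date assumed).
    cutoff=$(date -u -d "$3" +%s) || return 1
    printf 'UPDATE %s.%s_last_ran_table SET hourly_rollup=%s, daily_rollup=%s, monthly_rollup=%s;\n' \
        "$db" "$cluster" "$cutoff" "$cutoff" "$cutoff"
}

# Any date before 26 May 2020 would do; 1 May 2020 is an arbitrary pick.
make_reset_sql slurm_acct_db clustername '2020-05-01 00:00:00'
```

The printed statement can be reviewed and then run in the mysql client. After that, restart slurmdbd (or wait for the next rollup) and re-run the select to watch the timestamps advance.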

Sean


From: slurm-users  on behalf of mshubham 

Sent: Tuesday, 19 October 2021 00:22
To: slurm-users@lists.schedmd.com 
Subject: [EXT] Re: [slurm-users] Missing data in sreport for a time period in 
slurm

Dear all,
As I recall, there were some runaway jobs about a month ago, and I used this
command to fix them. Since that date (15 Sept), the earlier data from
26 May 2020 to 14 Sept 2021 has been missing from sreport.
I hope there is no correlation between using this command and the data
missing in sreport.

I have a MySQL backup from that day, if you can suggest a way to integrate
that data into the Slurm tables in MySQL.
Or is there any other way to reinitialize the Slurm utilization tables from
scratch?


On October 18, 2021 at 5:38 PM mshubham  wrote:
Dear all,
"sacctmgr show runawayjobs" does not show any problematic jobs.

On 10/18/21 12:41 PM, mshubham wrote:
> Dear all,
> I am facing an issue in Slurm (v19.05.1) in which data from 26 May 2020 to
> 14 Sept 2021 is missing in sreport, but the same data is present through
> the sacct command. It was working fine a few days ago.
> Right now, we have to get utilization data from the sacct command for each user.
> It would be really helpful if this issue could be resolved.

Does "sacctmgr show runawayjobs" reveal any problematic jobs?

Thanks and Regards,
Shubham


Thanks and Regards,
Shubham Mehta
HPC Technology
CDAC Pune



Re: [slurm-users] [EXT] Re: Missing data in sreport for a time period in slurm

2021-10-21 Thread mshubham
Hi Sean,
After I changed those values yesterday, the data was rebuilt, and today I can
see sreport working again.
Thank you all so much for your help in solving this issue.

