Re: [slurm-users] Problem with accounting/slurmdbd
The second one seems to be solved. The other seems to be a general MySQL login problem. Thank you!

On Tue, 12 Nov 2019 at 04:45, Brian Andrus wrote:

> That second one can happen as a race condition. It may be doing an update
> or running a report or what-not when you ran your command.
>
> If the issue persists, restart mysql and slurmdbd.
>
> Brian Andrus
>
> On 11/11/2019 2:10 AM, Uwe Seher wrote:
>
> Hello!
> I would like to use accounting via slurmdbd/mariadb and have some problems
> with the connection to the database.
> When I try to connect via sacct or sacctmgr as a non-root user, the
> connection is refused outright:
>
>   sacctmgr: add cluster MPI_IBK
>    Adding Cluster(s)
>     Name = mpi_ibk
>   Would you like to commit changes? (You have 30 seconds to decide)
>   (N/y): y
>   Problem adding clusters: Access/permission denied
>
> I think this is related to the second problem, which occurs when using
> sacctmgr as root:
>
>   sacctmgr: add cluster name=mpi_ibk
>    Adding Cluster(s)
>     Name = mpi_ibk
>   Would you like to commit changes? (You have 30 seconds to decide)
>   (N/y): y
>   Database is busy or waiting for lock from other user.
>
> The first problem is caused by a lack of configuration: by default only the
> user 'root' is set up in the database and able to start transactions.
> For the second one I have no idea. The database is used only for Slurm, I
> can log in with the configured user, and all daemons have been restarted
> and are working fine.
> Authentication inside Slurm should work with the default munge service, and
> I think it does work to some extent, because the connection seems to be
> established. But I cannot do any configuration, so no further logging is
> possible. Below is some further information.
>
> Thank you in advance for any hints concerning this issue.
>
> Regards
> Uwe Seher
>
> The accounting setup in slurm.conf is the following:
>
>   # ACCOUNTING
>   JobAcctGatherType=jobacct_gather/linux
>   JobAcctGatherFrequency=30
>   # file
>   JobCompType=jobcomp/filetxt
>   JobCompLoc=/var/log/slurm_jobs.log
>   #AccountingStorageType=accounting_storage/filetxt
>   #AccountingStorageLoc=/var/log/slurm_acc.log
>   #slurmdb
>   AccountingStorageType=accounting_storage/slurmdbd
>   AccountingStorageHost=localhost
>   #AccountingStoragePass=*
>   AccountingStorageUser=slurm
>
> sacctmgr show configuration shows this:
>
>   sacctmgr show configuration
>   Configuration data as of 2019-11-11T10:58:04
>   AccountingStorageBackupHost = (null)
>   AccountingStorageHost       = localhost
>   AccountingStorageLoc        = N/A
>   AccountingStoragePass       = (null)
>   AccountingStoragePort       = 6819
>   AccountingStorageType       = accounting_storage/slurmdbd
>   AccountingStorageUser       = N/A
>   AuthType                    = auth/munge
>   MessageTimeout              = 10 sec
>   PluginDir                   = /usr/lib64/slurm
>   PrivateData                 = none
>   SlurmUserId                 = slurm(400)
>   SLURM_CONF                  = /etc/slurm/slurm.conf
>   SLURM_VERSION               = 17.11.13
>   TCPTimeout                  = 2 sec
>   TrackWCKey                  = 0
>
>   SlurmDBD configuration:
>   ArchiveDir                  = /tmp
>   ArchiveEvents               = No
>   ArchiveJobs                 = No
>   ArchiveResvs                = No
>   ArchiveScript               = (null)
>   ArchiveSteps                = No
>   ArchiveSuspend              = No
>   ArchiveTXN                  = No
>   ArchiveUsage                = No
>   AuthInfo                    = (null)
>   AuthType                    = auth/munge
>   BOOT_TIME                   = 2019-11-11T09:29:01
>   CommitDelay                 = No
>   DbdAddr                     = localhost
>   DbdBackupHost               = (null)
>   DbdHost                     = localhost
>   DbdPort                     = 6819
>   DebugFlags                  = (null)
>   DebugLevel                  = verbose
>   DebugLevelSyslog            = quiet
>   DefaultQOS                  = (null)
>   LogFile                     = /var/log/slurmdbd.log
>   MaxQueryTimeRange           = UNLIMITED
>   MessageTimeout              = 10 secs
>   PidFile                     = /var/run/slurm/slurmdbd.pid
>   PluginDir                   = /usr/lib64/slurm
>   PrivateData                 = none
>   PurgeEventAfter             = NONE
>   PurgeJobAfter               = NONE
>   PurgeResvAfter              = NONE
>   PurgeStepAfter              = NONE
>   PurgeSuspendAfter           = NONE
>   PurgeTXNAfter               = NONE
>   PurgeUsageAfter             = NONE
>   SLURMDBD_CONF               = /etc/slurm/slurmdbd.conf
>   SLURMDBD_VERSION            = 17.11.13
>   SlurmUser                   = slurm(400)
>   StorageBackupHost           = (null)
>   StorageHost                 = localhost
>   StorageLoc                  = slurm_acct_db
>   StoragePort                 = 3306
>   StorageType                 = accounting_storage/mysql
>   StorageUser                 = slurm
>   TCPTimeout                  = 2 secs
>   TrackWCKey                  = No
>   TrackSlurmctldDown          = No
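
PS: In case the MySQL login really is the culprit, the grants I set up for the storage user are along these lines (reconstructed from memory, so treat it as a sketch; the password is of course a placeholder):

  CREATE DATABASE IF NOT EXISTS slurm_acct_db;
  CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY 'change_me';
  GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
  FLUSH PRIVILEGES;

StorageUser/StoragePass in slurmdbd.conf match this user, and logging in by hand with "mysql -u slurm -p slurm_acct_db" works on the command line.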
Re: [slurm-users] Problem with accounting/slurmdbd
Just for completeness: there was a lock in the database when creating a table, as you can see with:

  MariaDB [slurm_acct_db]> show full processlist;

   Id | User        | Host      | db            | Command | Time | State                           | Info                        | Progress
  ----+-------------+-----------+---------------+---------+------+---------------------------------+-----------------------------+---------
    1 | system user |           | NULL          | Daemon  | NULL | InnoDB purge coordinator        | NULL                        | 0.000
    3 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    4 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    2 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    5 | system user |           | NULL          | Daemon  | NULL | InnoDB shutdown handler         | NULL                        | 0.000
   11 | slurm       | localhost | slurm_acct_db | Sleep   |  943 |                                 | NULL                        | 0.000
   12 | slurm       | localhost | slurm_acct_db | Sleep   |   13 |                                 | NULL                        | 0.000
   20 | slurm       | localhost | slurm_acct_db | Query   |  307 | Waiting for table metadata lock | create table ... (see below) | 0.000
   22 | root        | localhost | slurm_acct_db | Query   |    0 | init                            | show full processlist       | 0.000

The blocked statement (Id 20) was:

  create table if not exists "mpi_ibk_event_table" (`time_start` bigint unsigned not null,
  `time_end` bigint unsigned default 0 not null, `node_name` tinytext default '' not null,
  `cluster_nodes` text not null default '', `reason` tinytext not null,
  `reason_uid` int unsigned default 0xfffe not null, `state` smallint unsigned default 0 not null,
  `tres` text not null default '', primary key (node_name(20), time_start)) engine='innodb'

So I think this produced the second issue. The first issue is solved too, but it is not entirely clear why. My explanation is that I assumed "systemctl restart mysql" would restart the whole server (as it does for postgres ;)), but it does not do what I thought it does. After a dedicated stop/start procedure everything works like a charm.

Thank you for your help!
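
PS: For the archives, the sequence that finally got things going was roughly the following (a sketch only; the unit names depend on the distribution, e.g. mysql vs. mariadb):

  systemctl stop slurmdbd
  systemctl stop mariadb
  systemctl start mariadb
  systemctl start slurmdbd
  sacctmgr show cluster    # check that the connection works again

A plain "systemctl restart mysql" did not have the same effect here.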
[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Hi,

I run gmx 2019 using GPUs. There are 4 GPUs in my GPU hosts, and I have Slurm with gres=gpu configured.

1. If I submit a job with --gres=gpu:1 then GPU #0 is identified and used (-gpu_id $CUDA_VISIBLE_DEVICES).

2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and gets selected, but GPU #0 is the one gmx identifies as a compatible GPU. From the output:

  gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu -npme 1 -ntmpi 4

  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible

  Fatal error:
  You limited the set of compatible GPUs to a set that included ID #1, but that
  ID is not for a compatible GPU. List only compatible GPUs.

3. If I log in to that node and run the mdrun command written into the output in the previous step, it selects the right GPU and runs as expected.

$CUDA_DEVICE_ORDER is set to PCI_BUS_ID.

I cannot tell whether this is a Slurm configuration error or something on the GROMACS side, since $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I would expect GROMACS to detect all 4 GPUs.

Thanks for your help and suggestions,
Tamas

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47      | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
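
PS: For completeness, the gres-related configuration on the node is along the following lines (the node name, device paths and type string are from memory, so treat them as approximate):

  # gres.conf on the GPU node
  Name=gpu Type=gtx1080ti File=/dev/nvidia0
  Name=gpu Type=gtx1080ti File=/dev/nvidia1
  Name=gpu Type=gtx1080ti File=/dev/nvidia2
  Name=gpu Type=gtx1080ti File=/dev/nvidia3

  # slurm.conf
  GresTypes=gpu
  NodeName=gpu01 Gres=gpu:4 ...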
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs job running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you have reserved for that job.

Here's a verification program you can run to check that two different GPU jobs see different GPU devices (compile with nvcc):

=
// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }

    return 0;
}
=

When run from two simultaneous jobs on the same node (each with a gres=gpu), I get:

=
[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0
=

=
[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0
=

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
[slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
We currently have version 18.08.7 installed on our cluster and want to upgrade to 19.05.3. So I wanted to start small and installed it on one of our compute nodes. But if I start 'slurmd' there, our slurmctld complains:

{{{
[2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: Unspecified error
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible versions of client and server code
}}}

I have read about the RPC protocol:
 * https://slurm.schedmd.com/rpc.html

Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this setup supported and something else goes wrong?

Regards

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
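
PS: To confirm the mismatch I checked the versions on both sides; a quick sketch of the commands (run on the controller and on the test node respectively):

  slurmctld -V                                   # on the controller, should report 18.08.7 here
  slurmd -V                                      # on the upgraded compute node, should report 19.05.3
  scontrol show config | grep -i SLURM_VERSION   # what the controller believes it is running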
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Thanks for your suggestion. You are right, I do not have to deal with specific GPUs. (I have not tried to compile your code; I simply tested two GROMACS runs on the same node with --gres=gpu:1 options.)

On 11/13/19 5:17 PM, Renfro, Michael wrote:

> Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs job
> running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you
> have reserved for that job.

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47      | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
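
PS: For reference, the job script I used for the test looks roughly like this (the module name and the input prefix are placeholders for our local setup):

  #!/bin/bash
  #SBATCH --gres=gpu:1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8

  module load gromacs/2019
  # no -gpu_id here; gmx simply uses whatever GPU Slurm exposes to the job
  gmx mdrun -v -pin on -deffnm equi_nvt -nt $SLURM_CPUS_PER_TASK -nb gpu -pme gpu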
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 13-11-2019 18:04, Bas van der Vlies wrote:

> We currently have version 18.08.7 installed on our cluster and want to
> upgrade to 19.05.3. So I wanted to start small and installed it on one of
> our compute nodes. But if I start 'slurmd' there, our slurmctld complains:
>
> {{{
> [2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> [2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: Unspecified error
> [2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
> [2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> }}}
>
> I have read about the RPC protocol:
>  * https://slurm.schedmd.com/rpc.html
>
> Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this
> setup supported and something else goes wrong?

Hi Bas,

Your order of upgrading is *not recommended*, see for example page 6 in the presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page https://slurm.schedmd.com/publications.html

Versions may be mixed as follows:

  slurmdbd >= slurmctld >= slurmd >= commands

Perhaps you may find some useful further information in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

/Ole
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 11/13/19 10:42 AM, Ole Holm Nielsen wrote:

> Your order of upgrading is *not recommended*, see for example page 6 in the
> presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page
> https://slurm.schedmd.com/publications.html

Also see the documentation for upgrading here:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

As Ole says, *always* upgrade slurmdbd first, then slurmctld and finally the slurmd's. This is required because of the way the RPC protocol support for older versions works.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
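
PS: In practice the sequence looks roughly like this (a sketch only; package names and service units depend on how you build and deploy Slurm, and you should back up the database and read the upgrade notes first):

  # 1. on the slurmdbd host
  systemctl stop slurmdbd
  # upgrade the Slurm packages here
  systemctl start slurmdbd

  # 2. on the controller
  systemctl stop slurmctld
  # upgrade the Slurm packages here
  systemctl start slurmctld

  # 3. then on each compute node
  systemctl stop slurmd
  # upgrade the Slurm packages here
  systemctl start slurmd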
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 13-11-2019, Ole Holm Nielsen wrote:

> Hi Bas,
>
> Your order of upgrading is *not recommended*, see for example page 6 in the
> presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page
> https://slurm.schedmd.com/publications.html
>
> Versions may be mixed as follows:
>
>   slurmdbd >= slurmctld >= slurmd >= commands

Thanks a lot Ole. This helps a lot.

Regards

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 11/13/19 8:36 PM, Christopher Samuel wrote:

> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> As Ole says, *always* upgrade slurmdbd first, then slurmctld and finally the
> slurmd's. This is required because of the way the RPC protocol support for
> older versions works.

Thanks Chris, I also found the above link. I had read the RPC documentation wrong and now have the correct procedure for upgrading.

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
On Wednesday, 13 November 2019 10:11:30 AM PST Tamas Hegedus wrote:

> Thanks for your suggestion. You are right, I do not have to deal with
> specific GPUs. (I have not tried to compile your code; I simply tested two
> GROMACS runs on the same node with --gres=gpu:1 options.)

How are you controlling access to GPUs? Is that via cgroups?

If so you should be fine, but if you're not using cgroups to control access then you may well find that they are sharing the same GPU.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
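
PS: If you're not already constraining devices, the usual setup is along these lines (a sketch only; the exact settings depend on your cgroup setup and Slurm version):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes

With ConstrainDevices=yes a job that asked for --gres=gpu:1 only gets access to its allocated /dev/nvidia* device, so GROMACS cannot accidentally touch a GPU that belongs to another job.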