Re: [slurm-users] Problem with accounting/slurmdbd
The second one seems to be solved. The other seems to be a general MySQL login problem. Thank you!

On Tue, 12 Nov 2019 at 04:45, Brian Andrus wrote:

> That second one can happen as a race condition. It may be doing an update
> or running a report or what-not when you ran your command.
>
> If the issue persists, restart mysql and slurmdbd.
>
> Brian Andrus
>
> On 11/11/2019 2:10 AM, Uwe Seher wrote:
>
> Hello!
> I would like to use accounting via slurmdbd/mariadb and have some problems
> with the connection to the database.
> When I try to connect via sacct or sacctmgr as a non-root user, the
> connection is refused outright:
>
>   sacctmgr: add cluster MPI_IBK
>    Adding Cluster(s)
>     Name = mpi_ibk
>   Would you like to commit changes? (You have 30 seconds to decide)
>   (N/y): y
>   Problem adding clusters: Access/permission denied
>
> I think this is related to the second problem, which occurs when using
> sacctmgr as root:
>
>   sacctmgr: add cluster name=mpi_ibk
>    Adding Cluster(s)
>     Name = mpi_ibk
>   Would you like to commit changes? (You have 30 seconds to decide)
>   (N/y): y
>   Database is busy or waiting for lock from other user.
>
> The first problem is caused by a lack of configuration: by default only the
> user 'root' is set up in the database and able to start transactions.
> For the second one I have no idea. The database is used only for Slurm, I
> can log in with the configured user, and all daemons have been restarted
> and are working fine.
> Authentication inside Slurm should work with the default munge service, and
> I think it does work to some extent, because the connection seems to be
> established. But I cannot do any configuration, so no further logging is
> possible. Below is some further information.
>
> Thank you in advance for any hints concerning this issue.
>
> Regards
> Uwe Seher
>
> The accounting setup in slurm.conf is the following:
>
>   # ACCOUNTING
>   JobAcctGatherType=jobacct_gather/linux
>   JobAcctGatherFrequency=30
>   # file
>   JobCompType=jobcomp/filetxt
>   JobCompLoc=/var/log/slurm_jobs.log
>   #AccountingStorageType=accounting_storage/filetxt
>   #AccountingStorageLoc=/var/log/slurm_acc.log
>   #slurmdb
>   AccountingStorageType=accounting_storage/slurmdbd
>   AccountingStorageHost=localhost
>   #AccountingStoragePass=*
>   AccountingStorageUser=slurm
>
> sacctmgr show configuration shows this:
>
>   sacctmgr show configuration
>   Configuration data as of 2019-11-11T10:58:04
>   AccountingStorageBackupHost = (null)
>   AccountingStorageHost       = localhost
>   AccountingStorageLoc        = N/A
>   AccountingStoragePass       = (null)
>   AccountingStoragePort       = 6819
>   AccountingStorageType       = accounting_storage/slurmdbd
>   AccountingStorageUser       = N/A
>   AuthType                    = auth/munge
>   MessageTimeout              = 10 sec
>   PluginDir                   = /usr/lib64/slurm
>   PrivateData                 = none
>   SlurmUserId                 = slurm(400)
>   SLURM_CONF                  = /etc/slurm/slurm.conf
>   SLURM_VERSION               = 17.11.13
>   TCPTimeout                  = 2 sec
>   TrackWCKey                  = 0
>
>   SlurmDBD configuration:
>   ArchiveDir                  = /tmp
>   ArchiveEvents               = No
>   ArchiveJobs                 = No
>   ArchiveResvs                = No
>   ArchiveScript               = (null)
>   ArchiveSteps                = No
>   ArchiveSuspend              = No
>   ArchiveTXN                  = No
>   ArchiveUsage                = No
>   AuthInfo                    = (null)
>   AuthType                    = auth/munge
>   BOOT_TIME                   = 2019-11-11T09:29:01
>   CommitDelay                 = No
>   DbdAddr                     = localhost
>   DbdBackupHost               = (null)
>   DbdHost                     = localhost
>   DbdPort                     = 6819
>   DebugFlags                  = (null)
>   DebugLevel                  = verbose
>   DebugLevelSyslog            = quiet
>   DefaultQOS                  = (null)
>   LogFile                     = /var/log/slurmdbd.log
>   MaxQueryTimeRange           = UNLIMITED
>   MessageTimeout              = 10 secs
>   PidFile                     = /var/run/slurm/slurmdbd.pid
>   PluginDir                   = /usr/lib64/slurm
>   PrivateData                 = none
>   PurgeEventAfter             = NONE
>   PurgeJobAfter               = NONE
>   PurgeResvAfter              = NONE
>   PurgeStepAfter              = NONE
>   PurgeSuspendAfter           = NONE
>   PurgeTXNAfter               = NONE
>   PurgeUsageAfter             = NONE
>   SLURMDBD_CONF               = /etc/slurm/slurmdbd.conf
>   SLURMDBD_VERSION            = 17.11.13
>   SlurmUser                   = slurm(400)
>   StorageBackupHost           = (null)
>   StorageHost                 = localhost
>   StorageLoc                  = slurm_acct_db
>   StoragePort                 = 3306
>   StorageType                 = accounting_storage/mysql
>   StorageUser                 = slurm
>   TCPTimeout                  = 2 secs
>   TrackWCKey                  = No
>   TrackSlurmctldDown          = No
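
PS: In case the MySQL login really is the culprit, the grants I set up for the storage user are along these lines (reconstructed from memory, so treat it as a sketch; the password is of course a placeholder):

  CREATE DATABASE IF NOT EXISTS slurm_acct_db;
  CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY 'change_me';
  GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
  FLUSH PRIVILEGES;

StorageUser/StoragePass in slurmdbd.conf match this user, and logging in by hand with "mysql -u slurm -p slurm_acct_db" works on the command line.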
Re: [slurm-users] Problem with accounting/slurmdbd
Just for completeness: there was a lock in the database when creating a table, as you can see with:

  MariaDB [slurm_acct_db]> show full processlist;

   Id | User        | Host      | db            | Command | Time | State                           | Info                        | Progress
  ----+-------------+-----------+---------------+---------+------+---------------------------------+-----------------------------+---------
    1 | system user |           | NULL          | Daemon  | NULL | InnoDB purge coordinator        | NULL                        | 0.000
    3 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    4 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    2 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker             | NULL                        | 0.000
    5 | system user |           | NULL          | Daemon  | NULL | InnoDB shutdown handler         | NULL                        | 0.000
   11 | slurm       | localhost | slurm_acct_db | Sleep   |  943 |                                 | NULL                        | 0.000
   12 | slurm       | localhost | slurm_acct_db | Sleep   |   13 |                                 | NULL                        | 0.000
   20 | slurm       | localhost | slurm_acct_db | Query   |  307 | Waiting for table metadata lock | create table ... (see below) | 0.000
   22 | root        | localhost | slurm_acct_db | Query   |    0 | init                            | show full processlist       | 0.000

The blocked statement (Id 20) was:

  create table if not exists "mpi_ibk_event_table" (`time_start` bigint unsigned not null,
  `time_end` bigint unsigned default 0 not null, `node_name` tinytext default '' not null,
  `cluster_nodes` text not null default '', `reason` tinytext not null,
  `reason_uid` int unsigned default 0xfffe not null, `state` smallint unsigned default 0 not null,
  `tres` text not null default '', primary key (node_name(20), time_start)) engine='innodb'

So I think this produced the second issue. The first issue is solved too, but it is not entirely clear why. My explanation is that I assumed "systemctl restart mysql" would restart the whole server (as it does for postgres ;)), but it does not do what I thought it does. After a dedicated stop/start procedure everything works like a charm.

Thank you for your help!
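
PS: For the archives, the sequence that finally got things going was roughly the following (a sketch only; the unit names depend on the distribution, e.g. mysql vs. mariadb):

  systemctl stop slurmdbd
  systemctl stop mariadb
  systemctl start mariadb
  systemctl start slurmdbd
  sacctmgr show cluster    # check that the connection works again

A plain "systemctl restart mysql" did not have the same effect here.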
[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Hi,

I run gmx 2019 using GPUs. There are 4 GPUs in my GPU hosts, and I have Slurm with gres=gpu configured.

1. If I submit a job with --gres=gpu:1 then GPU #0 is identified and used (-gpu_id $CUDA_VISIBLE_DEVICES).

2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and gets selected, but GPU #0 is the one gmx identifies as a compatible GPU. From the output:

  gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu -npme 1 -ntmpi 4

  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible

  Fatal error:
  You limited the set of compatible GPUs to a set that included ID #1, but that
  ID is not for a compatible GPU. List only compatible GPUs.

3. If I log in to that node and run the mdrun command written into the output in the previous step, it selects the right GPU and runs as expected.

$CUDA_DEVICE_ORDER is set to PCI_BUS_ID.

I cannot tell whether this is a Slurm configuration error or something on the GROMACS side, since $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I would expect GROMACS to detect all 4 GPUs.

Thanks for your help and suggestions,
Tamas

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47      | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
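
PS: For completeness, the gres-related configuration on the node is along the following lines (the node name, device paths and type string are from memory, so treat them as approximate):

  # gres.conf on the GPU node
  Name=gpu Type=gtx1080ti File=/dev/nvidia0
  Name=gpu Type=gtx1080ti File=/dev/nvidia1
  Name=gpu Type=gtx1080ti File=/dev/nvidia2
  Name=gpu Type=gtx1080ti File=/dev/nvidia3

  # slurm.conf
  GresTypes=gpu
  NodeName=gpu01 Gres=gpu:4 ...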
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs job running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you have reserved for that job.

Here's a verification program you can run to check that two different GPU jobs see different GPU devices (compile with nvcc):

=
// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }

    return 0;
}
=

When run from two simultaneous jobs on the same node (each with a gres=gpu), I get:

=
[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0
=

=
[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0
=

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
[slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
We currently have version 18.08.7 installed on our cluster and want to upgrade to 19.05.3. So I wanted to start small and installed it on one of our compute nodes. But if I start 'slurmd' there, our slurmctld complains:

{{{
[2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: Unspecified error
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible versions of client and server code
}}}

I have read about the RPC protocol:
 * https://slurm.schedmd.com/rpc.html

Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this setup supported and something else goes wrong?

Regards

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
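
PS: To confirm the mismatch I checked the versions on both sides; a quick sketch of the commands (run on the controller and on the test node respectively):

  slurmctld -V                                   # on the controller, should report 18.08.7 here
  slurmd -V                                      # on the upgraded compute node, should report 19.05.3
  scontrol show config | grep -i SLURM_VERSION   # what the controller believes it is running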
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
Thanks for your suggestion. You are right, I do not have to deal with specific GPUs. (I have not tried to compile your code; I simply tested two GROMACS runs on the same node with --gres=gpu:1 options.)

On 11/13/19 5:17 PM, Renfro, Michael wrote:

> Pretty sure you don't need to explicitly specify GPU IDs on a Gromacs job
> running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you
> have reserved for that job.

--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47      | mailto:ta...@hegelab.org
Budapest, 1094, Hungary | http://www.hegelab.org
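
PS: For reference, the job script I used for the test looks roughly like this (the module name and the input prefix are placeholders for our local setup):

  #!/bin/bash
  #SBATCH --gres=gpu:1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8

  module load gromacs/2019
  # no -gpu_id here; gmx simply uses whatever GPU Slurm exposes to the job
  gmx mdrun -v -pin on -deffnm equi_nvt -nt $SLURM_CPUS_PER_TASK -nb gpu -pme gpu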
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 13-11-2019 18:04, Bas van der Vlies wrote:

> We currently have version 18.08.7 installed on our cluster and want to
> upgrade to 19.05.3. So I wanted to start small and installed it on one of
> our compute nodes. But if I start 'slurmd' there, our slurmctld complains:
>
> {{{
> [2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> [2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: Unspecified error
> [2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
> [2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible versions of client and server code
> }}}
>
> I have read about the RPC protocol:
>  * https://slurm.schedmd.com/rpc.html
>
> Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this
> setup supported and something else goes wrong?

Hi Bas,

Your order of upgrading is *not recommended*, see for example page 6 in the presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page https://slurm.schedmd.com/publications.html

Versions may be mixed as follows:

  slurmdbd >= slurmctld >= slurmd >= commands

Perhaps you may find some useful further information in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

/Ole
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 11/13/19 10:42 AM, Ole Holm Nielsen wrote:

> Your order of upgrading is *not recommended*, see for example page 6 in the
> presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page
> https://slurm.schedmd.com/publications.html

Also see the documentation for upgrading here:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

As Ole says, *always* upgrade slurmdbd first, then slurmctld and finally the slurmd's. This is required because of the way the RPC protocol support for older versions works.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
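
PS: In practice the sequence looks roughly like this (a sketch only; package names and service units depend on how you build and deploy Slurm, and you should back up the database and read the upgrade notes first):

  # 1. on the slurmdbd host
  systemctl stop slurmdbd
  # upgrade the Slurm packages here
  systemctl start slurmdbd

  # 2. on the controller
  systemctl stop slurmctld
  # upgrade the Slurm packages here
  systemctl start slurmctld

  # 3. then on each compute node
  systemctl stop slurmd
  # upgrade the Slurm packages here
  systemctl start slurmd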
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 13-11-2019, Ole Holm Nielsen wrote:

> Hi Bas,
>
> Your order of upgrading is *not recommended*, see for example page 6 in the
> presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" on the page
> https://slurm.schedmd.com/publications.html
>
> Versions may be mixed as follows:
>
>   slurmdbd >= slurmctld >= slurmd >= commands

Thanks a lot Ole. This helps a lot.

Regards

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7
On 11/13/19 8:36 PM, Christopher Samuel wrote:

> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> As Ole says, *always* upgrade slurmdbd first, then slurmctld and finally the
> slurmd's. This is required because of the way the RPC protocol support for
> older versions works.

Thanks Chris, I also found the above link. I had read the RPC documentation wrong and now have the correct procedure for upgrading.

--
Bas van der Vlies | Operations, Support & Development | SURFsara
| Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300
| bas.vandervl...@surfsara.nl | www.surfsara.nl |
Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected
On Wednesday, 13 November 2019 10:11:30 AM PST Tamas Hegedus wrote:

> Thanks for your suggestion. You are right, I do not have to deal with
> specific GPUs. (I have not tried to compile your code; I simply tested two
> GROMACS runs on the same node with --gres=gpu:1 options.)

How are you controlling access to GPUs? Is that via cgroups?

If so you should be fine, but if you're not using cgroups to control access then you may well find that they are sharing the same GPU.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
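
PS: If you're not already constraining devices, the usual setup is along these lines (a sketch only; the exact settings depend on your cgroup setup and Slurm version):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes

With ConstrainDevices=yes a job that asked for --gres=gpu:1 only gets access to its allocated /dev/nvidia* device, so GROMACS cannot accidentally touch a GPU that belongs to another job.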