We believe we have resolved this by recreating some recently deleted QOS.

The controller now starts OK. But the QOS have been recreated with new IDs, so 
it's unclear how that would have fixed it.

And if the problem was an invalid QOS, why wasn't this picked up and reported 
rather than causing a segfault?

The code in question is at 
https://github.com/SchedMD/slurm/blob/slurm-20.11/src/common/slurmdb_defs.c#L548, 
and the backtrace from the core dump is:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 147686]
[New LWP 147687]
[New LWP 147688]
[New LWP 147690]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -D -vvv'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4f423c4671 in bit_set (b=b@entry=0x2e06c40, bit=bit@entry=-518653972) at bitstring.c:237
237             b[_bit_word(bit)] |= _bit_mask(bit);
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-20.11.9-1.el7.x86_64
(gdb) bt
#0  0x00007f4f423c4671 in bit_set (b=b@entry=0x2e06c40, bit=bit@entry=-518653972) at bitstring.c:237
#1  0x00007f4f4248af5e in _set_qos_bit_from_string (valid_qos=valid_qos@entry=0x2e06c40, name=<optimized out>) at slurmdb_defs.c:548
#2  0x00007f4f4248eedc in set_qos_bitstr_from_list (valid_qos=0x2e06c40, qos_list=<optimized out>) at slurmdb_defs.c:2288
#3  0x00007f4f423b896d in _set_assoc_parent_and_user (assoc=assoc@entry=0x25e3500, reset=reset@entry=0) at assoc_mgr.c:885
#4  0x00007f4f423bfae7 in _post_assoc_list () at assoc_mgr.c:1034
#5  0x00007f4f423c2093 in _get_assoc_mgr_assoc_list (enforce=27, db_conn=0x20f38d0) at assoc_mgr.c:1584
#6  assoc_mgr_init (db_conn=0x20f38d0, args=args@entry=0x7fff2bc95d10, db_conn_errno=<optimized out>) at assoc_mgr.c:2039
#7  0x000000000042bae5 in ctld_assoc_mgr_init () at controller.c:2323
#8  0x000000000042de95 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:620
(gdb)
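
The bit value in frame #0 looks like the real problem: with bit=-518653972, the 
indexing on line 237 lands tens of megabytes before the start of the bitstring. 
A quick standalone back-of-the-envelope check (the 64-bit word size below is our 
assumption about a typical bitstring layout, not Slurm's exact macros):

/* Standalone arithmetic check, compile with: cc demo.c && ./a.out
 * Assumes 64-bit words; Slurm's real _bit_word()/_bit_mask() macros may
 * differ in detail, but the sign and size of the offset is the point. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int64_t bit  = -518653972;   /* value from the core dump */
        int64_t word = bit / 64;     /* which 64-bit word would hold it */

        printf("word index %lld -> roughly %lld MB before the allocation\n",
               (long long) word, (long long) (-word * 8 / (1024 * 1024)));
        return 0;
}

That offset is well outside anything mapped, hence the SIGSEGV rather than a 
clean error.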


The problem line is:

(*(my_function))(valid_qos, bit);

But 17 lines above that, the line

xassert(valid_qos);

appears to check that the QOS bitmap is valid - but in this case the QOS existed 
and was subsequently deleted - should there be more checks here for a deleted QOS?
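
For what it's worth, the kind of extra check we have in mind would look roughly 
like the sketch below. This is not the real _set_qos_bit_from_string() from 
slurmdb_defs.c, just a rough reconstruction of its shape (the function name and 
details are ours) with an explicit range check on the parsed bit, so that a stale 
or negative QOS id is reported instead of being handed to bit_set():

/* Rough sketch only, not the real Slurm code: reject QOS ids that fall
 * outside the valid_qos bitstring before touching it. Uses Slurm's
 * bitstring/log APIs; header paths are approximate. */
#include <stdlib.h>
#include "slurm/slurm_errno.h"
#include "src/common/bitstring.h"
#include "src/common/log.h"
#include "src/common/xassert.h"

static int _set_qos_bit_checked(bitstr_t *valid_qos, char *name)
{
        void (*my_function)(bitstr_t *b, bitoff_t bit) = bit_set;
        long long bit;

        xassert(valid_qos);             /* only catches a NULL bitstring */

        if (!name || !name[0])
                return SLURM_ERROR;

        if (name[0] == '-') {           /* sketch: '-' selects bit_clear */
                my_function = bit_clear;
                name++;
        } else if (name[0] == '+')
                name++;

        bit = atoll(name);

        /*
         * The extra guard: a negative or out-of-range index (the core
         * dump above shows bit = -518653972) must never reach
         * bit_set()/bit_clear().
         */
        if ((bit < 0) || (bit >= bit_size(valid_qos))) {
                error("QOS id %lld is outside the valid_qos bitstring", bit);
                return SLURM_ERROR;
        }

        (*(my_function))(valid_qos, (bitoff_t) bit);
        return SLURM_SUCCESS;
}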

Would an assert have been triggered if we had compiled Slurm with 
--enable-developer, as per https://slurm.schedmd.com/faq.html#debug?
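
From what we can see, even with asserts enabled it probably wouldn't have helped 
here: xassert(valid_qos) only checks that the bitstring pointer is non-NULL, not 
that the bit index is in range. As a generic illustration (demo_xassert below is 
our own stand-in, not a copy of src/common/xassert.h), an NDEBUG-guarded assert 
behaves roughly like this:

/* Illustrative only - a typical NDEBUG-guarded assert macro. If the build
 * defines NDEBUG the check compiles away entirely; when active it aborts
 * on a false expression, which for xassert(valid_qos) would mean a NULL
 * pointer, not a bad bit index. */
#include <stdio.h>
#include <stdlib.h>

#ifndef NDEBUG
#  define demo_xassert(expr)                                              \
        do {                                                              \
                if (!(expr)) {                                            \
                        fprintf(stderr, "assertion failed: %s\n", #expr); \
                        abort();                                          \
                }                                                         \
        } while (0)
#else
#  define demo_xassert(expr) ((void) 0)
#endif

So demo_xassert(valid_qos) would only fire if valid_qos itself were NULL; a 
negative bit passed to bit_set() gets past it either way.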

Many thanks,

Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Luke 
Sudbery
Sent: 07 July 2022 12:28
To: slurm-us...@schedmd.com
Subject: [slurm-users] Slurmctl seg faulting

Hello,

After a restart this morning*, slurmctld is starting but then failing before it 
is fully up and running:


slurmctld: debug2: user user1 default acct is account1
slurmctld: debug2: assoc 68939(account1, user2) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user2 default acct is account1
slurmctld: debug2: assoc 26396(account1, user3) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user3 default acct is account1
slurmctld: debug2: assoc 26390(account1, user4) has direct parent of 26369(account1, (null))
slurmctld: debug2: user user4 default acct is account1
slurmctld: debug2: assoc 26342(account2, (null)) has direct parent of 3(root, (null))
slurmctld: debug2: assoc 76299(account2, rq-worker) has direct parent of 26342(account2, (null))
Segmentation fault (core dumped)
[root@bb-er-slurm01 state]#

Which gives us very little to go on.

Earlier, there was a long list of

slurmctld: error: assoc 62346 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61271 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61221 doesn't have access to it's default qos 'qos1'
slurmctld: error: assoc 61215 doesn't have access to it's default qos 'qos1'
shortly before the segfault. We have since cleared those by setting the correct 
default QOS with sacctmgr, but the controller is still segfaulting.

*The restart was initiated by our HA cluster (pacemaker/corosync), and at one 
point two copies of slurmdbd may have been running. The second copy appeared to 
detect this and back off? We're not sure:

Jul 07 09:06:47 bb-aw-slurm01.bear.cluster systemd[1]: Started Cluster Controlled slurmdbd.
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: debug:  Log file re-opened
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: pidfile not locked, assuming no running daemon
Jul 07 09:06:47 bb-aw-slurm01.bear.cluster slurmdbd[212307]: debug:  auth/munge: init: Munge authentication plugin loaded
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 10.3.32-MariaDB
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: _mysql_query_internal: deadlock detected attempt 1/10: 1213 WSREP replication failed. Check your wsrep connection state and retry the query.
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: mysql_query failed: 1047 WSREP has not yet prepared node for application use
                                                             create table if not exists table_defs_table (creation_time int unsigned not null, mod_time int unsigned default 0 not null, table_name text not null,
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster slurmdbd[212307]: error: cannot create accounting_storage context for accounting_storage/mysql
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: Unit slurmdbd.service entered failed state.
Jul 07 09:06:52 bb-aw-slurm01.bear.cluster systemd[1]: slurmdbd.service failed.

Either way, we suspect database corruption may be the problem, but the slurmdbd 
logs don't give us anything to go on.

So, questions:

  1.  How can we troubleshoot further? I am currently attempting to rebuild 
Slurm with debugging symbols (https://slurm.schedmd.com/faq.html#debug), as gdb 
is not giving us any useful information at the moment.
  2.  What are the implications of restoring the database? How does Slurm 
reconcile the database with the state save directory?

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.
