Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
Sorry, I forgot to mention: we use Slurm 18.08.7, by the way.

I just saw, in an earlier coredump, that another (earlier) line is involved:

2136:   if (row2[ASSOC2_REQ_MTPJ][0])

The corresponding mysql response was:

+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj | @mja | @mpt | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|     990 | 800 | NULL | NULL | 1000 | 1440  | NULL  | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |
+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
1 row in set (0.00 sec)

So here @mtpj is NULL, and slurmdbd segfaulted in line 2136. In the other coredump @mtpj was "1=8", so it was not NULL and line 2136 passed. But @mtpn was NULL, so it segfaulted in

2141:   if (row2[ASSOC2_REQ_MTPN][0])

Could anyone with a slurmdbd that is not segfaulting please run the call directly in the database (it is a procedure generated by slurmdbd that collects the parent limits of an association) and report the result here?

Best
Marcus

On 06.05.2020 at 09:49, Ben Polman wrote:
> On 06-05-2020 07:38, Chris Samuel wrote:
>
> We are experiencing exactly the same problem after mysql upgrade to 5.7.30,
> moving database to old mysql server running 5.6 solves the problem.
> Most likely downgrading mysql to 5.7.29 will work as well
>
> I have no clue which change in mysql-server is causing this
>
> best regards,
> Ben
>
>> On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:
>>
>>> Since this happens on a fresh new database, I just don't understand how I
>>> can get back to a basic functional state. This is exceedingly frustrating.
>>
>> I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this
>> only started when your colleague upgraded MySQL then this sounds like MySQL is
>> triggering this problem.
>>
>> We're running with MariaDB 10.x (from SLES15) without issues (our database is
>> huge).
>>
>> All the best,
>> Chris
Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
Hi,

same here :/

The segfault happens after this procedure call in mysql:

call get_parent_limits('assoc_table', 'rwth0515', 'rcc', 0);
select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos;

The mysql answer is:

+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj | @mja | @mpt | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|    5312 | 800 | NULL | NULL | 1000 | 1440  | 1=8   | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |
+---------+-----+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+

The segfault happens in as_mysql_assoc.c:

#0  0x2ae3dea6c05a in _cluster_get_assocs (mysql_conn=mysql_conn@entry=0x2ae3f4000d70, user=user@entry=0x2ae3e1feca90, assoc_cond=assoc_cond@entry=0x2ae3f40009f0, cluster_name=0x63f110 "rcc", fields=, sent_extra=, is_admin=is_admin@entry=true, sent_list=sent_list@entry=0x6dc030) at as_mysql_assoc.c:2141
2141    if (row2[ASSOC2_REQ_MTPN][0])

Hope that helps someone.

Best
Marcus

On 06.05.2020 at 09:49, Ben Polman wrote:
> On 06-05-2020 07:38, Chris Samuel wrote:
>
> We are experiencing exactly the same problem after mysql upgrade to 5.7.30,
> moving database to old mysql server running 5.6 solves the problem.
> Most likely downgrading mysql to 5.7.29 will work as well
>
> I have no clue which change in mysql-server is causing this
>
> best regards,
> Ben
>
>> On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:
>>
>>> Since this happens on a fresh new database, I just don't understand how I
>>> can get back to a basic functional state. This is exceedingly frustrating.
>>
>> I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this
>> only started when your colleague upgraded MySQL then this sounds like MySQL is
>> triggering this problem.
>>
>> We're running with MariaDB 10.x (from SLES15) without issues (our database is
>> huge).
>>
>> All the best,
>> Chris
Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
On 06-05-2020 07:38, Chris Samuel wrote:

We are experiencing exactly the same problem after a MySQL upgrade to 5.7.30; moving the database to an old MySQL server running 5.6 solves the problem. Most likely downgrading MySQL to 5.7.29 will work as well.

I have no clue which change in mysql-server is causing this.

best regards,
Ben

> On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:
>
>> Since this happens on a fresh new database, I just don't understand how I
>> can get back to a basic functional state. This is exceedingly frustrating.
>
> I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this
> only started when your colleague upgraded MySQL then this sounds like MySQL is
> triggering this problem.
>
> We're running with MariaDB 10.x (from SLES15) without issues (our database is
> huge).
>
> All the best,
> Chris

--
Dr. B.J.W. Polman, C&CZ, Radboud University Nijmegen.
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
e-mail: ben.pol...@science.ru.nl
Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:

> Since this happens on a fresh new database, I just don't understand how I
> can get back to a basic functional state. This is exceedingly frustrating.

I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this only started when your colleague upgraded MySQL then this sounds like MySQL is triggering this problem.

We're running with MariaDB 10.x (from SLES15) without issues (our database is huge).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] "sacctmgr add cluster" crashing slurmdbd
Hi,

I've just upgraded to slurm 19.05.5.

With either my old database, OR creating an entirely new database, I am unable to create a new 'cluster' entry in the database -- slurmdbd is segfaulting!

# sacctmgr add cluster test3
 Adding Cluster(s)
  Name           = test3
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to mn001:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ADD_CLUSTERS
 Problem adding clusters: Unspecified error
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused

Meanwhile, running "slurmdbd -D -v -v -v -v -v", I see:

[2020-05-05T18:17:19.503] debug4: 10(as_mysql_cluster.c:405) query
insert into txn_table (timestamp, action, name, actor, info) values (1588717037, 1405, 'test3', 'root', 'mod_time=1588717037, shares=1, grp_jobs=NULL, grp_jobs_accrue=NULL, grp_submit_jobs=NULL, grp_wall=NULL, max_jobs=NULL, max_jobs_accrue=NULL, min_prio_thresh=NULL, max_submit_jobs=NULL, max_wall_pj=NULL, priority=NULL, def_qos_id=NULL, qos=\',1,\', federation=\'\', fed_id=0, fed_state=0, features=\'\'');
slurmdbd: debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and acct='root';
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and acct='root';
slurmdbd: debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id, @qos, @delta_qos, @prio;
Segmentation fault (core dumped)

Since this happens on a fresh new database, I just don't understand how I can get back to a basic functional state. This is exceedingly frustrating.

Thanks for any hints.

--dustin