Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-06 Thread Marcus Wagner

Sorry, I forgot to mention: we are running Slurm 18.08.7.

I just saw, in an earlier coredump, that there is another (earlier) line 
involved:


2136: if (row2[ASSOC2_REQ_MTPJ][0])

the corresponding mysql response was:

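+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj  | @mja | @mpt | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|     990 |  800 | NULL | NULL | 1000 |  1440 | NULL  | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |
+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
1 row in set (0.00 sec)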


So here @mtpj is NULL, which is why line 2136 crashed. In the other coredump
@mtpj was "1=8", so it was not NULL, but @mtpn was NULL, so it segfaulted in


2141: if (row2[ASSOC2_REQ_MTPN][0])
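
For illustration, here is a minimal sketch of why both lines crash and how a NULL-safe check could look. The MySQL C API represents a SQL NULL column as a NULL pointer in the row, so reading row2[...][0] dereferences NULL. The row_t typedef, the REQ_* indexes and the helpers field_is_set() and report_limits() are hypothetical names made up for this example. This is not the actual Slurm code or its fix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for MYSQL_ROW, which the MySQL C API defines as char **.
 * A SQL NULL column shows up as a NULL pointer in the row. */
typedef char **row_t;

/* Hypothetical column indexes, for illustration only; the real
 * ASSOC2_REQ_* enum lives in as_mysql_assoc.c. */
enum { REQ_MTPJ, REQ_MTPN, REQ_COUNT };

/* Hypothetical helper: true only if the column is non-NULL and non-empty.
 * Testing the pointer first avoids the NULL dereference that
 * "if (row[idx][0])" performs when the column is SQL NULL. */
static bool field_is_set(row_t row, size_t idx)
{
    return row[idx] != NULL && row[idx][0] != '\0';
}

/* Prints the limits that are set and returns how many there were. */
static int report_limits(row_t row)
{
    int set = 0;

    if (field_is_set(row, REQ_MTPJ)) {
        printf("max_tres_pj = %s\n", row[REQ_MTPJ]);
        set++;
    }
    if (field_is_set(row, REQ_MTPN)) {
        printf("max_tres_pn = %s\n", row[REQ_MTPN]);
        set++;
    }
    return set;
}
```

With a row like the one from the second coredump (@mtpj = "1=8", @mtpn = NULL), report_limits() would report one limit and skip the NULL column instead of crashing.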

Could anyone whose slurmdbd is not segfaulting please run the call
directly in the database (it is a procedure generated by slurmdbd that
collects the parent limits of an association) and report the result here?



Best
Marcus


Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-06 Thread Marcus Wagner

Hi, same here :/

The segfault happens after this procedure call in MySQL:

call get_parent_limits('assoc_table', 'rwth0515', 'rcc', 0); select 
@par_id, @mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, 
@def_qos_id, @qos, @delta_qos;


The mysql answer is:

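+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
| @par_id | @mj  | @mja | @mpt | @msj | @mwpj | @mtpj | @mtpn | @mtmpj | @mtrm | @def_qos_id | @qos | @delta_qos |
+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+
|    5312 |  800 | NULL | NULL | 1000 |  1440 | 1=8   | NULL  | NULL   | NULL  |        NULL | ,1,  | NULL       |
+---------+------+------+------+------+-------+-------+-------+--------+-------+-------------+------+------------+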

the segfault happens in as_mysql_assoc.c:

#0  0x2ae3dea6c05a in _cluster_get_assocs 
(mysql_conn=mysql_conn@entry=0x2ae3f4000d70, 
user=user@entry=0x2ae3e1feca90, 
assoc_cond=assoc_cond@entry=0x2ae3f40009f0, cluster_name=0x63f110 "rcc",
fields=, sent_extra=, 
is_admin=is_admin@entry=true, sent_list=sent_list@entry=0x6dc030) at 
as_mysql_assoc.c:2141

2141        if (row2[ASSOC2_REQ_MTPN][0])

Hope that helps someone.

Best
Marcus



Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-06 Thread Ben Polman
On 06-05-2020 07:38, Chris Samuel wrote:

We are experiencing exactly the same problem after upgrading MySQL to 5.7.30.
Moving the database to an old MySQL server running 5.6 solves the problem.
Most likely downgrading MySQL to 5.7.29 will work as well.

I have no clue which change in mysql-server is causing this.

best regards,
Ben

> On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:
>
>> Since this happens on a fresh new database, I just don't understand how I
>> can get back to a basic functional state.  This is exceedingly frustrating.
>
> I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this
> only started when your colleague upgraded MySQL then this sounds like MySQL is
> triggering this problem.
>
> We're running with MariaDB 10.x (from SLES15) without issues (our database is 
> huge).
>
> All the best,
> Chris


-- 
-
Dr. B.J.W. Polman, C&CZ, Radboud University Nijmegen.
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
e-mail: ben.pol...@science.ru.nl




Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-05 Thread Chris Samuel
On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote:

> Since this happens on a fresh new database, I just don't understand how I
> can get back to a basic functional state.  This is exceedingly frustrating.

I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this 
only started when your colleague upgraded MySQL then this sounds like MySQL is 
triggering this problem.

We're running with MariaDB 10.x (from SLES15) without issues (our database is 
huge).

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] "sacctmgr add cluster" crashing slurmdbd

2020-05-05 Thread Dustin Lang
Hi,

I've just upgraded to slurm 19.05.5.

With either my old database, OR creating an entirely new database, I am
unable to create a new 'cluster' entry in the database -- slurmdbd is
segfaulting!

# sacctmgr add cluster test3
 Adding Cluster(s)
  Name   = test3
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to mn001:6819: Connection refused
sacctmgr: error: slurmdbd: Getting response to message type: DBD_ADD_CLUSTERS
 Problem adding clusters: Unspecified error
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused

Meanwhile, running "slurmdbd -D -v -v -v -v -v", I see

[2020-05-05T18:17:19.503] debug4: 10(as_mysql_cluster.c:405) query
insert into txn_table (timestamp, action, name, actor, info) values
(1588717037, 1405, 'test3', 'root', 'mod_time=1588717037, shares=1,
grp_jobs=NULL, grp_jobs_accrue=NULL, grp_submit_jobs=NULL, grp_wall=NULL,
max_jobs=NULL, max_jobs_accrue=NULL, min_prio_thresh=NULL,
max_submit_jobs=NULL, max_wall_pj=NULL, priority=NULL, def_qos_id=NULL,
qos=\',1,\', federation=\'\', fed_id=0, fed_state=0, features=\'\'');
slurmdbd: debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:635) query
select id_assoc from "test3_assoc_table" where user='' and deleted = 0 and
acct='root';
slurmdbd: debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
[2020-05-05T18:17:19.506] debug4: 10(as_mysql_assoc.c:714) query
call get_parent_limits('assoc_table', 'root', 'test3', 0); select @par_id,
@mj, @mja, @mpt, @msj, @mwpj, @mtpj, @mtpn, @mtmpj, @mtrm, @def_qos_id,
@qos, @delta_qos, @prio;
Segmentation fault (core dumped)


Since this happens on a fresh new database, I just don't understand how I
can get back to a basic functional state.  This is exceedingly frustrating.

Thanks for any hints.

--dustin