After discovering issues with sreport and tracing them back to Slurm DB orphans 
(both running AND pending jobs), I worked to resolve that problem and cleared 
the orphans from the DB. However, I now find that rollups are still not running 
automatically.

For example, I triggered a re-rollup this morning by modifying the entries in 
the last_ran_table and restarting slurmdbd; the rollups were redone from 
April 1, 2016 through today at 06:00 AM.


[tcooper@cluster ~]# show_last_ran_table.sh
+---------------------+---------------------+---------------------+
| hourly_rollup       | daily_rollup        | monthly_rollup      |
+---------------------+---------------------+---------------------+
| 2017-06-09 06:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
+---------------------+---------------------+---------------------+
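(show_last_ran_table.sh is a local wrapper script; a hypothetical reconstruction of what it does, again assuming the accounting database is named `slurm_acct_db` — the stored values are Unix epochs, so they are converted for display:)

```shell
#!/bin/sh
# Hypothetical sketch of show_last_ran_table.sh: the *_rollup columns in
# cluster_last_ran_table hold Unix timestamps; render them as datetimes.
# The -t flag gives the bordered table output shown above.
mysql slurm_acct_db -t -e "
  SELECT FROM_UNIXTIME(hourly_rollup)  AS hourly_rollup,
         FROM_UNIXTIME(daily_rollup)   AS daily_rollup,
         FROM_UNIXTIME(monthly_rollup) AS monthly_rollup
    FROM cluster_last_ran_table;"
```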


Three hours later, the hourly rollups for [07-09]:00 AM had still not run. 
Restarting slurmdbd triggers the rollups, which then run successfully...


[tcooper@cluster ~]# service slurmdbd restart
stopping slurmdbd:                                         [  OK  ]
slurmdbd (pid 30526) is running...
slurmdbd (pid 30526) is running...
slurmdbd (pid 30526) is running...
slurmdbd (pid 30526) is running...
starting slurmdbd:                                         [  OK  ]


[tcooper@cluster ~]# tailf /var/log/slurm/slurmdbd.log | egrep -v "post user"
...
[2017-06-09T09:22:16.553] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:9
[2017-06-09T09:22:16.644] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:8
[2017-06-09T09:24:07.700] Terminate signal (SIGINT or SIGTERM) received
[2017-06-09T09:24:18.077] debug:  auth plugin for Munge 
(http://code.google.com/p/munge/) loaded
[2017-06-09T09:24:19.190] Accounting storage MYSQL plugin loaded
[2017-06-09T09:24:28.114] slurmdbd version 14.11.11 started
[2017-06-09T09:24:28.115] 0(as_mysql_rollup.c:622) cluster curr hour is now 
1497013200-1497016800
...
[2017-06-09T09:24:41.831] Warning: Note very large processing time from 
hourly_rollup for cluster: usec=13715990 began=09:24:28.115
[2017-06-09T09:24:41.831] 0(as_mysql_usage.c:376) query
update "cluster_last_ran_table" set hourly_rollup=1497024000
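The epoch values in those log lines decode to the rollup window, which lines up with the last_ran_table contents (the log's local zone appears to be US/Pacific, UTC-7; shown here in UTC, assuming GNU date):

```shell
# "curr hour is now 1497013200-1497016800" = 13:00-14:00 UTC,
# i.e. 06:00-07:00 PDT, picking up right after the 06:00 hourly_rollup mark.
date -u -d @1497013200 '+%Y-%m-%d %H:%M:%S'   # -> 2017-06-09 13:00:00
date -u -d @1497016800 '+%Y-%m-%d %H:%M:%S'   # -> 2017-06-09 14:00:00
# Value written back to hourly_rollup = 16:00 UTC = 09:00 PDT:
date -u -d @1497024000 '+%Y-%m-%d %H:%M:%S'   # -> 2017-06-09 16:00:00
```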
[2017-06-09T09:25:16.746] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:9
[2017-06-09T09:25:16.839] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:8
...


[tcooper@cluster ~]# show_last_ran_table.sh
+---------------------+---------------------+---------------------+
| hourly_rollup       | daily_rollup        | monthly_rollup      |
+---------------------+---------------------+---------------------+
| 2017-06-09 09:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
+---------------------+---------------------+---------------------+

Before our DB orphan issue, generation of rollups was NOT a problem.

Can anyone provide insight into where the 'trigger' for hourly rollups is 
supposed to come from, and/or whether this is a bug in Slurm 14.11.11 that is 
fixed in a later version?


Thanks,

Trevor Cooper
HPC Systems Programmer
San Diego Supercomputer Center, UCSD
9500 Gilman Drive, 0505
La Jolla, CA 92093-0505
