Hello,

Our GRID cluster has so far been running Slurm 14.11.4 on an el6 system,
and I want to upgrade it to the newest version. The current cluster
consists of a master node (which is also a job execution node) and three
other job execution nodes.

I built the RPMs with rpmbuild from the Slurm tarball that I downloaded
from http://www.schedmd.com/#repos and then performed the upgrade on the
master node as follows:
------------------------------
rpmbuild -tb slurm-15.08.1.tar.bz2
service slurm stop
service slurmdbd stop
service munge stop
rpm -Uvh ./slurm-slurmdbd-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-munge-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-plugins-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-sql-15.08.1-1.el6.x86_64.rpm
service munge start
service slurmdbd start
service slurm start
------------------------------

At the same time, the RPMs were also installed on the remaining three
nodes and the slurm service was started on each. I performed an upgrade
from 2.5.4 to 14.11.4 last year and everything went smoothly. This time,
however, slurmctld fails to start:
-----
# service slurm restart
stopping slurmctld:                                        [FAILED]
slurmctld is stopped
slurmctld is stopped
stopping slurmd:                                           [  OK  ]
slurmd is stopped
starting slurmctld:                                        [  OK  ]
starting slurmd:                                           [  OK  ]
-----
# service slurm status
slurmctld is stopped
slurmctld is stopped
slurmd (pid 18922) is running...
-----

Here are the relevant excerpts from the slurmctld, slurmd and slurmdbd
logs on the master node:
----------------------------
-- slurmctld.log (master) --
----------------------------
[2015-10-12T18:03:46.276] Terminate signal (SIGINT or SIGTERM) received
[2015-10-12T18:03:46.343] Saving all slurm state
[2015-10-12T18:03:46.430] layouts: all layouts are now unloaded.
[2015-10-12T18:14:19.120] slurmctld version 15.08.1 started on cluster ung
[2015-10-12T18:14:20.866] error: _get_assoc_mgr_tres_list: no list was made.
[2015-10-12T18:14:20.866] error: Association database appears down,
reading from state file.
[2015-10-12T18:14:20.866] fatal: load_assoc_mgr_state: Unable to run cache
without TRES, please make sure you have a connection to your database to
continue.

-------------------------
-- slurmd.log (master) --
-------------------------
[2015-10-12T18:03:47.386] Slurmd shutdown completing
[2015-10-12T18:14:19.132] Message aggregation disabled
[2015-10-12T18:14:19.133] CPU frequency setting not configured for this node
[2015-10-12T18:14:19.133] Resource spec: Reserved system memory limit not
configured for this node
[2015-10-12T18:14:19.135] slurmd version 15.08.1 started
[2015-10-12T18:14:19.136] slurmd started on Mon, 12 Oct 2015 18:14:19 +0200
[2015-10-12T18:14:19.136] CPUs=24 Boards=1 Sockets=2 Cores=12 Threads=1
Memory=80580 TmpDisk=255936 Uptime=944341 CPUSpecList=(null)
[2015-10-12T18:14:28.138] error: Unable to register: Unable to contact
slurm controller (connect failure)
[2015-10-12T18:14:38.139] error: Unable to register: Unable to contact
slurm controller (connect failure)
[2015-10-12T18:14:48.141] error: Unable to register: Unable to contact
slurm controller (connect failure)

---------------------------
-- slurmdbd.log (master) --
---------------------------
[2015-10-12T18:03:53.282] Terminate signal (SIGINT or SIGTERM) received
[2015-10-12T18:03:53.289] Unable to remove pidfile
'/var/run/slurmdbd.pid': Permission denied
[2015-10-12T18:13:50.858] Updating database tables, this may take some
time, do not stop the process.
[2015-10-12T18:13:50.884] adding column max_tres_pj after max_nodes_pj in
table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column max_tres_mins_pj after max_tres_pj
in table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column max_tres_run_mins after
max_tres_mins_pj in table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column grp_tres after grp_nodes in table
"ung_assoc_table"
[2015-10-12T18:13:50.885] adding column grp_tres_mins after grp_tres in
table "ung_assoc_table"
[2015-10-12T18:13:50.885] adding column grp_tres_run_mins after
grp_tres_mins in table "ung_assoc_table"
[2015-10-12T18:13:50.885] adding key account (acct(20)) to table
"ung_assoc_table"
[2015-10-12T18:13:50.986] converting assoc table for ung
[2015-10-12T18:13:51.017] adding column max_tres_pj after
max_submit_jobs_per_user in table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_pu after max_tres_pj in
table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_mins_pj after max_tres_pu
in table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_run_mins_pu after
max_tres_mins_pj in table qos_table
[2015-10-12T18:13:51.017] adding column min_tres_pj after
max_tres_run_mins_pu in table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres after grp_submit_jobs in
table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres_mins after grp_tres in
table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres_run_mins after
grp_tres_mins in table qos_table
[2015-10-12T18:13:51.114] adding column id_tres after id_assoc in table
"ung_assoc_usage_day_table"
[2015-10-12T18:13:51.224] adding column id_tres after id_assoc in table
"ung_assoc_usage_hour_table"
[2015-10-12T18:13:51.383] adding column id_tres after id_assoc in table
"ung_assoc_usage_month_table"
[2015-10-12T18:13:51.620] adding column id_tres after deleted in table
"ung_usage_day_table"
[2015-10-12T18:13:51.791] adding column id_tres after deleted in table
"ung_usage_hour_table"
[2015-10-12T18:13:52.382] adding column id_tres after deleted in table
"ung_usage_month_table"
[2015-10-12T18:13:52.504] adding column tres after state in table
"ung_event_table"
[2015-10-12T18:13:52.638] adding column tres_alloc after track_steps in
table "ung_job_table"
[2015-10-12T18:13:54.001] adding column tres after time_end in table
"ung_resv_table"
[2015-10-12T18:13:54.140] adding column tres_alloc after ave_disk_write in
table "ung_step_table"
[2015-10-12T18:13:54.927] adding column id_tres after id_wckey in table
"ung_wckey_usage_day_table"
[2015-10-12T18:13:55.026] adding column id_tres after id_wckey in table
"ung_wckey_usage_hour_table"
[2015-10-12T18:13:55.135] adding column id_tres after id_wckey in table
"ung_wckey_usage_month_table"
[2015-10-12T18:13:55.207] converting event table for ung
[2015-10-12T18:13:55.208] converting cluster usage tables for ung
[2015-10-12T18:13:56.814] converting job table for ung
[2015-10-12T18:13:57.593] converting reservation table for ung
[2015-10-12T18:13:57.594] converting step table for ung
[2015-10-12T18:13:58.469] Conversion done: success!
[2015-10-12T18:13:58.492] adding column max_tres_pn after max_tres_pj in
table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpus_pj from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_nodes_pj from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpu_mins_pj from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpu_run_mins from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpus from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_mem from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_nodes from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpu_mins from table
"ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpu_run_mins from table
"ung_assoc_table"
[2015-10-12T18:13:58.761] dropping column consumed_energy from table
"ung_assoc_usage_day_table"
[2015-10-12T18:13:58.896] dropping column consumed_energy from table
"ung_assoc_usage_hour_table"
[2015-10-12T18:13:59.077] dropping column consumed_energy from table
"ung_assoc_usage_month_table"
[2015-10-12T18:13:59.220] dropping column consumed_energy from table
"ung_usage_day_table"
[2015-10-12T18:13:59.345] dropping column consumed_energy from table
"ung_usage_hour_table"
[2015-10-12T18:13:59.829] dropping column consumed_energy from table
"ung_usage_month_table"
[2015-10-12T18:13:59.951] dropping column cpu_count from table
"ung_event_table"
[2015-10-12T18:14:00.216] adding column tres_req after tres_alloc in table
"ung_job_table"
[2015-10-12T18:14:00.216] dropping column cpus_alloc from table
"ung_job_table"
[2015-10-12T18:14:00.217] adding key rollup2 (time_end, time_eligible) to
table "ung_job_table"
[2015-10-12T18:14:00.217] adding key nodes_alloc (nodes_alloc) to table
"ung_job_table"
[2015-10-12T18:14:00.217] adding key sacct_def2 (id_user, time_end,
time_eligible) to table "ung_job_table"
[2015-10-12T18:14:01.706] dropping column cpus from table "ung_resv_table"
[2015-10-12T18:14:01.803] adding column req_cpufreq_min after
consumed_energy in table "ung_step_table"
[2015-10-12T18:14:01.803] adding column req_cpufreq_gov after req_cpufreq
in table "ung_step_table"
[2015-10-12T18:14:01.803] dropping column cpus_alloc from table
"ung_step_table"
[2015-10-12T18:14:02.979] dropping column resv_secs from table
"ung_wckey_usage_day_table"
[2015-10-12T18:14:02.980] dropping column over_secs from table
"ung_wckey_usage_day_table"
[2015-10-12T18:14:02.980] dropping column consumed_energy from table
"ung_wckey_usage_day_table"
[2015-10-12T18:14:03.077] dropping column resv_secs from table
"ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.078] dropping column over_secs from table
"ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.078] dropping column consumed_energy from table
"ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.185] dropping column resv_secs from table
"ung_wckey_usage_month_table"
[2015-10-12T18:14:03.185] dropping column over_secs from table
"ung_wckey_usage_month_table"
[2015-10-12T18:14:03.185] dropping column consumed_energy from table
"ung_wckey_usage_month_table"
[2015-10-12T18:14:03.525] adding column max_tres_pn after max_tres_pj in
table qos_table
[2015-10-12T18:14:03.525] dropping column max_cpus_per_job from table
qos_table
[2015-10-12T18:14:03.525] dropping column max_cpus_per_user from table
qos_table
[2015-10-12T18:14:03.525] dropping column max_nodes_per_job from table
qos_table
[2015-10-12T18:14:03.525] dropping column max_nodes_per_user from table
qos_table
[2015-10-12T18:14:03.525] dropping column max_cpu_mins_per_job from table
qos_table
[2015-10-12T18:14:03.525] dropping column max_cpu_run_mins_per_user from
table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpus from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_mem from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_nodes from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpu_mins from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpu_run_mins from table
qos_table
[2015-10-12T18:14:03.525] dropping column min_cpus_per_job from table
qos_table
[2015-10-12T18:14:03.677] Accounting storage MYSQL plugin loaded
[2015-10-12T18:14:03.677] pidfile not locked, assuming no running daemon
[2015-10-12T18:14:03.688] slurmdbd version 15.08.1 started

And here is the relevant excerpt from the slurmd log on one of the job
execution nodes:
-------------------------------------
-- slurmd.log (job execution node) --
-------------------------------------
[2015-10-12T18:02:42.392] Slurmd shutdown completing
[2015-10-12T18:14:21.072] Node configuration differs from hardware:
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=8:4(hw)
CoresPerSocket=8:8(hw)
ThreadsPerCore=1:2(hw)
[2015-10-12T18:14:21.072] Message aggregation disabled
[2015-10-12T18:14:21.073] CPU frequency setting not configured for this node
[2015-10-12T18:14:21.073] Resource spec: Reserved system memory limit not
configured for this node
[2015-10-12T18:14:21.076] slurmd version 15.08.1 started
[2015-10-12T18:14:21.076] slurmd started on Mon, 12 Oct 2015 18:14:21 +0200
[2015-10-12T18:14:21.077] CPUs=64 Boards=1 Sockets=8 Cores=8 Threads=1
Memory=129035 TmpDisk=201295 Uptime=944136 CPUSpecList=(null)
[2015-10-12T18:14:30.080] error: Unable to register: Unable to contact
slurm controller (connect failure)
[2015-10-12T18:14:40.084] error: Unable to register: Unable to contact
slurm controller (connect failure)
[2015-10-12T18:14:50.087] error: Unable to register: Unable to contact
slurm controller (connect failure)
-------------------------------------
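
Because the transcripts above are long, the entries that matter can be
pulled out with a small filter. This is only a sketch: the function name
slurm_log_problems is made up for illustration, and it assumes that each
entry in the original log file sits on a single line starting with a
bracketed timestamp (the entries above appear wrapped in this mail).

```shell
#!/bin/sh
# Hypothetical helper: print only the "error:" and "fatal:" entries
# of a Slurm log file.
# Assumes one entry per line, each starting with "[timestamp] ".
slurm_log_problems() {
    grep -E '^\[[^]]+\] (error|fatal):' "$1"
}

# Example call; the log path is an assumption, use the one that your
# LogFile setting points to.
# slurm_log_problems /var/log/slurmctld.log
```

Applied to the slurmctld log on the master node, this should leave only
the two "error:" entries and the "fatal:" entry shown above.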

Judging by the fatal error in slurmctld.log, it seems to me that the slurm
controller daemon cannot get the TRES information from the slurm database,
even though slurmdbd.log reports a successful conversion and startup. At
the same time, I also upgraded the nordugrid software, ntp and fetch-crl
to their newest versions. Please let me know if you need any additional
information to solve this problem.
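
A minimal sketch of checks that could narrow this down, assuming sacctmgr
is in the PATH on the master node. The function name
check_slurm_accounting and the default slurm.conf location are
assumptions for illustration, not something taken from this cluster.
sacctmgr talks to slurmdbd directly, so it can be used while slurmctld
is down.

```shell
#!/bin/sh
# Hypothetical helper: show what slurmctld would need from slurmdbd.
check_slurm_accounting() {
    # Assumed default location of slurm.conf; pass another path as $1.
    conf="${1:-/etc/slurm/slurm.conf}"

    # Stop early if the Slurm accounting client is not installed here.
    if ! command -v sacctmgr >/dev/null 2>&1; then
        echo "missing: sacctmgr"
        return 1
    fi

    # Which accounting storage host/port/type is configured?
    grep -i '^AccountingStorage' "$conf"

    # Does slurmdbd answer, and is the cluster registered?
    sacctmgr -n show cluster

    # Does slurmdbd return any TRES records (new in 15.08)?
    sacctmgr -n show tres
}

# Example call:
# check_slurm_accounting /etc/slurm/slurm.conf
```

If the last command prints nothing or fails, that would be consistent
with the "no list was made" error in slurmctld.log.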

Regards,
Gasper Kukec Mezek
