Hello,

Our GRID cluster has so far been using slurm version 14.11.4 on an el6 system and I wish to upgrade it to the newest version. The current cluster includes a master node (which is also a job execution node) and three other job execution nodes.
I performed the upgrade on the master node as follows, using RPMs built with rpmbuild from the slurm tarball that I downloaded from http://www.schedmd.com/#repos:

------------------------------
rpmbuild -tb slurm-15.08.1.tar.bz2
service slurm stop
service slurmdbd stop
service munge stop
rpm -Uvh ./slurm-slurmdbd-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-munge-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-plugins-15.08.1-1.el6.x86_64.rpm
rpm -Uvh ./slurm-sql-15.08.1-1.el6.x86_64.rpm
service munge start
service slurmdbd start
service slurm start
------------------------------

At the same time, the RPMs were also installed on the remaining three nodes and the slurm service was started on each. I already performed an upgrade from 2.5.4 to 14.11.4 last year and everything went smoothly. This time, however, slurmctld does not want to start:

-----
# service slurm restart
stopping slurmctld: [FAILED]
slurmctld is stopped
slurmctld is stopped
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmctld: [ OK ]
starting slurmd: [ OK ]
-----
# service slurm status
slurmctld is stopped
slurmctld is stopped
slurmd (pid 18922) is running...
-----

Here is the relevant transcript of the slurmd, slurmdbd and slurmctld logs from the master node:

----------------------------
-- slurmctld.log (master) --
----------------------------
[2015-10-12T18:03:46.276] Terminate signal (SIGINT or SIGTERM) received
[2015-10-12T18:03:46.343] Saving all slurm state
[2015-10-12T18:03:46.430] layouts: all layouts are now unloaded.
[2015-10-12T18:14:19.120] slurmctld version 15.08.1 started on cluster ung
[2015-10-12T18:14:20.866] error: _get_assoc_mgr_tres_list: no list was made.
[2015-10-12T18:14:20.866] error: Association database appears down, reading from state file.
[2015-10-12T18:14:20.866] fatal: load_assoc_mgr_state: Unable to run cache without TRES, please make sure you have a connection to your database to continue.

-------------------------
-- slurmd.log (master) --
-------------------------
[2015-10-12T18:03:47.386] Slurmd shutdown completing
[2015-10-12T18:14:19.132] Message aggregation disabled
[2015-10-12T18:14:19.133] CPU frequency setting not configured for this node
[2015-10-12T18:14:19.133] Resource spec: Reserved system memory limit not configured for this node
[2015-10-12T18:14:19.135] slurmd version 15.08.1 started
[2015-10-12T18:14:19.136] slurmd started on Mon, 12 Oct 2015 18:14:19 +0200
[2015-10-12T18:14:19.136] CPUs=24 Boards=1 Sockets=2 Cores=12 Threads=1 Memory=80580 TmpDisk=255936 Uptime=944341 CPUSpecList=(null)
[2015-10-12T18:14:28.138] error: Unable to register: Unable to contact slurm controller (connect failure)
[2015-10-12T18:14:38.139] error: Unable to register: Unable to contact slurm controller (connect failure)
[2015-10-12T18:14:48.141] error: Unable to register: Unable to contact slurm controller (connect failure)

---------------------------
-- slurmdbd.log (master) --
---------------------------
[2015-10-12T18:03:53.282] Terminate signal (SIGINT or SIGTERM) received
[2015-10-12T18:03:53.289] Unable to remove pidfile '/var/run/slurmdbd.pid': Permission denied
[2015-10-12T18:13:50.858] Updating database tables, this may take some time, do not stop the process.
[2015-10-12T18:13:50.884] adding column max_tres_pj after max_nodes_pj in table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column max_tres_mins_pj after max_tres_pj in table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column max_tres_run_mins after max_tres_mins_pj in table "ung_assoc_table"
[2015-10-12T18:13:50.884] adding column grp_tres after grp_nodes in table "ung_assoc_table"
[2015-10-12T18:13:50.885] adding column grp_tres_mins after grp_tres in table "ung_assoc_table"
[2015-10-12T18:13:50.885] adding column grp_tres_run_mins after grp_tres_mins in table "ung_assoc_table"
[2015-10-12T18:13:50.885] adding key account (acct(20)) to table "ung_assoc_table"
[2015-10-12T18:13:50.986] converting assoc table for ung
[2015-10-12T18:13:51.017] adding column max_tres_pj after max_submit_jobs_per_user in table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_pu after max_tres_pj in table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_mins_pj after max_tres_pu in table qos_table
[2015-10-12T18:13:51.017] adding column max_tres_run_mins_pu after max_tres_mins_pj in table qos_table
[2015-10-12T18:13:51.017] adding column min_tres_pj after max_tres_run_mins_pu in table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres after grp_submit_jobs in table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres_mins after grp_tres in table qos_table
[2015-10-12T18:13:51.017] adding column grp_tres_run_mins after grp_tres_mins in table qos_table
[2015-10-12T18:13:51.114] adding column id_tres after id_assoc in table "ung_assoc_usage_day_table"
[2015-10-12T18:13:51.224] adding column id_tres after id_assoc in table "ung_assoc_usage_hour_table"
[2015-10-12T18:13:51.383] adding column id_tres after id_assoc in table "ung_assoc_usage_month_table"
[2015-10-12T18:13:51.620] adding column id_tres after deleted in table "ung_usage_day_table"
[2015-10-12T18:13:51.791] adding column id_tres after deleted in table "ung_usage_hour_table"
[2015-10-12T18:13:52.382] adding column id_tres after deleted in table "ung_usage_month_table"
[2015-10-12T18:13:52.504] adding column tres after state in table "ung_event_table"
[2015-10-12T18:13:52.638] adding column tres_alloc after track_steps in table "ung_job_table"
[2015-10-12T18:13:54.001] adding column tres after time_end in table "ung_resv_table"
[2015-10-12T18:13:54.140] adding column tres_alloc after ave_disk_write in table "ung_step_table"
[2015-10-12T18:13:54.927] adding column id_tres after id_wckey in table "ung_wckey_usage_day_table"
[2015-10-12T18:13:55.026] adding column id_tres after id_wckey in table "ung_wckey_usage_hour_table"
[2015-10-12T18:13:55.135] adding column id_tres after id_wckey in table "ung_wckey_usage_month_table"
[2015-10-12T18:13:55.207] converting event table for ung
[2015-10-12T18:13:55.208] converting cluster usage tables for ung
[2015-10-12T18:13:56.814] converting job table for ung
[2015-10-12T18:13:57.593] converting reservation table for ung
[2015-10-12T18:13:57.594] converting step table for ung
[2015-10-12T18:13:58.469] Conversion done: success!
[2015-10-12T18:13:58.492] adding column max_tres_pn after max_tres_pj in table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpus_pj from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_nodes_pj from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpu_mins_pj from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column max_cpu_run_mins from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpus from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_mem from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_nodes from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpu_mins from table "ung_assoc_table"
[2015-10-12T18:13:58.492] dropping column grp_cpu_run_mins from table "ung_assoc_table"
[2015-10-12T18:13:58.761] dropping column consumed_energy from table "ung_assoc_usage_day_table"
[2015-10-12T18:13:58.896] dropping column consumed_energy from table "ung_assoc_usage_hour_table"
[2015-10-12T18:13:59.077] dropping column consumed_energy from table "ung_assoc_usage_month_table"
[2015-10-12T18:13:59.220] dropping column consumed_energy from table "ung_usage_day_table"
[2015-10-12T18:13:59.345] dropping column consumed_energy from table "ung_usage_hour_table"
[2015-10-12T18:13:59.829] dropping column consumed_energy from table "ung_usage_month_table"
[2015-10-12T18:13:59.951] dropping column cpu_count from table "ung_event_table"
[2015-10-12T18:14:00.216] adding column tres_req after tres_alloc in table "ung_job_table"
[2015-10-12T18:14:00.216] dropping column cpus_alloc from table "ung_job_table"
[2015-10-12T18:14:00.217] adding key rollup2 (time_end, time_eligible) to table "ung_job_table"
[2015-10-12T18:14:00.217] adding key nodes_alloc (nodes_alloc) to table "ung_job_table"
[2015-10-12T18:14:00.217] adding key sacct_def2 (id_user, time_end, time_eligible) to table "ung_job_table"
[2015-10-12T18:14:01.706] dropping column cpus from table "ung_resv_table"
[2015-10-12T18:14:01.803] adding column req_cpufreq_min after consumed_energy in table "ung_step_table"
[2015-10-12T18:14:01.803] adding column req_cpufreq_gov after req_cpufreq in table "ung_step_table"
[2015-10-12T18:14:01.803] dropping column cpus_alloc from table "ung_step_table"
[2015-10-12T18:14:02.979] dropping column resv_secs from table "ung_wckey_usage_day_table"
[2015-10-12T18:14:02.980] dropping column over_secs from table "ung_wckey_usage_day_table"
[2015-10-12T18:14:02.980] dropping column consumed_energy from table "ung_wckey_usage_day_table"
[2015-10-12T18:14:03.077] dropping column resv_secs from table "ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.078] dropping column over_secs from table "ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.078] dropping column consumed_energy from table "ung_wckey_usage_hour_table"
[2015-10-12T18:14:03.185] dropping column resv_secs from table "ung_wckey_usage_month_table"
[2015-10-12T18:14:03.185] dropping column over_secs from table "ung_wckey_usage_month_table"
[2015-10-12T18:14:03.185] dropping column consumed_energy from table "ung_wckey_usage_month_table"
[2015-10-12T18:14:03.525] adding column max_tres_pn after max_tres_pj in table qos_table
[2015-10-12T18:14:03.525] dropping column max_cpus_per_job from table qos_table
[2015-10-12T18:14:03.525] dropping column max_cpus_per_user from table qos_table
[2015-10-12T18:14:03.525] dropping column max_nodes_per_job from table qos_table
[2015-10-12T18:14:03.525] dropping column max_nodes_per_user from table qos_table
[2015-10-12T18:14:03.525] dropping column max_cpu_mins_per_job from table qos_table
[2015-10-12T18:14:03.525] dropping column max_cpu_run_mins_per_user from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpus from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_mem from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_nodes from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpu_mins from table qos_table
[2015-10-12T18:14:03.525] dropping column grp_cpu_run_mins from table qos_table
[2015-10-12T18:14:03.525] dropping column min_cpus_per_job from table qos_table
[2015-10-12T18:14:03.677] Accounting storage MYSQL plugin loaded
[2015-10-12T18:14:03.677] pidfile not locked, assuming no running daemon
[2015-10-12T18:14:03.688] slurmdbd version 15.08.1 started

And here is the relevant transcript of the slurmd log on one of the job execution nodes:

-------------------------------------
-- slurmd.log (job execution node) --
-------------------------------------
[2015-10-12T18:02:42.392] Slurmd shutdown completing
[2015-10-12T18:14:21.072] Node configuration differs from hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=8:4(hw) CoresPerSocket=8:8(hw) ThreadsPerCore=1:2(hw)
[2015-10-12T18:14:21.072] Message aggregation disabled
[2015-10-12T18:14:21.073] CPU frequency setting not configured for this node
[2015-10-12T18:14:21.073] Resource spec: Reserved system memory limit not configured for this node
[2015-10-12T18:14:21.076] slurmd version 15.08.1 started
[2015-10-12T18:14:21.076] slurmd started on Mon, 12 Oct 2015 18:14:21 +0200
[2015-10-12T18:14:21.077] CPUs=64 Boards=1 Sockets=8 Cores=8 Threads=1 Memory=129035 TmpDisk=201295 Uptime=944136 CPUSpecList=(null)
[2015-10-12T18:14:30.080] error: Unable to register: Unable to contact slurm controller (connect failure)
[2015-10-12T18:14:40.084] error: Unable to register: Unable to contact slurm controller (connect failure)
[2015-10-12T18:14:50.087] error: Unable to register: Unable to contact slurm controller (connect failure)
-------------------------------------

It seems to me that the slurm controller daemon has a problem accessing the slurm database. At the same time, I also upgraded the nordugrid software, ntp and fetch-crl to their newest versions.

Please let me know if there is any extra information you need to solve this problem.

Regards,
Gasper Kukec Mezek
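
P.S. Since the transcripts above are long, here is a minimal sketch of how the "error:" and "fatal:" entries can be pulled out of a log file. The function name slurm_log_problems and the example path are placeholders of mine, not part of slurm; the real log locations are whatever SlurmctldLogFile and SlurmdLogFile are set to in slurm.conf and LogFile in slurmdbd.conf.

```shell
#!/bin/sh
# Print only the "error:" and "fatal:" entries of a slurm log file.
# slurm_log_problems is a hypothetical helper name, not a slurm command.
slurm_log_problems() {
    # $1: path to a log file such as slurmctld.log
    grep -E '(error|fatal): ' "$1"
}

# Example call (the path is an assumption, adjust it to your configuration):
# slurm_log_problems /var/log/slurm/slurmctld.log
```

Run against the slurmctld log shown above, this would leave only the two "error:" lines and the "fatal: load_assoc_mgr_state" line.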