Well, you nailed it. Honestly, I'm a little surprised it was working to begin with.
In the DBD conf:

> -#DbdPort=7031
> +DbdPort=7031

And then in the slurm.conf:

> -#AccountingStoragePort=3306
> +AccountingStoragePort=7031

I'm not sure how my slurm.conf came to have the MySQL port (3306) commented out there. I did confirm that slurmdbd was listening on 6819 before, so I assumed that the default of 6819 would apply on both the dbd side and the "client" (ctld or otherwise) side, but somehow that wasn't the case. Either way, things feel like they're getting back to the right state, so thank you so much for pointing me in the correct direction.
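In case it's useful to anyone hitting the same symptom later, here's a quick way to cross-check the two sides. This is only a sketch: it assumes iproute2's ss is installed, and uses standard scontrol/sacctmgr options.

  # On the slurmdbd host: which port is the dbd actually listening on?
  ss -tlnp | grep slurmdbd

  # On each slurmctld host: which port will the controller dial?
  scontrol show config | grep -i AccountingStoragePort

  # Once the two agree and slurmctld has restarted, the cluster should
  # re-register with a real host and port:
  sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC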
Thanks,
Reed

> On Jun 15, 2022, at 7:50 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
> 
> Apologies for not having more concrete information available when I'm replying to you, but I figured a fast hint now might be better than a complete answer later.
> 
> Have a look at how the various daemons communicate with one another. This sounds to me like a firewall issue between the slurmctld and wherever the slurmdbd is running right now, or vice versa. The "sacctmgr show cluster" output is the giveaway: it is populated dynamically, not pulled from a config file.
> 
> I ran into this exact thing years ago, but can't remember where the firewall was the issue.
> 
> Sent from my iPhone
> 
>> On Jun 15, 2022, at 20:12, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> Hoping this is an easy answer.
>> 
>> My mysql instance somehow corrupted itself, and I'm having to purge it and start over.
>> This is OK, because the data in there isn't too valuable, and we aren't making use of associations or anything like that yet (no AccountingStorageEnforce).
>> 
>> That said, I've decided to put the dbd's mysql instance on my main database server, rather than in a small VM alongside the dbd.
>> Jobs are still submitting fine, and after adding the cluster back with `sacctmgr create cluster $cluster` the log firehose seems to have stopped.
>> The issue I'm mainly seeing now is in the dbd logs:
>> 
>>> [2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>> 
>> And if I run:
>> 
>>> $ sacctmgr show cluster
>>>    Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
>>> ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
>>>   $cluster                            0     0         1                                                                                           normal
>> 
>> I can see that ControlHost, ControlPort, and RPC are all missing.
>> So I'm not sure what I need to do to effectively reset my dbd.
>> Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
>> 
>> The only thing that has changed is the StorageHost in the dbd conf; I made the database and the user, and granted all on slurm_acct_db.*, on the new database server.
>> And I've verified that it has created the tables, and that I can connect from that host with the correct credentials.
>> 
>>> mysql> show tables;
>>> +----------------------------------+
>>> | Tables_in_slurm_acct_db          |
>>> +----------------------------------+
>>> | acct_coord_table                 |
>>> | acct_table                       |
>>> | $cluster_assoc_table             |
>>> | $cluster_assoc_usage_day_table   |
>>> | $cluster_assoc_usage_hour_table  |
>>> | $cluster_assoc_usage_month_table |
>>> | $cluster_event_table             |
>>> | $cluster_job_table               |
>>> | $cluster_last_ran_table          |
>>> | $cluster_resv_table              |
>>> | $cluster_step_table              |
>>> | $cluster_suspend_table           |
>>> | $cluster_usage_day_table         |
>>> | $cluster_usage_hour_table        |
>>> | $cluster_usage_month_table       |
>>> | $cluster_wckey_table             |
>>> | $cluster_wckey_usage_day_table   |
>>> | $cluster_wckey_usage_hour_table  |
>>> | $cluster_wckey_usage_month_table |
>>> | clus_res_table                   |
>>> | cluster_table                    |
>>> | convert_version_table            |
>>> | federation_table                 |
>>> | qos_table                        |
>>> | res_table                        |
>>> | table_defs_table                 |
>>> | tres_table                       |
>>> | txn_table                        |
>>> | user_table                       |
>>> +----------------------------------+
>>> 29 rows in set (0.01 sec)
>> 
>> Any tips are appreciated.
>> 
>> Slurm 21.08.7 on Ubuntu 20.04.
>> Slurmdbd and slurmctld(1) are running on one host, and slurmctld(2) is running on another host, which is the primary.
>> 
>> Thanks,
>> Reed
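For completeness, the "database, user, and grant" step described above looks roughly like this. This is a sketch only: the slurm user name, the hosts (dbd.example.com, db.example.com), and the password are placeholders, not values from this thread.

  # On the new database server, as the MySQL root user.
  # 'slurm', 'dbd.example.com', and 'changeme' are placeholders -- adjust for your site.
  mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS slurm_acct_db;"
  mysql -u root -p -e "CREATE USER IF NOT EXISTS 'slurm'@'dbd.example.com' IDENTIFIED BY 'changeme';"
  mysql -u root -p -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'dbd.example.com';"

  # From the slurmdbd host: verify the credentials work and that slurmdbd
  # has created its tables after a restart ('db.example.com' is a placeholder).
  mysql -u slurm -p -h db.example.com slurm_acct_db -e "SHOW TABLES;"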