Apologies for not having more concrete information available when I’m replying 
to you, but I figured maybe having a fast hint might be better.

Have a look at how the various daemons communicate with one another. This 
sounds to me like a firewall problem, maybe between the slurmctld and wherever 
slurmdbd is running right now, or in the other direction. The empty fields in 
“sacctmgr show cluster” are the giveaway. Those are populated dynamically when 
the slurmctld registers with slurmdbd, not pulled from a config file.
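
A quick way to test the firewall theory is to try a plain TCP connection in each direction. This is only a minimal sketch: it assumes the default ports (6817 for SlurmctldPort, 6819 for DbdPort), and the hostnames in the usage comment are placeholders, so adjust both to match your slurm.conf and slurmdbd.conf.

```python
import socket

# Default Slurm ports (SlurmctldPort and DbdPort). These are assumptions;
# change them if your slurm.conf / slurmdbd.conf override the defaults.
SLURMCTLD_PORT = 6817
SLURMDBD_PORT = 6819


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hostnames are placeholders, not from the original post):
#   On the slurmctld host:  can_connect("dbd-host", SLURMDBD_PORT)
#   On the slurmdbd host:   can_connect("ctld-host", SLURMCTLD_PORT)
```

If either direction returns False while the daemon on the far side is up and listening, a firewall or routing rule between the two hosts is a likely suspect.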

I ran into this exact thing years ago, but I can’t remember exactly where the 
firewall turned out to be the problem.


On Jun 15, 2022, at 20:12, Reed Dier <reed.d...@focusvq.com> wrote:

 Hoping this is an easy answer.

My MySQL instance somehow corrupted itself, and I’m having to purge it and 
start over.
This is OK, because the data in there isn’t too valuable, and we aren’t making 
use of associations or anything like that yet (no AccountingStorageEnforce).

That said, I’ve decided to put the dbd’s mysql instance on my main database 
server, rather than in a small vm alongside the dbd.
Jobs are still submitting alright, and after adding the cluster back with 
`sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
The issue I’m mainly seeing now is in the dbd logs:

[2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
[2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port

And if I run
$ sacctmgr show cluster
    Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
 ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
   $cluster                            0     0         1                                                                                           normal

I can see that ControlHost is empty, and that ControlPort and RPC are both 0.
So I’m not sure what I need to do to effectively reset my dbd.
Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
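
For anyone wanting to check this in a script rather than by eye, here is a rough sketch that flags clusters whose controller has not registered. It assumes `sacctmgr` is on PATH and is called with the `-n` (no header) and `-P` (parsable2, pipe-delimited) options plus a `format=` field list; the function names are made up for illustration.

```python
import subprocess
from typing import Dict, List

# Columns requested from sacctmgr, in the order they are parsed below.
FIELDS = ["Cluster", "ControlHost", "ControlPort", "RPC"]


def parse_clusters(parsable_text: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited `sacctmgr -nP` output into one dict per cluster."""
    rows = []
    for line in parsable_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


def unregistered(rows: List[Dict[str, str]]) -> List[str]:
    """Return names of clusters with no ControlHost or a ControlPort of 0."""
    return [
        r["Cluster"]
        for r in rows
        if not r.get("ControlHost") or r.get("ControlPort", "0") in ("", "0")
    ]


def query_sacctmgr() -> str:
    """Run sacctmgr (assumed to be on PATH and able to reach slurmdbd)."""
    cmd = ["sacctmgr", "-nP", "show", "cluster", "format=" + ",".join(FIELDS)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `unregistered(parse_clusters(query_sacctmgr()))` on the dbd host would list every cluster in the state shown in the table above, which makes it easy to re-check after restarting slurmctld or changing firewall rules.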

The only thing that has changed is the StorageHost in the dbd conf, and I 
created the database and user, and granted all privileges on slurm_acct_db.*, 
on the new database server.
And I’ve verified that it has made tables, and that I can connect from the host 
with the correct credentials.

mysql> show tables;
+----------------------------------+
| Tables_in_slurm_acct_db          |
+----------------------------------+
| acct_coord_table                 |
| acct_table                       |
| $cluster_assoc_table             |
| $cluster_assoc_usage_day_table   |
| $cluster_assoc_usage_hour_table  |
| $cluster_assoc_usage_month_table |
| $cluster_event_table             |
| $cluster_job_table               |
| $cluster_last_ran_table          |
| $cluster_resv_table              |
| $cluster_step_table              |
| $cluster_suspend_table           |
| $cluster_usage_day_table         |
| $cluster_usage_hour_table        |
| $cluster_usage_month_table       |
| $cluster_wckey_table             |
| $cluster_wckey_usage_day_table   |
| $cluster_wckey_usage_hour_table  |
| $cluster_wckey_usage_month_table |
| clus_res_table                   |
| cluster_table                    |
| convert_version_table            |
| federation_table                 |
| qos_table                        |
| res_table                        |
| table_defs_table                 |
| tres_table                       |
| txn_table                        |
| user_table                       |
+----------------------------------+
29 rows in set (0.01 sec)

Any tips are appreciated.

21.08.7 and Ubuntu 20.04.
Slurmdbd and slurmctld(1) are running on one host, and slurmctld(2) is running 
on another host, and is the primary.

Thanks,
Reed
