Re: [slurm-users] DBD Reset

Reed Dier Wed, 15 Jun 2022 18:37:31 -0700

Well, you nailed it.

Honestly a little surprised it was working to begin with.


In the DBD conf (slurmdbd.conf)
> -#DbdPort=7031
> +DbdPort=7031

And then in the slurm.conf
> -#AccountingStoragePort=3306
> +AccountingStoragePort=7031

I’m not sure how my slurm.conf ended up with the MySQL port (3306) there,
commented out. I did confirm that the slurmdbd was listening on 6819 before,
so I assumed that the default would be 6819 on both the dbd and the “client”
(ctld or otherwise) side, but somehow that wasn’t the case?
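
For anyone finding this thread later, here is a minimal sketch of the sanity
check that would have caught the mismatch: read both config files and compare
the port slurmctld connects to (AccountingStoragePort) with the port slurmdbd
listens on (DbdPort). The helper names are made up for illustration, and the
6819 fallback is an assumption based on the port seen in this thread; the real
default is set at build time.

```python
import re

# Assumption: 6819 is the build-time default on this install (it is the
# port slurmdbd was seen listening on earlier in this thread).
DEFAULT_SLURMDBD_PORT = 6819


def read_setting(conf_text, key):
    """Return the last uncommented value of `key` in Slurm-style config text, or None."""
    value = None
    pattern = re.compile(r"^%s\s*=\s*(\S+)$" % re.escape(key), re.IGNORECASE)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        match = pattern.match(line)
        if match:
            value = match.group(1)
    return value


def ports_match(slurm_conf, slurmdbd_conf, default=DEFAULT_SLURMDBD_PORT):
    """Compare the ctld-side and dbd-side ports; return (ok, ctld_port, dbd_port)."""
    ctld_port = int(read_setting(slurm_conf, "AccountingStoragePort") or default)
    dbd_port = int(read_setting(slurmdbd_conf, "DbdPort") or default)
    return ctld_port == dbd_port, ctld_port, dbd_port
```

Pass it the contents of slurm.conf and slurmdbd.conf (wherever they live on
your hosts). A commented-out line counts as unset, so it falls back to the
assumed default.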

Either way, I do feel like things are getting back to the right state.
So thank you so much for pointing me in the correct direction.
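
To confirm the “right state” without eyeballing the wide table, a rough sketch
like the following could flag clusters that have not registered with the dbd
yet. It assumes the text comes from
`sacctmgr -nP show cluster format=Cluster,ControlHost,ControlPort`
(pipe-delimited, no header); the function name is hypothetical.

```python
def unregistered_clusters(parsable_output):
    """Return names of clusters whose ControlHost is empty or ControlPort is 0.

    Expects pipe-delimited rows of Cluster|ControlHost|ControlPort.
    """
    missing = []
    for line in parsable_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) < 3:
            continue  # not a row in the expected format
        name, host, port = (field.strip() for field in fields[:3])
        if not host or port in ("", "0"):
            missing.append(name)
    return missing
```

An empty list would mean every cluster has a control host and port recorded,
which is what was missing in the original output quoted below.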

Thanks,
Reed 

> On Jun 15, 2022, at 7:50 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
> 
> Apologies for not having more concrete information available when I’m 
> replying to you, but I figured maybe having a fast hint might be better. 
> 
> Have a look at how the various daemons communicate with one another. This 
> sounds to me like a firewall thing between maybe the SlurmCtld and where the 
> SlurmDBD is running right now, or vice-versa or something like that. The 
> “sacctmgr show cluster” output is a giveaway. That is populated dynamically, 
> not pulled from a config file exactly. 
> 
> I ran into this exact thing years ago, but can’t remember where the firewall 
> issue was. 
> 
> Sent from my iPhone
> 
>> On Jun 15, 2022, at 20:12, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>>  Hoping this is an easy answer.
>> 
>> My mysql instance somehow corrupted itself, and I’m having to purge and 
>> start over.
>> This is ok, because the data in there isn’t too valuable, and we aren’t 
>> making use of associations or anything like that yet (no 
>> AccountingStorageEnforce).
>> 
>> That said, I’ve decided to put the dbd’s mysql instance on my main database 
>> server, rather than in a small vm alongside the dbd.
>> Jobs are still submitting alright, and after adding the cluster back with 
>> `sacctmgr create cluster $cluster` it seems to have stopped the log firehose.
>> The issue I’m mainly seeing now is in the dbd logs:
>> 
>>> [2022-06-15T19:40:43.064] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:40:43.065] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:45:39.827] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:01.038] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:01.039] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:48:38.104] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:50:39.290] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>>> [2022-06-15T19:55:39.769] error: _add_registered_cluster: trying to register a cluster ($cluster) with no remote port
>> 
>> And if I run 
>>> $ sacctmgr show cluster
>>>     Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
>>>  ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
>>>    $cluster                            0     0         1                                                                                           normal
>> 
>> I can see the ControlHost, ControlPort, and RPC are all missing.
>> So I’m not sure what I need to do to effectively reset my dbd.
>> Also, $cluster in sacctmgr matches ClusterName=$cluster in my slurm.conf.
>> 
>> The only thing that has changed is the StorageHost in the dbd conf, and I 
>> created the database and user, and granted all on slurm_acct_db.*, on the 
>> new database server.
>> And I’ve verified that it has made tables, and that I can connect from the 
>> host with the correct credentials.
>> 
>>> mysql> show tables;
>>> +----------------------------------+
>>> | Tables_in_slurm_acct_db          |
>>> +----------------------------------+
>>> | acct_coord_table                 |
>>> | acct_table                       |
>>> | $cluster_assoc_table             |
>>> | $cluster_assoc_usage_day_table   |
>>> | $cluster_assoc_usage_hour_table  |
>>> | $cluster_assoc_usage_month_table |
>>> | $cluster_event_table             |
>>> | $cluster_job_table               |
>>> | $cluster_last_ran_table          |
>>> | $cluster_resv_table              |
>>> | $cluster_step_table              |
>>> | $cluster_suspend_table           |
>>> | $cluster_usage_day_table         |
>>> | $cluster_usage_hour_table        |
>>> | $cluster_usage_month_table       |
>>> | $cluster_wckey_table             |
>>> | $cluster_wckey_usage_day_table   |
>>> | $cluster_wckey_usage_hour_table  |
>>> | $cluster_wckey_usage_month_table |
>>> | clus_res_table                   |
>>> | cluster_table                    |
>>> | convert_version_table            |
>>> | federation_table                 |
>>> | qos_table                        |
>>> | res_table                        |
>>> | table_defs_table                 |
>>> | tres_table                       |
>>> | txn_table                        |
>>> | user_table                       |
>>> +----------------------------------+
>>> 29 rows in set (0.01 sec)
>> 
>> 
>> Any tips are appreciated.
>> 
>> 21.08.7 and Ubuntu 20.04.
>> Slurmdbd and slurmctld(1) are running on one host, and slurmctld(2) is 
>> running on another host, and is the primary.
>> 
>> Thanks,
>> Reed
