Re: Host Disconnect after upgrade 4.20.1

Levin Ng Mon, 30 Jun 2025 21:29:41 -0700

Hi Wei,

Those two VMs appear to be quite typical. I traced back the SQL query logs and 
found that the problematic host is connecting to the secondary management 
server, while the others are connecting to the primary server. When I compared 
the tables in the two HA databases, the row counts differed, so I suspect that 
the duplicate records were caused by database replication. I didn’t check the 
database replication at the beginning because there had been no replication 
errors since the HA database was established. I believe it’s time to explore 
alternative HA database options.
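
A rough sketch of the count comparison described above, for illustration only: it counts rows per table on two connections and reports the tables whose counts differ. Two in-memory SQLite databases stand in for DB1 and DB2; the sample rows and the helper names table_counts and count_mismatches are hypothetical, and a real check would run against the MySQL/MariaDB servers through a suitable driver. Matching counts would not prove that the data is identical.

```python
import sqlite3


def table_counts(conn, tables):
    """Return {table: row count} for each table on one connection.

    Table names are interpolated into the SQL, so pass trusted names only.
    """
    return {
        t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
        for t in tables
    }


def count_mismatches(conn_a, conn_b, tables):
    """Return {table: (count_a, count_b)} for tables whose counts differ."""
    a = table_counts(conn_a, tables)
    b = table_counts(conn_b, tables)
    return {t: (a[t], b[t]) for t in tables if a[t] != b[t]}


# Assumption: two in-memory SQLite databases stand in for DB1 and DB2.
db1 = sqlite3.connect(":memory:")
db2 = sqlite3.connect(":memory:")
for db in (db1, db2):
    db.execute(
        "CREATE TABLE cluster_details ("
        " id INTEGER PRIMARY KEY, cluster_id INTEGER, name TEXT, value TEXT)"
    )
    db.execute(
        "INSERT INTO cluster_details (cluster_id, name, value)"
        " VALUES (7, 'cpuOvercommitRatio', '12')"
    )

# Hypothetical drift: one extra copy of the row exists on the second side only.
db2.execute(
    "INSERT INTO cluster_details (cluster_id, name, value)"
    " VALUES (7, 'cpuOvercommitRatio', '12')"
)

mismatches = count_mismatches(db1, db2, ["cluster_details"])
print(mismatches)
```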


I don’t know how ACS management selects the database connection, so this 
could happen again any time the replication breaks. My configuration is quite 
typical: there are two management servers, each using a pair of MySQL servers 
in bidirectional replication. On the primary management server the ACS 
database is set to DB1 and database.cloud.replicas is set to DB2; on the 
secondary management server it is the other way round.

I could propose using INSERT IGNORE, or modifying the table to add a foreign 
key (FK) and constraints. What do you think?
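
To make the two options concrete, here is a minimal sketch under stated assumptions. The thread does not say which table held the duplicates, so the table details_demo, its columns, and the sample rows are hypothetical. SQLite stands in for MySQL/MariaDB, and its INSERT OR IGNORE is used as the nearest analogue of INSERT IGNORE. One point the sketch relies on: ignoring duplicate-key errors only helps once a unique key exists, and the cluster_details schema quoted below shows only non-unique (MUL) keys on cluster_id and name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical details table: one row per (resource_id, name) is the intent.
conn.execute(
    "CREATE TABLE details_demo ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " resource_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " value TEXT NOT NULL)"
)
rows = [
    (1, "cpuOvercommitRatio", "12"),
    (1, "cpuOvercommitRatio", "12"),  # duplicate, as replication might leave behind
    (1, "memoryOvercommitRatio", "1.0"),
]
conn.executemany(
    "INSERT INTO details_demo (resource_id, name, value) VALUES (?, ?, ?)", rows
)

# Step 1: list every (resource_id, name) pair that occurs more than once.
dupes = conn.execute(
    "SELECT resource_id, name, COUNT(*) FROM details_demo"
    " GROUP BY resource_id, name HAVING COUNT(*) > 1"
).fetchall()
print(dupes)

# Step 2: keep the lowest id of each pair and delete the other copies.
conn.execute(
    "DELETE FROM details_demo WHERE id NOT IN"
    " (SELECT MIN(id) FROM details_demo GROUP BY resource_id, name)"
)

# Step 3: a unique index makes any later duplicate insert a key violation.
conn.execute(
    "CREATE UNIQUE INDEX uq_details_demo ON details_demo (resource_id, name)"
)

# Step 4: with the index in place, the ignoring insert skips the duplicate.
conn.execute(
    "INSERT OR IGNORE INTO details_demo (resource_id, name, value)"
    " VALUES (1, 'cpuOvercommitRatio', '12')"
)
remaining = conn.execute("SELECT COUNT(*) FROM details_demo").fetchone()[0]
print(remaining)
```

Whether such a change belongs in the ACS schema or only in a local cleanup is a separate question; the sketch only shows the mechanics.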


Thank you again.
Regards,
Levin
On 1 Jul 2025 at 03:13 +0800, Wei ZHOU <ustcweiz...@gmail.com>, wrote:
> Hi Levin,
>
> Thanks for the update.
>
> Can you share more information about the two VMs?
>
> Kind regards,
> Wei
>
> On Mon, Jun 30, 2025 at 7:57 PM Levin Ng <levindec...@gmail.com> wrote:
>
> > Hi Wei,
> >
> > I’ve finally identified two VMs that are constantly causing the CPU
> > overcommit ratio to be recreated, which prevents the host from rejoining
> > the management server. I deleted the offending VMs and recreated them from
> > a template.
> >
> > Regards,
> > Levin
> > On 1 Jul 2025 at 01:30 +0800, Levin Ng <levindec...@gmail.com>, wrote:
> > > Hi Wei,
> > >
> > > Today, after restarting the management server, I got the same error for the
> > > same host that rejoined last time. Do you have any hints?
> > >
> > > 2025-07-01 01:11:19,119 ERROR [c.c.a.m.ClusteredAgentManagerImpl]
> > > (AgentConnectTaskPool-126:[ctx-ef0bee48]) (logid:08898435) Monitor
> > > ComputeCapacityListener says there is an error in the connect process for
> > > 125 due to Duplicate key cpuOvercommitRatio (attempted merging values 12
> > > and 12) java.lang.IllegalStateException: Duplicate key cpuOvercommitRatio
> > > (attempted merging values 12 and 12)
> > >
> > > Regards,
> > > Levin
> > >
> > > On 28 Jun 2025 at 16:49 +0800, Levin Ng <levindec...@gmail.com>, wrote:
> > > > Hi Wei,
> > > >
> > > > I did search the user_vm_details and vm_instance tables with the
> > > > host_id, but I couldn’t find any duplicate records. I just shut down the
> > > > running VMs on those hosts, removed the hosts, and let the agent re-join
> > > > the ACS. The problem is gone, thanks to your help again! It’s been really
> > > > frustrating with the recent ACS upgrade.
> > > >
> > > > Regards,
> > > > Levin
> > > > On 28 Jun 2025 at 16:34 +0800, Wei ZHOU <ustcweiz...@gmail.com>, wrote:
> > > > > Can you also check user_vm_details for the VMs running on the host?
> > > > >
> > > > >
> > > > > -Wei
> > > > >
> > > > > On Sat, Jun 28, 2025 at 10:04 AM Levin Ng <levindec...@gmail.com> wrote:
> > > > >
> > > > > > Hi Wei,
> > > > > >
> > > > > > Thanks again. The problematic cluster (cluster_id 7) contains only one
> > > > > > cpuOvercommitRatio row. Any idea?
> > > > > >
> > > > > > Regards,
> > > > > > Levin
> > > > > >
> > > > > > MariaDB [cloud]> select * from cluster_details;
> > > > > > +----+------------+-----------------------+-------+
> > > > > > | id | cluster_id | name                  | value |
> > > > > > +----+------------+-----------------------+-------+
> > > > > > |  1 |          1 | memoryOvercommitRatio | 1.0   |
> > > > > > |  2 |          1 | cpuOvercommitRatio    | 1.0   |
> > > > > > |  3 |          2 | memoryOvercommitRatio | 1.0   |
> > > > > > |  4 |          2 | cpuOvercommitRatio    | 1.0   |
> > > > > > |  5 |          3 | memoryOvercommitRatio | 1.0   |
> > > > > > |  6 |          3 | cpuOvercommitRatio    | 1.0   |
> > > > > > |  7 |          4 | memoryOvercommitRatio | 1.0   |
> > > > > > |  8 |          4 | cpuOvercommitRatio    | 1.0   |
> > > > > > |  9 |          5 | memoryOvercommitRatio | 1.0   |
> > > > > > | 10 |          5 | cpuOvercommitRatio    | 1.0   |
> > > > > > | 11 |          6 | memoryOvercommitRatio | 1.0   |
> > > > > > | 12 |          6 | cpuOvercommitRatio    | 1.0   |
> > > > > > | 13 |          7 | memoryOvercommitRatio | 1.0   |
> > > > > > | 14 |          7 | cpuOvercommitRatio    | 12    |
> > > > > > | 15 |          7 | resourceHAEnabled     | false |
> > > > > > | 16 |          8 | memoryOvercommitRatio | 1.3   |
> > > > > > | 17 |          8 | cpuOvercommitRatio    | 15.0  |
> > > > > > | 18 |          9 | memoryOvercommitRatio | 1.3   |
> > > > > > | 19 |          9 | cpuOvercommitRatio    | 15.0  |
> > > > > > | 20 |         10 | memoryOvercommitRatio | 1.3   |
> > > > > > | 21 |         10 | cpuOvercommitRatio    | 15.0  |
> > > > > > | 22 |         11 | memoryOvercommitRatio | 1.0   |
> > > > > > | 23 |         11 | cpuOvercommitRatio    | 12.0  |
> > > > > > +----+------------+-----------------------+-------+
> > > > > > 23 rows in set (0.001 sec)
> > > > > >
> > > > > > MariaDB [cloud]> desc cluster_details;
> > > > > > +------------+---------------------+------+-----+---------+----------------+
> > > > > > | Field      | Type                | Null | Key | Default | Extra          |
> > > > > > +------------+---------------------+------+-----+---------+----------------+
> > > > > > | id         | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
> > > > > > | cluster_id | bigint(20) unsigned | NO   | MUL | NULL    |                |
> > > > > > | name       | varchar(255)        | NO   | MUL | NULL    |                |
> > > > > > | value      | varchar(255)        | NO   |     | NULL    |                |
> > > > > > +------------+---------------------+------+-----+---------+----------------+
> > > > > > 4 rows in set (0.005 sec)
> > > > > >
> > > > > > On 28 Jun 2025 at 15:54 +0800, Wei ZHOU <ustcweiz...@gmail.com>, wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Maybe check cluster_details if there are multiple records with the same
> > > > > > > name "cpuOvercommitRatio" for a cluster.
> > > > > > >
> > > > > > >
> > > > > > > -Wei
> > > > > > >
> > > > > > > On Sat, Jun 28, 2025 at 9:37 AM Levin Ng <levindec...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I’m having trouble after the 4.20.1 upgrade: some of the existing hosts
> > > > > > > > are unable to reconnect to ACS management, and I found an error in the
> > > > > > > > log. Does anyone have an idea how to resolve it? Thank you very much.
> > > > > > > >
> > > > > > > > 2025-06-28 15:30:49,259 ERROR [c.c.a.m.ClusteredAgentManagerImpl]
> > > > > > > > (AgentConnectTaskPool-1092:[ctx-99bfb3dd]) (logid:b354f521) Monitor
> > > > > > > > ComputeCapacityListener says there is an error in the connect process
> > > > > > > > for 110 due to Duplicate key cpuOvercommitRatio (attempted merging
> > > > > > > > values 12 and 12) java.lang.IllegalStateException: Duplicate key
> > > > > > > > cpuOvercommitRatio (attempted merging values 12 and 12)
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Levin
> > > > > > > >
> > > > > >
> >
