Re: All 4 hosts disconnected in Alert state due to ComputeCapacityListener NULL: how to fix?

Janis Viklis | Files.fm Wed, 03 Jul 2024 06:36:26 -0700

If I set a valid management server id, it reverts to NULL after the next host check cycle.

I wonder whether it could somehow be related to total or cluster resources (but I tried to find and check/change all overprovisioning multipliers).

2024-07-03 16:30:16,036 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 32 VMs on host 248
2024-07-03 16:30:16,039 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 0 VMs are Migrating from host 248
2024-07-03 16:30:16,138 ERROR [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Caught exception in recalculating capacity
java.lang.NullPointerException
        at com.cloud.capacity.CapacityManagerImpl.updateCapacityForHost(CapacityManagerImpl.java:677)
        at com.cloud.alert.AlertManagerImpl.recalculateCapacity(AlertManagerImpl.java:279)
        at com.cloud.alert.AlertManagerImpl.checkForAlerts(AlertManagerImpl.java:432)
        at com.cloud.alert.AlertManagerImpl$CapacityChecker.runInContext(AlertManagerImpl.java:422)
        at org.apache.cloudstack.managed.context.ManagedContextTimerTask$1.runInContext(ManagedContextTimerTask.java:30)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at org.apache.cloudstack.managed.context.ManagedContextTimerTask.run(ManagedContextTimerTask.java:32)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
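For anyone comparing traces like this across log files, here is a minimal Python sketch that pulls the frames out of the trace above so the innermost one (where the NullPointerException was raised) can be read off directly. The helper name parse_frames is made up for illustration and is not part of CloudStack; only the first three frames are copied in.

```python
import re

# First lines of the stack trace from the log above.
TRACE = """java.lang.NullPointerException
        at com.cloud.capacity.CapacityManagerImpl.updateCapacityForHost(CapacityManagerImpl.java:677)
        at com.cloud.alert.AlertManagerImpl.recalculateCapacity(AlertManagerImpl.java:279)
        at com.cloud.alert.AlertManagerImpl.checkForAlerts(AlertManagerImpl.java:432)
"""

# One Java stack frame per line: "at <class>.<method>(<File>.java:<line>)".
FRAME_RE = re.compile(
    r"^\s*at\s+([\w.$]+)\.([\w$<>]+)\((\w+\.java):(\d+)\)", re.MULTILINE
)


def parse_frames(trace):
    """Return (class, method, file, line) tuples, innermost frame first."""
    return [(cls, meth, src, int(num)) for cls, meth, src, num in FRAME_RE.findall(trace)]


frames = parse_frames(TRACE)
cls, meth, src, num = frames[0]
print(f"NullPointerException raised in {cls}.{meth} ({src}:{num})")
```

The innermost frame only says where the exception surfaced (CapacityManagerImpl.java line 677 in this build), not which value was null.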

Janis

On 2024-07-03 16:27, Nux wrote:

Hello,

What happens if you update the 4 problematic hosts with a valid mgmt id?

On 2024-07-03 14:23, Janis Viklis | Files.fm wrote:

mgmt_server_id is NULL just for those 4 hosts; the other hosts are fine.
Looking at the logs, the cs1 management server starts by connecting
the storage pools:

2024-07-01 16:31:29,617 DEBUG [c.c.s.l.StoragePoolMonitor]
(AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Host 248 connected,
connecting host to shared pool id 152 and sending storage pool...

------------------------------------------------------------------------


DB Tables: cloud.host and cloud.mshost:

SELECT id, status, Type, mgmt_server_id FROM cloud.host WHERE id IN
(74, 77, 170, 248, 254, 257, 260):

id   status  Type     mgmt_server_id
260  Alert   Routing  NULL
257  Alert   Routing  NULL
254  Alert   Routing  NULL
248  Alert   Routing  NULL
170  Up      Routing  95534596974
77   Up      Routing  95534596974
74   Up      Routing  95534596974

cloud.mshost:

id   msid         runid          name            uuid                                  state  version   service_ip   service_port  last_update          removed  alert_count
179  95534596974  1720012401793  localhost       b34f493a-42c0-47a8-ada4-04be4cdd8c49  Up     4.13.1.0  10.10.10.11  9090          2024-07-03 13:13:47  NULL     0
178  95536034244  1718828790629  cs2.failiem.lv  70420423-b362-4335-b083-8ad1342ce485  Down   4.13.1.0  10.10.10.12  9090          2024-06-19 20:39:19  NULL     1
176  95530190206  1719663483676  localhost       96a155b6-7041-48ff-9f20-268ea77c5098  Down   4.13.1.0  10.10.10.13  9090          2024-06-29 12:24:28  NULL     1
175  95536505104  1719666507512  localhost       c8e6fefa-7464-4bb7-a379-5eafb55c666d  Down   4.13.1.0  10.10.10.11  9090          2024-06-29 13:38:00  NULL     0
174  95534962877  1682516323955  localhost       45a057c6-6d50-41a9-bbad-cab370c01832  Down   4.13.1.0  10.10.10.11  9090          2024-06-15 08:36:06  NULL     1
172  95529749065  1658756353180  localhost       535277d3-33df-4b2a-9f1d-07f05084d473  Down   4.13.1.0  10.10.10.13  9090          2024-06-15 07:53:32  NULL     1
170  95529797928  1603725530943  localhost       5892611f-7af8-4686-8818-95ade086e6cf  Down   4.13.1.0  10.10.10.13  9090          2020-11-03 04:05:40  NULL     1
167  95534560846  1658756323907  localhost       e7ffd55a-77b7-4848-90de-5b5f10cc4500  Down   4.13.1.0  10.10.10.11  9090          2023-04-17 09:50:14  NULL     1
163  95534279505  1582559260879  cs1.failiem.lv  8c254697-9783-11ea-900f-00163e4db64e  Down   4.11.1.0  10.10.10.11  9090          2020-05-16 14:07:09  NULL     1
161  95531601526  1582559325515  cs3.failiem.lv  8c25457e-9783-11ea-900f-00163e4db64e  Down   4.11.1.0  10.10.10.13  9090          2020-05-16 14:07:21  NULL     1
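To make the cloud.mshost rows above easier to read at a glance, here is a small Python sketch that counts entries per service IP and state and lists the msid values currently marked Up. The values are copied from the rows above. The field names (msid, state, service_ip) are assumed from the CloudStack schema, since the post shows values only, and the helper names are made up for illustration. It only summarizes the posted data and does not show that the Down rows are related to the problem.

```python
from collections import Counter

# (id, msid, name, state, service_ip) taken from the cloud.mshost rows above.
# Field names are an assumption; the original post lists values only.
MSHOST = [
    (179, 95534596974, "localhost", "Up", "10.10.10.11"),
    (178, 95536034244, "cs2.failiem.lv", "Down", "10.10.10.12"),
    (176, 95530190206, "localhost", "Down", "10.10.10.13"),
    (175, 95536505104, "localhost", "Down", "10.10.10.11"),
    (174, 95534962877, "localhost", "Down", "10.10.10.11"),
    (172, 95529749065, "localhost", "Down", "10.10.10.13"),
    (170, 95529797928, "localhost", "Down", "10.10.10.13"),
    (167, 95534560846, "localhost", "Down", "10.10.10.11"),
    (163, 95534279505, "cs1.failiem.lv", "Down", "10.10.10.11"),
    (161, 95531601526, "cs3.failiem.lv", "Down", "10.10.10.13"),
]


def count_by_ip_and_state(rows):
    """Count mshost entries for each (service_ip, state) pair."""
    return Counter((ip, state) for _id, _msid, _name, state, ip in rows)


def up_msids(rows):
    """Return the msid values of entries whose state is 'Up'."""
    return [msid for _id, msid, _name, state, _ip in rows if state == "Up"]


counts = count_by_ip_and_state(MSHOST)
for (ip, state), n in sorted(counts.items()):
    print(f"{ip}  {state:<4}  {n}")
print("Up msid values:", up_msids(MSHOST))
```

The only msid marked Up (95534596974) is the same value the three working hosts carry in mgmt_server_id.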

Janis

On 2024-07-03 13:11, Nux wrote:

A shot in the dark, haven't checked the log files properly.
For these hosts in the disconnected state, if you check them in the
DB cloud.host table (type="Routing" btw), which mgmt_server_id are
they reporting?

Then check cloud.mshost table and see whether the management server
with that id is in there and marked as UP etc.

HTH
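The check suggested here can be written as a single join. Below is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the real MariaDB "cloud" database; the tables are reduced to the columns needed, the mshost column names (msid, state) are assumed from the CloudStack schema, and the sample rows are two hosts and two management servers quoted elsewhere in this thread. The function name hosts_without_live_ms is hypothetical.

```python
import sqlite3


def hosts_without_live_ms(conn):
    """Return (id, status, mgmt_server_id) for Routing hosts whose
    mgmt_server_id does not match any mshost row in state 'Up'."""
    sql = """
        SELECT h.id, h.status, h.mgmt_server_id
        FROM host h
        LEFT JOIN mshost m
               ON m.msid = h.mgmt_server_id AND m.state = 'Up'
        WHERE h.type = 'Routing' AND m.id IS NULL
        ORDER BY h.id
    """
    return conn.execute(sql).fetchall()


# In-memory stand-in for cloud.host and cloud.mshost (reduced columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE host (id INTEGER, status TEXT, type TEXT, mgmt_server_id INTEGER);
    CREATE TABLE mshost (id INTEGER, msid INTEGER, state TEXT);
""")
conn.executemany("INSERT INTO host VALUES (?, ?, ?, ?)", [
    (248, "Alert", "Routing", None),
    (170, "Up", "Routing", 95534596974),
])
conn.executemany("INSERT INTO mshost VALUES (?, ?, ?)", [
    (179, 95534596974, "Up"),
    (178, 95536034244, "Down"),
])

orphans = hosts_without_live_ms(conn)
print(orphans)
```

A NULL mgmt_server_id never satisfies the join condition, so such hosts show up in the result alongside hosts that point at a management server marked Down.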

On 2024-07-03 06:57, Janis Viklis | Files.fm wrote:
(sorry, some bad formatting in previous email)

Does anyone have any idea why this error occurs and how to debug
it? (248 is a host id)

Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null

Janis

On 2024-07-01 21:44, Janis Viklis | Files.fm wrote:
Hi,

Looking for help after 2 weeks: what could be the reason that,
right after restarting the 4.13.1 management server, all 4 Xen
(XCP-ng 8.1) hosts of one Intel cluster suddenly disconnect and go
into the "Alert" state with this error:

Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null

I haven't been able to find the reason for 2 weeks. The other AMD
XenServer 6.5 cluster is working just fine.

Everything seems OK: the network is working. I restarted the
toolstack, both system VMs (SSVM, console proxy VM) and one of the
hosts, then removed that host and added it back.

Previously there were 3 management servers behind HAProxy with Galera
MariaDB; I left only one. (I tried upgrading to 4.14.1, which didn't
help.) I can manage the hosts via XenCenter. There are 5 storage pools
and 3 secondary storages.

Thanks, hoping for some clues or directions, Janis.

Below is LOG output:
