I've had a look at this series without looking at the second series that
extends this (yet). As Dano noted, there are multiple TODOs for the
final ha-manager patch in this series that tries to persist the
migration from groups to rules. During my testing I quickly ran into
some problems with this patch. I noted some of the errors I encountered
and how they could be fixed in the respective patch.

With that said, I tested the following on a 3-node cluster:
- Migration of resource in when node with higher priority becomes
available (tested also that no migration occurs if nofailback is active)
- Stopping node the resource is on from outside correctly triggers
recovery to another node
- Resource is stopped if none of the nodes in the group restricted group
is available
- Same case as before, but node keeps running on node outside of the
group in case of unrestricted group
- Resource is migrated back once a node from its group comes online
again if nofailback is not checked
- Resource stays on current node even if its preferred node comes online
again if nofailback is checked

The in-memory migration seemed to work well, I did not really notice any
problems there.

The persistent migration failed due to the errors describes in
ha-manager 15/15. I applied the changes I proposed in my comments to
that patch and tested that the migration produced a correct rules.cfg. I
tested the various possible group configurations and checked that the
resulting rules.cfg contained the correct nodes and resources.

I noticed the following regarding persistent migration:
- The resources.cfg is empty after the migration. I have not yet been
able to find out why, but I'll try to send a fix for that.
- The nofailback flag is (due to above) not correctly migrated

The migrations based on the persisted rules seemed to work (apart from
failback). Of course, without the second series, the current state is
that the groups are migrated and then are simply gone and there is
nothing to replace them, so it's not usable with the persistent
migration. I'll have a look at the second series as well and at the
tests for this one.

On 7/4/25 20:16, Daniel Kral wrote:
> RFC v1: 
> https://lore.proxmox.com/pve-devel/20250325151254.193177-1-d.k...@proxmox.com/
> RFC v2: 
> https://lore.proxmox.com/pve-devel/20250620143148.218469-1-d.k...@proxmox.com/
> 
> I've separated the core HA Rules module and the transformation from HA
> groups to HA Node Affinity rules (formerly known as HA Location rules)
> in this patch series, to reduce the overhead for reviewers and strive
> for a better version history, as changing two things at a time is rather
> confusing.
> 
> The main things that have changed since the last version (v2):
> 
> - split up the patch series (ofc)
> 
> - rebased on newest available master
> 
> - renamed "HA Location Rule" to "HA Node Affinity Rule"
> 
> - renamed any reference of a 'HA service' to 'HA resource' (e.g. rules
>   property 'services' is now 'resources')
> 
> - converted tri-state property 'state' to a binary 'disable' flag on HA
>   rules and expose the 'contradictory' state with an 'errors' hash
> 
> - remove the "use-location-rules" feature flag and implement a more
>   straightforward ha groups migration (more on that below)
> 
> - remove any reference of ha groups from the web interface
> 
> 
> 
> As before, HA groups are migrated to HA node affinity rules in each HA
> Manager round where something has changed about the HA groups / HA
> resources config file, but these are now unconditionally done as soon as
> a HA Manager runs with that version. It will also try to persistently
> migrate these, but that will only be successful as soon as all other
> nodes are upgraded (i.e. every node can run at least the HA Manager
> version that can successfully parse and apply the HA rules).
> 
> 
> There are still some things left to do, which I didn't get the time to
> come around to do for this revision:
> 
> - Testing, testing, testing
> 
> - I've ran out of time on the persistent HA groups migration part, which
>   has at least the two TODOs, which are mentioned in the patch itself,
>   and I haven't tested them on any real PVE upgrade yet; It's more of a
>   draft on how the migration should potentially work
> 
> - Also, the last patch for the persistent HA groups migration part will
>   fail the tests but the two that have been added, because of the way
>   the other tests are designed; that should be abstracted away in the HA
>   environment, e.g., a routine "have_groups_been_migrated" for PVE2/Sim.
> 
> - There might be a bit too many in-memory group migrations on the HA
>   Rules API side now, but better safe then sorry, maybe they can be
>   removed later; however, these shouldn't overwrite the rules that come
>   from the config, I haven't checked on that yet
> 
> - Should the HA Groups API (and the HA Resources 'group' property in the
>   HA Resources API) be removed now? Or should these stay and uses of
>   them make auto-migrations to the HA Rules?
> 
> As in the previous revisions, I've run a
> 
>     git rebase master --exec 'make clean && make deb'
> 
> on the series, so the tests should work for every patch.


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Reply via email to