Kea HA detects partner failures via the control channel in the first
place. The servers constantly exchange heartbeats and lease updates. If
this communication is healthy (i.e., servers receive responses to the
control commands), the counters of unacked clients are set to 0, and the
servers do not monitor whether their partners respond to DHCP. In other
words, if the servers can communicate with each other, the
"max-unacked-clients" setting has no effect.
Partner failure detection in Kea HA always begins with communication
failure over the control channel. Usually, it is caused by a partner
process crash or a network failure. The server tries to send heartbeats
and lease updates to the partner, for which it gets no responses.
Suppose the "max-response-delay" setting is 60000 (1 minute) and the
"heartbeat-delay" is 10000 (10 seconds). In that case, the server sends
a heartbeat every 10 seconds to the partner. If the partner doesn't
respond, the server sends the heartbeat again 10 seconds later, and so
on. It neither counts the unacked clients nor transitions to the
partner-down state because the communication issue can be temporary.
Finally, after around six unsuccessful heartbeat attempts (6 * 10
seconds), the communication interruption becomes longer than 1 minute
(60000 ms). At that point, the server assumes there may be an issue with
the partner. That is when the "max-unacked-clients" setting finally
starts to matter.
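
To make the arithmetic above concrete, here is a minimal sketch. The
function names are hypothetical and are not part of Kea; the only inputs
are the two settings discussed above, and the strict "longer than"
comparison is an assumption based on the wording in this email.

```python
import math


def failed_heartbeats_before_interruption(heartbeat_delay_ms: int,
                                          max_response_delay_ms: int) -> int:
    """Roughly how many unanswered heartbeats fit into max-response-delay."""
    return math.ceil(max_response_delay_ms / heartbeat_delay_ms)


def communication_interrupted(ms_since_last_response: int,
                              max_response_delay_ms: int) -> bool:
    """True once the partner has been silent for longer than the limit."""
    return ms_since_last_response > max_response_delay_ms
```

With "heartbeat-delay" set to 10000 and "max-response-delay" set to
60000, the first function yields the six attempts mentioned above.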
The server begins to analyze the DHCP messages sent to the partner
server. The "secs" field in DHCPv4 and "Elapsed Time Option" in DHCPv6
should be set by a client to indicate how long the client has been
trying to ask for a new lease or rebind an existing lease. Obviously,
the server can't see if the partner responds to these queries. It only
gets copies of the DHCP messages sent by the clients. The clients must
increase these values each time they retry to obtain a lease. If these
values are zero, the server suspecting that the partner is down cannot
determine whether the partner actually responds to DHCP. If these values
are greater than 0 (and greater than "max-ack-delay"), the server can
assume that the partner hasn't responded because the clients are
retrying.
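
The comparison described above can be sketched as follows. This is an
illustration only, not the Kea implementation: the function names are
made up, and it assumes "max-ack-delay" is expressed in milliseconds
(default 10000). The DHCPv4 "secs" field counts seconds, and the DHCPv6
Elapsed Time option counts hundredths of a second (RFC 8415), so both
are converted to milliseconds first.

```python
def dhcpv4_waited_ms(secs: int) -> int:
    """DHCPv4 "secs" field: whole seconds since the client started."""
    return secs * 1000


def dhcpv6_waited_ms(elapsed_time: int) -> int:
    """DHCPv6 Elapsed Time option: hundredths of a second."""
    return elapsed_time * 10


def looks_unanswered(waited_ms: int, max_ack_delay_ms: int = 10000) -> bool:
    """True if the client has been retrying for longer than max-ack-delay."""
    return waited_ms > max_ack_delay_ms
```

A first attempt with "secs" equal to 0 therefore never counts, which is
why clients that do not increase the value cannot be used to detect a
failed partner.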
For every client that sends a DHCPDISCOVER or DHCPREBIND to the partner
server and (finally) sets the "secs" field value greater than
"max-ack-delay", the other server bumps up its internal counter of
unacked clients. Again, it only does so when it has been unable to
communicate with the partner server over the control channel longer than
the configured "max-response-delay". A single successful heartbeat over
the control channel will clear the counters of the unacked clients and
make the server believe that the partner is healthy. It will also stop
looking at the "secs" and "Elapsed Time" values. The
"max-unacked-clients" no longer matters until the next communication
issue over the control channel.
If the "max-unacked-clients" value is exceeded, the server can finally
transition to the partner-down state and handle both the traffic
directed to itself and the traffic directed to the inoperative partner.
Since state transitions are only carried out after a heartbeat attempt,
there may be a slight delay between exceeding the "max-unacked-clients"
value and actually transitioning to the partner-down state.
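
As a rough end-to-end illustration of the sequence described so far,
here is a small model of one server watching its partner. The class and
method names are hypothetical, and the model deliberately ignores every
HA state other than partner-down. It only shows how the three settings
interact: queries are counted only while communication is interrupted, a
successful heartbeat clears the counters, and the transition is
evaluated only on a heartbeat attempt.

```python
from dataclasses import dataclass, field


@dataclass
class HaMonitor:
    """Hypothetical model of one server monitoring its partner."""
    max_response_delay_ms: int = 60000
    max_ack_delay_ms: int = 10000
    max_unacked_clients: int = 10
    last_response_ms: int = 0          # time of the last control-channel reply
    unacked: set = field(default_factory=set)
    partner_down: bool = False

    def interrupted(self, now_ms: int) -> bool:
        # Control channel silent for longer than max-response-delay.
        return now_ms - self.last_response_ms > self.max_response_delay_ms

    def on_partner_query(self, now_ms: int, client: str, secs: int) -> None:
        # A copy of a query the partner is expected to answer. It is only
        # inspected while the control channel is considered interrupted.
        if self.interrupted(now_ms) and secs * 1000 > self.max_ack_delay_ms:
            self.unacked.add(client)

    def on_heartbeat(self, now_ms: int, success: bool) -> None:
        if success:
            # One successful heartbeat clears the unacked-client counters.
            self.last_response_ms = now_ms
            self.unacked.clear()
        elif (self.interrupted(now_ms)
              and len(self.unacked) > self.max_unacked_clients):
            # The transition is decided here, not when the query arrives.
            self.partner_down = True
```

For example, with "max-unacked-clients" set to 2, failed heartbeats
every 10 seconds, and three retrying clients seen at 75 seconds, the
model flips to partner-down only at the heartbeat attempt at 80 seconds.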
I looked into our documentation and realized that although all of these
pieces are described there, it can be confusing because the ARM lacks a
sequence diagram or an example of how the failover process can look end
to end. That's something we should address.
Going over the previous emails, I see that users can see different
failover strategies, depending on the types of failures they are likely
to experience in their setup. They are interesting cases, and we will
discuss them internally. We could consider some alternative failure
detection strategies, selectable with the HA configuration, but we
should be aware that there is no one-size-fits-all solution. There is
always a possibility that a true failure won't be detected or a false
failure will be detected, leading to a split-brain situation.
It would be useful if you could open tickets in GitLab describing your
failover scenarios and the desired behavior. Please disregard this
request if you have already opened them.
Kind regards,
Marcin Siodelski
Sr. Software Engineer,
ISC
On 10.01.2023 03:07, Eric Graham wrote:
This is my understanding of how the unacked clients functionality works.
My explanation is based upon the DHCP4 source code and may differ for
DHCP6. I will include references at the bottom of my email which I
encourage double-checking for accuracy. I am not a contributor to Kea
and have not thoroughly tested the conclusions I draw here.
1. The DHCP packet enters Kea. The HA hook receives the packet in the
buffer4Receive[1] function. The packet contents are parsed and dropped
if invalid.
2. The packet is checked to be in scope [2][3][4][5][6] (and if it
isn't, the packet status is set to NEXT_STEP_DROP [21]). Whether a
packet is in scope is decided by the following:
a. If the packet is not of a type that HA handles (that is, it is not
one of DHCPDISCOVER, DHCPREQUEST, DHCPDECLINE, DHCPRELEASE, or
DHCPINFORM [7]), then the current server will process it [8].
b. If HA is configured in load balancing mode, the packet is classed
according to the aforementioned HBA defined in RFC 3074 section 6
[9][10]. The HBA returns the server that must handle the packet (either
primary or secondary). Otherwise (server is in hot standby), the packet
is classed as belonging to the primary server in the HA configuration
[12]. The class given in either of these conditions is the defined name
of the respective server, coming from the HA section of the Kea DHCP4
configuration [13].
c. The current server will process the packet if it is serving
packets with the class determined in (2)(b).
Note: every heartbeat, the servers send each other their scopes [15]. A
failed heartbeat sets the HA status to "unavailable" [24], which
eventually transitions the server to partner down state.
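
The scope decision in step 2 can be sketched as below. This is not the
Kea code. The hash is a stand-in and NOT the RFC 3074 algorithm, the
modulo step that maps a hash to a server is an assumption, and the
server names "server1" and "server2" are placeholders for whatever names
the HA configuration defines.

```python
# Message types that HA distributes between the servers (step 2a).
HA_TYPES = {"DHCPDISCOVER", "DHCPREQUEST", "DHCPDECLINE",
            "DHCPRELEASE", "DHCPINFORM"}


def stand_in_hash8(client_key: bytes) -> int:
    """Placeholder 8-bit hash; the real one is defined in RFC 3074."""
    return sum(client_key) % 256


def in_scope(msg_type, client_key, mode, served_scopes,
             hash8=stand_in_hash8, primary="server1", secondary="server2"):
    """Return True if this server should process the packet."""
    if msg_type not in HA_TYPES:
        return True                      # 2a: not an HA-handled type
    if mode == "load-balancing":
        # 2b: the hash picks the server responsible for this client
        # (assumption: even hash -> primary, odd hash -> secondary).
        owner = primary if hash8(client_key) % 2 == 0 else secondary
    else:
        owner = primary                  # 2b: hot standby, always the primary
    return owner in served_scopes        # 2c: do we serve that scope?
```

A server in the partner-down state would pass both names in
served_scopes, which is how it ends up answering for its partner too.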
3. If the server is in a communication interrupted state and the packet
is not classed for the current server, then:
a. Maintain a global counter, incrementing it once per packet (every
successful heartbeat counts as a "poke" for the partner [16], which
resets this global counter to zero [17]).
b. Get the "secs" field of the packet. Compare the value to the value
configured in the Kea DHCP4 configuration for "max-ack-delay" [18], or
10 seconds by default [19]. If the value of this field is greater than
the max-ack-delay, the packet is considered unacked [20]. All packets
(unacked or not) are kept track of in a map containing the hardware
address, client ID, and last unacked status; if the packet is being
received unacked, and it has not been previously recorded as being
unacked (that is, the packet secs field just exceeded the max-ack-delay
threshold for the first time), the server logs a warning message.
4. A failure is detected if the number of packets in the unacked state
is greater than the "max-unacked-clients" setting of the Kea DHCP4
config [22] (or 10 by default [19]). If a failure is detected, the
server eventually transitions to partner-down state [23]. More
information about when exactly the server transitions to partner-down
state is shown by the usages of HAService::shouldPartnerDown() [25] (in
other words, I'm not digging into that tonight).
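
Steps 3 and 4 can be summarized in a short sketch. It is written from
the description above rather than from the source code, so the class,
its method names, and the log text are all hypothetical. It also assumes
that a client stays marked as unacked until the next successful
heartbeat, which the description above does not state explicitly.

```python
import logging

log = logging.getLogger("ha-sketch")


class UnackedClients:
    """Hypothetical tracker of clients the partner should have answered."""

    def __init__(self, max_ack_delay_ms=10000, max_unacked_clients=10):
        self.max_ack_delay_ms = max_ack_delay_ms
        self.max_unacked_clients = max_unacked_clients
        self.clients = {}  # (hwaddr, client_id) -> last unacked status

    def record(self, hwaddr, client_id, secs):
        """Step 3b: remember the client and whether it looks unacked."""
        key = (hwaddr, client_id)
        unacked = secs * 1000 > self.max_ack_delay_ms
        was_unacked = self.clients.get(key, False)
        if unacked and not was_unacked:
            # Warn only the first time the threshold is crossed.
            log.warning("partner has not answered client %s", hwaddr)
        # Assumption: the unacked flag is sticky until poke() is called.
        self.clients[key] = was_unacked or unacked

    def poke(self):
        """Step 3a: a successful heartbeat resets the tracking."""
        self.clients.clear()

    def failure_detected(self):
        """Step 4: more unacked clients than max-unacked-clients."""
        return sum(self.clients.values()) > self.max_unacked_clients
```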
[1]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L60-L111
[2]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1021
[3]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1029-L1047
[4]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1034
[5]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L376
[6]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L382-L414
[7]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L51-L71
[8]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L395
[9]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L416-L446
[10]: https://www.rfc-editor.org/rfc/rfc3074
[11]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L413
[12]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/query_filter.cc#L398
[13]:
https://kea.readthedocs.io/en/kea-2.2.0/arm/hooks.html#load-balancing-configuration
[14]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L617-L625
[15]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1757-L1758
[16]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1793-L1794
[17]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L274
[18]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L180-L181
[19]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config.cc#L166
[20]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/communication_state.cc#L652
[21]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_impl.cc#L104
[22]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_config_parser.cc#L184-L185
[23]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1097
[24]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1799
[25]:
https://gitlab.isc.org/isc-projects/kea/-/blob/c4c53a0168ffa385c387ba685ac16e5544feaad4/src/hooks/dhcp/high_availability/ha_service.cc#L1081-L1106
Eric Graham
DevOps Specialist
Direct: 605.990.1859
[email protected]
------------------------------------------------------------------------
*From:* Kea-users <[email protected]> on behalf of Kevin
P. Fleming <[email protected]>
*Sent:* Monday, January 9, 2023 12:38 PM
*To:* [email protected] <[email protected]>
*Subject:* Re: [Kea-users] Load-Balancing Network issue between Relay
and Kea
On Mon, Jan 9, 2023, at 11:54, Veronique Lefebure wrote:
Very interesting thread.
Mathias, you wrote "Expected behaviour: Kea 2 sees the unacked clients
of Kea 1 and sets Kea 1 in partner-down state and handles all
requests.", but if there is no traffic between DHCP clients and Kea1,
then the value of max-unacked-clients on server1 cannot increase
anyway, right? In other words, Kea2 cannot "see" anything?
It can 'see', because it *also* saw all of the client requests and knows
which ones it expected to be handled by Kea1 (as noted earlier in the
thread it even emits a log message indicating this).
Forgive my presumption, but I assumed that 'max-unacked-clients' would
be a counter of 'unacked clients' which belong to a Kea server *other
than this one*. I don't immediately know how counting the number of
clients *this server* has not acked would be useful, although I won't be
surprised to learn that it is useful to someone.