Jason Gustafson created KAFKA-7601:
--------------------------------------

             Summary: Handle message format downgrades during upgrade of 
message format version
                 Key: KAFKA-7601
                 URL: https://issues.apache.org/jira/browse/KAFKA-7601
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson


During an upgrade of the message format, there is a short time during which the 
configured message format version is not consistent across all replicas of a 
partition. As long as all brokers are on the same version, this typically does 
not cause any problems. Followers will take whatever message format is used by 
the leader. However, it is possible for leadership to change several times 
between brokers which support the new format and those which support the old 
format. This can cause the version used in the log to flap between the 
different formats until the upgrade is complete. 

For example, suppose broker 1 has been updated to use v2 and broker 2 is still 
on v1. When broker 1 is the leader, all new messages will be written in the v2 
format. When broker 2 is leader, v1 will be used. If there is any instability 
in the cluster or if completion of the update is delayed, then the log will be 
seen to switch back and forth between v1 and v2. Once the update is completed 
and broker 1 begins using v2, then the message format will stabilize and 
everything will generally be ok.

Downgrades of the message format are problematic, even if they are just 
temporary. There are basically two issues:

1. We use the configured message format version to tell whether down-conversion 
is needed. We assume that the this is always the maximum version used in the 
log, but that assumption fails in the case of a downgrade. In the worst case, 
old clients will see the new format and likely fail.

2. The logic we use for finding the truncation offset during the become 
follower transition does not handle flapping between message formats. When the 
new format is used by the leader, then the epoch cache will be updated 
correctly. When the old format is in use, the epoch cache won't be updated. 
This can lead to an incorrect result to OffsetsForLeaderEpoch queries.

For the second point, the specific case we observed is something like this. 
Broker 1 is the leader of epoch 0 and writes some messages to the log using the 
v2 message format. Broker 2 then becomes the leader for epoch 1 and writes some 
messages in the v2 format. On broker 2, the last entry in the epoch cache is 
epoch 0. No entry is written for epoch 1 because it uses the old format. When 
broker 1 became a follower, it send an OffsetsForLeaderEpoch query to broker 2 
for epoch 0. Since epoch 0 was the last entry in the cache, the log end offset 
was returned. This resulted in localized log divergence.

There are a few options to fix this problem. From a high level, we can either 
be stricter about preventing downgrades of the message format, or we can add 
additional logic to make downgrades safe. 

(Disallow downgrades): As an example of the first approach, the leader could 
always use the maximum of the last version written to the log and the 
configured message format version. 

(Allow downgrades): If we want to allow downgrades, then it make makes sense to 
invalidate and remove all entries in the epoch cache following the message 
format downgrade. We would also need a solution for the problem of detecting 
when down-conversion is needed for a fetch request. One option I've been 
thinking about is enforcing the invariant that each segment uses only one 
message format version. Whenever the message format changes, we need to roll a 
new segment. Then we can simply remember which format is in use by each segment 
to tell whether down-conversion is needed for a given fetch request.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to