[ https://issues.apache.org/jira/browse/KAFKA-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557576#comment-17557576 ]
Guozhang Wang commented on KAFKA-12478:
---------------------------------------

Thanks [~hudeqi], I will take a look at the KIP.

> Consumer group may lose data for newly expanded partitions when adding partitions to a topic if the group is set to consume from the latest
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12478
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12478
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 3.1.1
>            Reporter: hudeqi
>            Priority: Blocker
>              Labels: kip-842
>         Attachments: safe-console-consumer.png, safe-consume.png, safe-produce.png, trunk-console-consumer.png, trunk-consume.png, trunk-produce.png
>
>   Original Estimate: 1,158h
>  Remaining Estimate: 1,158h
>
> This problem surfaced in our production environment: a topic is used to carry monitoring data. *After its partitions were expanded, the consuming side of the business reported data loss.*
>
> A preliminary investigation showed that the lost data was all concentrated in the newly added partitions. The reason: when partitions are added on the broker side, the producer perceives the expansion first and writes some data into the new partitions. The consumer group perceives the expansion later; once the rebalance completes, the newly added partitions are consumed from the latest offset when the group is configured to consume from the latest. During that window, the data already written to the new partitions is skipped and lost by the consumer.
>
> Simply configuring the group of a huge-data-flow topic to consume from the earliest at startup is not a good alternative, because the group would then crazily consume historical data from the brokers, which affects broker performance to a certain extent. Therefore, *it is necessary to consume only these newly added partitions from the earliest, separately.*
>
> I ran a test; the results are in the attached screenshots.
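The race above is easiest to see concretely in a small model. The following Python sketch involves no broker and no real Kafka API: names such as `resolve_start_offset` and `choose_reset` are invented for illustration, and "safe_latest" is modeled only as a per-partition policy choice, not as the actual KIP-842 implementation.

```python
# Toy model of the race described above -- no broker, no real Kafka client.
# All identifiers here are illustrative, not taken from the Kafka codebase.

def resolve_start_offset(reset_policy, log_end_offset):
    """Offset a consumer starts at on a partition with no committed offset."""
    if reset_policy == "latest":
        return log_end_offset   # skip everything already in the partition
    if reset_policy == "earliest":
        return 0                # read from the beginning
    raise ValueError(f"unknown reset policy: {reset_policy}")

# The producer perceives the expansion first: 38 and 41 are already in the
# new partition by the time the consumer group rebalances and picks it up.
log_at_rebalance = ["38", "41"]

# With plain "latest", the consumer starts at the current log-end offset,
# so 38 and 41 are skipped; only records produced afterwards (44) are seen.
start = resolve_start_offset("latest", len(log_at_rebalance))
log = log_at_rebalance + ["44"]          # 44 arrives after the rebalance
lost, consumed = log[:start], log[start:]

# The proposed behavior ("safe_latest"): keep "latest" for partitions that
# existed when the group subscribed, "earliest" for partitions added later.
def choose_reset(partition, initial_partitions):
    return "latest" if partition in initial_partitions else "earliest"

initial = {0, 1}                         # topic had 2 partitions at subscribe time
policies = {p: choose_reset(p, initial) for p in range(3)}  # partition 2 is new
safe_start = resolve_start_offset(policies[2], len(log_at_rebalance))
safe_consumed = log[safe_start:]         # nothing is skipped on partition 2
```

Under "latest" the model loses 38 and 41 exactly as the trunk screenshots show, while the per-partition "earliest" choice for the new partition consumes all three records without forcing the pre-existing partitions to replay history.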
> First, "metadata.max.age.ms" is set to 500 ms for the producer and 30000 ms for the consumer.
>
> _trunk-console-consumer.png_ shows the community (trunk) version of the consumer started with "latest".
> _trunk-produce.png_ shows the data produced: "partition_count" is the number of partitions of the topic at that moment, "message" is the numeric content of each message, and "send_to_partition_index" is the index of the partition the message was sent to. At 11:32:10 the producer perceives the expansion from 2 to 3 partitions and writes the numbers 38, 41, and 44 into the newly added partition 2.
> _trunk-consume.png_ shows everything the community version consumed. 38 and 41, sent to partition 2, were not consumed at first; even after partition 2 was perceived, they were still not consumed. Instead, consumption started from the latest record, 44, so 38 and 41 were discarded.
>
> _safe-console-consumer.png_ shows the fixed version of the consumer started with "safe_latest".
> _safe-produce.png_ shows the data produced. At 12:12:09 the producer perceives the expansion from 4 to 5 partitions and writes the numbers 109 and 114 into the newly added partition 4.
> _safe-consume.png_ shows everything the fixed version consumed. 109, sent to partition 4, was not consumed at first; after partition 4 was perceived, 109 was consumed as the first record of partition 4. So the fixed version does not lose data under this condition.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)