[ https://issues.apache.org/jira/browse/KAFKA-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Divij Vaidya updated KAFKA-14768:
---------------------------------
    Labels: needs-kip performance  (was: performance)

> proposal to reduce the first message's send time cost and max block time for safety
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14768
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14768
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 3.3.1, 3.3.2
>            Reporter: fujian
>            Assignee: hzh0425
>            Priority: Major
>              Labels: needs-kip, performance
>
> Hi, Team:
>
> Nice to meet you!
>
> In our business, we found two issues that need improvement:
>
> *(1) The first message takes a long time to send*
> Sometimes our users' functional interactions take a long time. We
> eventually traced the root cause: after we deploy or restart the servers,
> the first message that the Kafka client delivers on each application
> server takes a long time.
> So we are looking for a way to improve it.
>
> We analyzed the source code for the first send. The time is spent fetching
> metadata before the send. Later sends do not take as long because the
> metadata is cached. The logic is correct and necessary, but we still want
> to improve the experience for the first message's send, which is the
> user's first interaction.
>
> *(2) The blocking time for sending a message can't be reduced to the desired value*
> Sometimes our application's thread blocks for up to max.block.ms when
> sending a message. When we lower max.block.ms to reduce the blocking time,
> the value is sometimes too short for fetching metadata. The root cause is
> that the configured max.block.ms is shared by the "get metadata" operation
> and the "send message" operation.
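To make the shared budget in issue (2) concrete, here is a minimal arithmetic sketch. It assumes the producer subtracts the time spent waiting on metadata from the same max.block.ms budget before the accumulator append. `BlockBudget` and its methods are hypothetical names used only for illustration, not Kafka APIs, and the 4669 ms figure is the cold metadata fetch time quoted in the warmup log of this issue.

```java
// Sketch only: BlockBudget and its methods are hypothetical names, not Kafka APIs.
final class BlockBudget {
    private BlockBudget() {}

    /** Budget left for the accumulator append once the metadata wait has used its share. */
    static long remainingWaitMs(long maxBlockMs, long waitedOnMetadataMs) {
        return Math.max(0L, maxBlockMs - waitedOnMetadataMs);
    }

    /** True if a metadata fetch of the given duration finishes within the budget. */
    static boolean metadataFits(long maxBlockMs, long metadataFetchMs) {
        return metadataFetchMs <= maxBlockMs;
    }

    public static void main(String[] args) {
        long coldFetchMs = 4669L; // cold metadata fetch time from the warmup log in this issue
        for (long maxBlockMs : new long[] {10_000L, 2_000L}) {
            System.out.println("max.block.ms=" + maxBlockMs
                    + " metadataFits=" + metadataFits(maxBlockMs, coldFetchMs)
                    + " remainingForAppendMs=" + remainingWaitMs(maxBlockMs, coldFetchMs));
        }
    }
}
```

Under these assumptions, a 10 s budget absorbs the cold fetch and still leaves time for the append, while a 2 s budget is used up before the metadata arrives. That is why lowering max.block.ms alone is unsafe while the first send still has to fetch metadata.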
> The following table shows where the producer can block:
>
> |*where it blocks*|*when it blocks*|*how long it blocks*|
> |org.apache.kafka.clients.producer.KafkaProducer#waitOnMetadata|the first request, which needs to load the metadata from Kafka|<max.block.ms|
> |org.apache.kafka.clients.producer.internals.RecordAccumulator#append|at business peak times, if the network can't send messages quickly enough|<max.block.ms|
>
> What are the solutions for the above two issues?
> I looked at the current logic and came up with the following possible
> solutions:
> (1) Send one "warmup" message. However, we don't want to send any fake
> messages.
> (2) Provide one extra timeout configuration dedicated to getting metadata.
> However, this would break the definition of max.block.ms.
> (3) Add a method that calls waitOnMetadata with its own timeout setting
> instead of max.block.ms (PR: [KAFKA-14768: provide new method to warmup first
> record's sending and reduce the max.block.ms safely by jiafu1115 · Pull
> Request #13320 · apache/kafka
> (github.com)|https://github.com/apache/kafka/pull/13320])
>
> _note: org.apache.kafka.clients.producer.KafkaProducer#waitOnMetadata_
> ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long
> nowMs, long maxWaitMs)
>
> After the change, we can call the new method before the service is marked
> as ready. Once the service is ready, sends no longer block on fetching
> metadata because it is cached. We can then safely lower max.block.ms to
> reduce the thread's blocking time.
>
> After adopting solution 3, we solved the above issues. For example, we
> reduced the first message's send time by about 4 seconds. See the
> following log:
> _warmup test_topic at phase phase 2: get metadata from mq start_
> _warmup test_topic at phase phase 2: get metadata from mq end consume
> *4669ms*_
> After the change, we also reduced max.block.ms from 10s to 2s without
> worrying about failing to get metadata.
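The warm-up flow described above (fetch metadata under a dedicated timeout, then mark the service ready) could be sketched roughly as follows. This is not the method proposed in PR #13320. `MetadataWarmup`, `MetadataFetcher`, and `warmUp` are hypothetical names, and the fetcher is a stand-in for whatever blocking metadata call the client exposes. One possible candidate is the existing `KafkaProducer#partitionsFor(topic)`, but that call is itself bounded by max.block.ms, so treating it as equivalent is an assumption.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: all names here are hypothetical, not part of the Kafka client.
final class MetadataWarmup {
    private MetadataWarmup() {}

    /** Stand-in for a blocking call that loads metadata for one topic. */
    interface MetadataFetcher {
        void fetch(String topic) throws Exception;
    }

    /**
     * Fetches metadata for each topic in turn, allowing at most `timeout` per topic.
     * Returns true only if every fetch completed in time and without an exception.
     */
    static boolean warmUp(List<String> topics, MetadataFetcher fetcher, Duration timeout) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "metadata-warmup");
            thread.setDaemon(true); // never keep the JVM alive because of a stuck fetch
            return thread;
        });
        try {
            for (String topic : topics) {
                Callable<Void> task = () -> {
                    fetcher.fetch(topic);
                    return null;
                };
                Future<Void> pending = executor.submit(task);
                try {
                    pending.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    pending.cancel(true); // interrupt the fetch and report failure
                    return false;
                } catch (InterruptedException e) {
                    pending.cancel(true);
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated fetcher that only sleeps; a real one would call the Kafka client.
        MetadataFetcher simulated = topic -> Thread.sleep(50);
        boolean ready = warmUp(Arrays.asList("test_topic"), simulated, Duration.ofSeconds(5));
        System.out.println(ready ? "warm-up done, mark service ready" : "warm-up failed");
    }
}
```

The warm-up timeout is deliberately separate from the send-path budget: it can be generous (for example 10 s at startup) while max.block.ms stays low (for example 2 s), which matches the split the issue asks for.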
>
> {*}What are your thoughts on these two issues and the solution I
> proposed{*}? I hope to get your feedback on them.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)