[ https://issues.apache.org/jira/browse/KAFKA-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
fujian updated KAFKA-14768:
---------------------------
    Component/s: clients
                     (was: core)

> proposal to reduce the first message's send time cost and max block time for safety
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14768
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14768
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 3.3.1
>            Reporter: fujian
>            Assignee: hzh0425
>            Priority: Minor
>
> Hi, Team:
>
> Nice to meet you!
>
> In our business, we found two issues that need improvement:
>
> *(1) Sending the first message takes a long time*
> Sometimes our users' functional interactions take a long time. We eventually
> found the root cause: after we deploy or restart the servers, the first
> message that the Kafka client sends on each application server takes a long
> time to deliver.
> So we looked for a way to improve this.
>
> After analyzing the source code for the first send, we found that the time
> is spent fetching metadata before the send. Later sends do not take as long
> because the metadata is cached. This logic is correct and necessary, but we
> still want to improve the experience for the first message sent, which is
> the user's first interaction.
>
> *(2) Cannot reduce the send blocking time to the desired value*
> Sometimes our application's threads block for up to max.block.ms when
> sending a message. When we lower max.block.ms to reduce the blocking time,
> the new value is sometimes too short for fetching the metadata. The root
> cause is that the configured max.block.ms is shared by the "get metadata"
> operation and the "send message" operation.
> The following table shows where the producer blocks:
>
> |*where to block*|*when it is blocked*|*how long it will be blocked?*|
> |org.apache.kafka.clients.producer.KafkaProducer#waitOnMetadata|the first request, which needs to load the metadata from Kafka|<max.block.ms|
> |org.apache.kafka.clients.producer.internals.RecordAccumulator#append|at business peak times, if the network can't send messages quickly enough|<max.block.ms|
>
> What are the solutions for these two issues?
> I went through the current logic and came up with the following possible
> solutions:
> (1) Send one "warmup" message. However, we can't send fake messages.
> (2) Provide one extra configuration dedicated to the time allowed for
> getting metadata. However, this would break the definition of max.block.ms.
> (3) Change the method below from private to public, or provide a dedicated
> method for this purpose:
> _private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition,
> long nowMs, long maxWaitMs)_
>
> so that we can call it before the service is marked as ready. Once the
> service is ready, sends won't block to get metadata because it is cached.
> We can then lower max.block.ms to reduce the time our threads spend blocked.
>
> After adopting solution 3, we solved both issues. For example, we reduced
> the first message's send time by about 4 seconds. See the log below:
> _warmup test_topic at phase phase 2: get metadata from mq start_
> _warmup test_topic at phase phase 2: get metadata from mq end consume
> *4669ms*_
> After the change, we also reduced max.block.ms from 10s to 2s without
> worrying about failing to get metadata.
>
> {*}What are your thoughts on these two issues and the solution I
> proposed?{*} If there is no problem with it, I can create a PR to merge. I
> hope to get your feedback on these issues.
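
For illustration, the warm-up step from solution (3) could be sketched roughly as follows. This is only a sketch under stated assumptions, not the proposed patch: it assumes the existing public KafkaProducer#partitionsFor(String) call is used to trigger the metadata fetch (it also waits for topic metadata for at most max.block.ms), instead of exposing waitOnMetadata. The class name ProducerWarmup, the method warmUp, and the 10 ms retry pause are hypothetical. The fetch step is passed in as a function so the sketch can run without a broker.

```java
import java.time.Duration;
import java.util.function.Function;

/** Hypothetical helper: warms up topic metadata before a service is marked ready. */
final class ProducerWarmup {

    /**
     * Calls fetchMetadata for the topic until it succeeds or the deadline passes.
     * With a real client, fetchMetadata could be producer::partitionsFor
     * (an assumption of this sketch), where each attempt blocks for at most
     * max.block.ms.
     *
     * @return true if metadata was fetched before the deadline, false otherwise
     */
    static boolean warmUp(Function<String, ?> fetchMetadata, String topic, Duration deadline) {
        long endNanos = System.nanoTime() + deadline.toNanos();
        do {
            try {
                fetchMetadata.apply(topic);
                return true; // metadata should now be cached for later sends
            } catch (RuntimeException e) {
                // Kafka's TimeoutException is unchecked. A production version
                // would rethrow non-retriable errors instead of retrying them all.
                try {
                    Thread.sleep(10); // short pause so an instant failure does not spin
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        } while (System.nanoTime() < endNanos);
        return false;
    }
}
```

With a real producer this might be called as ProducerWarmup.warmUp(producer::partitionsFor, "test_topic", Duration.ofSeconds(10)) before the service reports ready. Because the overall deadline is separate from max.block.ms, max.block.ms itself could stay small while the warm-up is still given several attempts.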