[ 
https://issues.apache.org/jira/browse/KAFKA-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fujian updated KAFKA-14768:
---------------------------
    Component/s: clients
                     (was: core)

> proposal to reduce the first message's send time cost and max block time for 
> safety 
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14768
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14768
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 3.3.1
>            Reporter: fujian
>            Assignee: hzh0425
>            Priority: Minor
>
> Hi, Team:
>  
> Nice to meet you!
>  
> In our business, we found two issues that need improvement:
>  
> *(1) Sending the first message takes a long time*
> Sometimes users' interactions with our service take a long time. We traced 
> the root cause to deployments and restarts: after a server is deployed or 
> restarted, the first message sent by the Kafka client on each application 
> server takes a long time to deliver.
> So we looked for a way to improve this. 
>  
> After analyzing the source code for the first send, we found that the time 
> is spent fetching metadata before sending. Later sends do not take as long 
> because the metadata is cached. This logic is correct and necessary, but we 
> still want to improve the experience of the first send, which is the user's 
> first interaction. 
>  
> *(2) The blocking time of a send can't be reduced to the desired value*
> Sometimes our application threads block for up to max.block.ms when sending 
> a message. When we lower max.block.ms to reduce the blocking time, the 
> value is sometimes too short for fetching metadata. The root cause is that 
> the configured max.block.ms is shared by the "get metadata" operation and 
> the "send message" operation, as the following table shows:
> |*where to block*|*when it is blocked*|*how long it will be blocked?*|
> |org.apache.kafka.clients.producer.KafkaProducer#waitOnMetadata|the first request, which needs to load the metadata from Kafka|<max.block.ms|
> |org.apache.kafka.clients.producer.internals.RecordAccumulator#append|at peak business time, if the network can’t send messages quickly enough|<max.block.ms|
>  
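
The shared budget in the table above can be illustrated with a small model. This is a simplified sketch, not Kafka source code: the class and method names are hypothetical, and it assumes (based on my reading of the producer's send path) that whatever time the metadata wait consumes is subtracted from max.block.ms before the accumulator append may block. The 4669 ms figure is the metadata fetch time from the log quoted later in this report.

```java
// Simplified model (NOT Kafka source) of one max.block.ms budget being
// shared by the two blocking steps of a send. All names are hypothetical.
final class BlockBudget {

    /** Time left for the accumulator append after the metadata wait. */
    static long remainingWaitMs(long maxBlockMs, long waitedOnMetadataMs) {
        return Math.max(0, maxBlockMs - waitedOnMetadataMs);
    }

    /** True if the metadata fetch alone would exceed the budget. */
    static boolean metadataWaitTimesOut(long maxBlockMs, long metadataFetchMs) {
        return metadataFetchMs > maxBlockMs;
    }

    public static void main(String[] args) {
        long coldFetchMs = 4669; // cold metadata fetch time reported in this issue
        long warmFetchMs = 0;    // assumption: cached metadata costs roughly nothing

        for (long maxBlockMs : new long[] {10_000, 2_000}) {
            System.out.println("max.block.ms=" + maxBlockMs
                    + " cold: timesOut=" + metadataWaitTimesOut(maxBlockMs, coldFetchMs)
                    + " remaining=" + remainingWaitMs(maxBlockMs, coldFetchMs)
                    + " | warm: remaining=" + remainingWaitMs(maxBlockMs, warmFetchMs));
        }
    }
}
```

Under these assumptions, a 2s budget is too small for a cold metadata fetch of this size, while the same budget is fully available to the append step once metadata is cached.
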
> What are the solutions to the above two issues?
> I thought about the current logic and came up with the following possible 
> solutions:
> (1) Send one "warmup" message. However, we can't send any fake messages.
> (2) Provide one extra configuration dedicated to the time for getting 
> metadata. However, this would break the definition of max.block.ms.
> (3) Change the following method from private to public, or provide a 
> dedicated method for this purpose:
> _private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, 
> long nowMs, long maxWaitMs)_
>  
> so that we can call it before the service is marked as ready. After the 
> service is ready, sends won't block on fetching metadata because it is 
> cached. We can then lower max.block.ms to reduce the time threads spend 
> blocked.
>  
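
The warmup step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the change proposed in this issue: `MetadataWarmup` and `warmUp` are hypothetical names, and the metadata lookup is passed in as a plain function so the example has no Kafka dependency. With a real producer, one candidate for that function is the existing public `KafkaProducer#partitionsFor(String)`, which to my knowledge also waits for topic metadata (bounded by max.block.ms) and would be passed as `producer::partitionsFor`.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Kafka: pre-loads topic metadata before
// the service is marked as ready.
final class MetadataWarmup {

    /**
     * Calls the lookup once per topic and returns how many lookups succeeded.
     * A failed lookup is logged and skipped so startup can continue; whether
     * that is acceptable is an application decision.
     */
    static int warmUp(List<String> topics, Function<String, ?> partitionLookup) {
        int warmed = 0;
        for (String topic : topics) {
            long startMs = System.currentTimeMillis();
            try {
                partitionLookup.apply(topic);
                warmed++;
                System.out.println("warmup " + topic + " took "
                        + (System.currentTimeMillis() - startMs) + "ms");
            } catch (RuntimeException e) {
                System.err.println("warmup failed for " + topic + ": " + e);
            }
        }
        return warmed;
    }

    public static void main(String[] args) {
        // Stand-in lookup for the demo; a real service would pass
        // producer::partitionsFor here (assumption, see the text above).
        Function<String, List<Integer>> fakeLookup = topic -> List.of(0, 1, 2);
        int warmed = warmUp(List.of("test_topic"), fakeLookup);
        System.out.println("warmed " + warmed + " topic(s)");
    }
}
```

The warmup would run with a generous timeout during startup, after which the producer used for business traffic can rely on cached metadata.
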
> After adopting solution (3), we solved the above issues. For example, we 
> reduced the first message's send time by about 4 seconds. See the following 
> log:
> _warmup test_topic at phase phase 2: get metadata from mq start_
> _warmup test_topic at phase phase 2: get metadata from mq end consume 
> *4669ms*_
> And after the change, we reduced max.block.ms from 10s to 2s without 
> worrying about failing to get metadata.
>  
> {*}So what are your thoughts on these two issues and the solution I 
> proposed{*}? If there are no objections, I can create a PR. I look forward 
> to your feedback on these issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
