[ 
https://issues.apache.org/jira/browse/KAFKA-16670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844063#comment-17844063
 ] 

sanghyeok An commented on KAFKA-16670:
--------------------------------------

Hi, [~lianetm]! thanks for your comments. your description is clear, and i've 
understand it! 

 

So based on those expectations and back to your example, we don't need to wait 
before calling subscribe (that's handled internally by the 
HeartbeatRequestManager as described above). I wonder if it's the fact that in 
the failed case you're polling 10 times only (vs. 100 times in the successful 
case)?? In order to receive records, we do need to make sure that we are 
calling poll after the assignment has been received (so the consumer issues a 
fetch request for the partitions assigned). Note that even when you poll for 1s 
in your test, a poll that happens before the assignment has been received, will 
block for 1s but it's doomed to return empty, because it is not waiting for 
records from the topics you're interested in (no partitions assigned yet). 
Could you make sure that the test is calling poll after the assignment has been 
received? (I would suggest just polling while true for a certain amount of 
time, no sleeping after the subscribe needed).

 

Sorry, i make you confused. 

I intended to try it 100 times for both failure and success cases, but the code 
was set to attempt only 10 times in failure case. Anyway, as you suggested, I 
proceeded by logging the {{poll()}} attempts. 

!image-2024-05-07-08-34-06-855.png|width=721,height=269!
 # The consumer calls {{poll()}} up to 1000 times.
 # consumer will leave log ("i : " + i) each by each try. 
 # if consumer success to poll not empty record, consumer do countdown of 
countDownLatch. and then, we can check whether countDownLatch is 0. 

!image-2024-05-07-08-36-40-656.png|width=848,height=315!

I waited until it was called 430 times. it means consumer wait for assignment 
during about 430 sec. 

However, consumer could not get their assignment yet. 

!image-2024-05-07-08-38-27-753.png|width=1654,height=289!

However, after receiving the initial FindCoordinator Request, the broker does 
not perform any action. Please see the log above.

Broker don't have any log after 2024-05-06 23:29:27, but by the time of the 
430th attempt, it was already 2024-05-06 23:39:00. 

 

Anyway, it seems that at least one of the consumer or the broker has a 
potential issue.

What do you think? 

> KIP-848 : Consumer will not receive assignment forever because of concurrent 
> issue.
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-16670
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16670
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: sanghyeok An
>            Priority: Major
>         Attachments: image-2024-05-07-08-34-06-855.png, 
> image-2024-05-07-08-36-22-983.png, image-2024-05-07-08-36-40-656.png, 
> image-2024-05-07-08-38-27-753.png
>
>
> *Related Code*
>  * Consumer get assignment Successfully :
>  ** 
> [https://github.com/chickenchickenlove/new-consumer-error/blob/8c1d74db1ec60350c28f5ed25f595559180dc603/src/test/java/com/example/MyTest.java#L35-L57]
>  * Consumer get be stuck Forever because of concurrent issue:
>  ** 
> [https://github.com/chickenchickenlove/new-consumer-error/blob/8c1d74db1ec60350c28f5ed25f595559180dc603/src/test/java/com/example/MyTest.java#L61-L79]
>  
> *Unexpected behaviour*
>  * 
> Broker is sufficiently slow.
>  * When a KafkaConsumer is created and immediately subscribes to a topic
> If both conditions are met, {{Consumer}} can potentially never receive 
> {{TopicPartition}} assignments and become stuck indefinitely.
> In case of new broker and new consumer, when consumer are created, consumer 
> background thread send a request to broker. (I guess groupCoordinator 
> Heartbeat request). In that time, if broker does not load metadata from 
> {{{}__consumer_offset{}}}, broker will start to schedule load metadata. After 
> broker load metadata completely, consumer background thread think 'this 
> broker is valid group coordinator'.
> However, consumer can send {{subscribe}} request to broker before {{broker}} 
> reply about {{{}groupCoordinator HeartBeat Request{}}}. In that case, 
> consumer seems to be stuck.
> If both conditions are met, the {{Consumer}} can potentially never receive 
> {{TopicPartition}} assignments and may become indefinitely stuck. In the case 
> of a new {{broker}} and new {{{}consumer{}}}, when the consumer is created, 
> {{consumer background thread}} start to send a request to the broker. (I 
> believe this is a {{{}GroupCoordinator Heartbeat request{}}}) During this 
> time, if the {{broker}} has not yet loaded metadata from 
> {{{}__consumer_offsets{}}}, it will begin to schedule metadata loading. Once 
> the broker has completely loaded the metadata, the {{consumer background 
> thread}} recognizes this broker as a valid group coordinator. However, there 
> is a possibility that the {{consumer}} can send a {{subscribe request}} to 
> the {{broker}} before the {{broker}} has replied to the {{{}GroupCoordinator 
> Heartbeat Request{}}}. In such a scenario, the {{consumer}} appears to be 
> stuck.
>  
> You can check this scenario, in the 
> {{{}src/test/java/com/example/MyTest#should_fail_because_consumer_try_to_poll_before_background_thread_get_valid_coordinator{}}}.
>  If there is no sleep time to wait {{{}GroupCoordinator Heartbeat 
> Request{}}}, {{consumer}} will be always stuck. If there is a little sleep 
> time, {{consumer}} will always receive assignment.
>  
> README : 
> [https://github.com/chickenchickenlove/new-consumer-error/blob/main/README.md]
>  
> In my case, consumer get assignment in `docker-compose` : it means not enough 
> slow. 
> However, consumer cannot get assignmet in `testcontainers` without little 
> waiting time. : it means enough slow to cause concurrent issue. 
> `testconatiners` is docker in docker, thus `testcontainers` will be slower 
> than `docker-compose`. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to