[ 
https://issues.apache.org/jira/browse/KAFKA-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020269#comment-14020269
 ] 

Hang Qi commented on KAFKA-1475:
--------------------------------

Not exactly. Let me recap. 

In the original description, step 3 reads:
3. After reading back the content, it finds the content is the same as what it
was going to write, so it logs "[ZkClient-EventThread-428-ZK/kafka]
(kafka.utils.ZkUtils$) -
/consumers/consumerA/ids/consumerA-1400815740329-5055aeee exists with value
{"pattern":"static", "subscription":{"TOPIC": 1}, "timestamp":"1400846114845",
"version":1}"

The timestamp 1400846114845 corresponds to 11:55:14,845 UTC 2014, which is
07:55:14,845 EDT (our logs use EDT).
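
A quick way to double-check that conversion (plain java.time, nothing
Kafka-specific):

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.ZoneOffset;

    public class TimestampCheck {
        public static void main(String[] args) {
            Instant ts = Instant.ofEpochMilli(1400846114845L);
            // prints 2014-05-23T11:55:14.845Z
            System.out.println(ts.atZone(ZoneOffset.UTC));
            // prints 2014-05-23T07:55:14.845-04:00[America/New_York], i.e. EDT
            System.out.println(ts.atZone(ZoneId.of("America/New_York")));
        }
    }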

Looking at my second comment and also the logs in the attachment:
1. 07:55:14 the zk client got a session expiration and propagated the event to
the watcher; thus the kafka client tried to re-create the ephemeral node
'/kafka8Agg/consumers/myconsumergroup/ids/myconsumergroup_ooo0001-1400815740329-5055aeee'

That is the time when the kafka client became aware that the zk session had
expired and tried to re-create the ephemeral node.

However, based on the following log lines:

3. 07:55:26 the zk sender did not hear back from the server for the connect
request, so it closed the connection and tried to reconnect.
4. 07:55:40 the zk sender established a connection with mbus0005 and got
session 0x545f6dc6f510757.
5. 07:55:40 the zk sender got the response for creating the ephemeral node; the
zk server responded that the node exists (response code -110, decoded below).
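
(For reference, -110 is ZooKeeper's NODEEXISTS error code; it can be decoded
directly from the client library:)

    import org.apache.zookeeper.KeeperException;

    public class ErrorCodeCheck {
        public static void main(String[] args) {
            // -110 maps to NODEEXISTS in ZooKeeper's KeeperException.Code enum
            System.out.println(KeeperException.Code.get(-110));
        }
    }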

The zk client successfully connected to the zk cluster at 07:55:40, and then
found that the ephemeral node had been created and its content was the same as
what it was about to write.

But the owner of this ephemeral node is 163808312244176699 = 0x245f6cac692073b,
which is neither the session before the expiration (0x345f6cac6ed071d) nor the
session afterwards (0x545f6dc6f510757).
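
The owner comes from the node's ephemeralOwner stat field. A minimal sketch of
how one could compare it against the current session, using the raw ZooKeeper
client (the path and names here are placeholders, not the actual consumer code):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class OwnerCheck {
        static void checkOwner(ZooKeeper zk, String path) throws Exception {
            Stat stat = zk.exists(path, false);
            if (stat != null) {
                long owner = stat.getEphemeralOwner(); // 163808312244176699 in our case
                long mySession = zk.getSessionId();
                System.out.printf("owner=0x%x, current session=0x%x%n", owner, mySession);
                // owner != mySession means some other (possibly orphaned) session
                // still holds the registration node.
            }
        }
    }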

So this looks very weird to me. My guess is that, between 07:55:14 and
07:55:26, the zk client somehow sent the create request to mbus0002, mbus0002
processed it successfully, but the response was never read by the client, and
the session 0x245f6cac692073b was created at that time.
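
One detail that seems to support this guess: as far as I understand ZooKeeper's
session-id scheme, the most significant byte of the session id is the id of the
server that created the session, so the three ids above would have come from
three different servers:

    public class SessionServerId {
        public static void main(String[] args) {
            long[] sessions = {0x345f6cac6ed071dL, 0x245f6cac692073bL, 0x545f6dc6f510757L};
            for (long s : sessions) {
                // Top byte of the session id = id of the server that created it
                // (assuming the usual ZooKeeper session-id layout).
                System.out.printf("session 0x%x created by server %d%n", s, s >>> 56);
            }
        }
    }

If the myid values line up with the host names, server 2 would indeed be
mbus0002, which fits the lost-response theory.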

> Kafka consumer stops LeaderFinder/FetcherThreads, but application does not 
> know
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-1475
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1475
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 0.8.0
>         Environment: linux, rhel 6.4
>            Reporter: Hang Qi
>            Assignee: Neha Narkhede
>              Labels: consumer
>         Attachments: 5055aeee-zk.txt
>
>
> We encountered an issue of consumers not consuming messages in production.
> (This consumer has its own consumer group and consumes just one topic with 3
> partitions.)
> Based on the logs, we have following findings:
> 1. The Zookeeper session expired; the kafka high-level consumer detected this
> event, released the old broker partition ownership, and re-registered the
> consumer.
> 2. Upon creating the ephemeral path in Zookeeper, it found that the path
> still exists and tried to read the content of the node.
> 3. After reading back the content, it found the content is the same as what
> it was going to write, so it logged "[ZkClient-EventThread-428-ZK/kafka]
> (kafka.utils.ZkUtils$) -
> /consumers/consumerA/ids/consumerA-1400815740329-5055aeee exists with value {
> "pattern":"static", "subscription":{ "TOPIC": 1},
> "timestamp":"1400846114845", "version":1 } during connection loss; this is
> ok", and did nothing.
> 4. After that, it threw an exception during rebalance indicating that the
> cause is "org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for
> /consumers/consumerA/ids/consumerA-1400815740329-5055aeee".
> 5. After all retries failed, it gave up retrying and left the
> LeaderFinderThread and FetcherThread stopped.
> Step 3 looks very weird; checking the code, there is a timestamp contained in
> the stored data, so it may be caused by a Zookeeper issue.
> But what I am wondering is whether it is possible to let the application
> (kafka client users) know that the underlying LeaderFinderThread and
> FetcherThread are stopped, for example by allowing the application to
> register a callback or by throwing an exception (by invalidating the
> KafkaStream iterator, for example). For me, it is not reasonable for the
> kafka client to shut everything down and wait for the next rebalance while
> the application waits on iterator.hasNext() without knowing that something is
> wrong underneath.
> I've read the wiki about the kafka 0.9 consumer rewrite, and there is a
> ConsumerRebalanceCallback interface, but I am not sure how long it will take
> to be ready, and how long it will take for us to migrate.
> Please help look into this issue.  Thanks very much!
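
As an aside on the last point in the description above: until there is a proper
callback, one partial mitigation we are considering (not a fix for the
underlying bug) is setting consumer.timeout.ms on the 0.8 high-level consumer,
so that the iterator throws ConsumerTimeoutException instead of blocking
forever. It cannot tell "fetchers stopped" apart from "topic is idle", but at
least the application regains control. A rough sketch (connection string and
topic name are placeholders):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.ConsumerTimeoutException;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class TimeoutWorkaroundSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zkhost:2181/kafka8Agg"); // placeholder
            props.put("group.id", "myconsumergroup");                // placeholder
            props.put("consumer.timeout.ms", "30000"); // throw instead of blocking forever

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, Integer> topicCounts = new HashMap<String, Integer>();
            topicCounts.put("TOPIC", 1);
            KafkaStream<byte[], byte[]> stream =
                    connector.createMessageStreams(topicCounts).get("TOPIC").get(0);
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            try {
                while (it.hasNext()) {
                    byte[] message = it.next().message();
                    // process message ...
                }
            } catch (ConsumerTimeoutException e) {
                // No message within consumer.timeout.ms: either the topic is idle or
                // the fetcher threads are gone; the application can alert or restart.
            } finally {
                connector.shutdown();
            }
        }
    }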



