Hi team, I have 10 high-level consumers connecting to Kafka, and one of them kept complaining about a "conflicted ephemeral node" for about 8 hours. The log was filled with the exception below:
[2015-07-07 14:03:51,615] INFO conflict in /consumers/group/ids/test-1435856975563-9a9fdc6c data: {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"} stored data: {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"} (kafka.utils.ZkUtils$)
[2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node [{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}] at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)

Over the same time span, ZooKeeper reported the exception below:

2015-07-07 22:45:09,687 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26 zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

In the end, ZooKeeper timed out the session and the consumers triggered a rebalance.

I know that the "conflicted ephemeral node" warning exists to handle a ZooKeeper bug where session expiration and ephemeral node deletion are not done atomically. But as the ZooKeeper log indicates, ZooKeeper never got a chance to delete the ephemeral node, which makes me think the session had not actually expired at that time. Yet for some reason ZooKeeper fired a session-expiration event, which in turn invoked ZKSessionExpireListener. Has anyone encountered a similar issue before, and is there anything I can do on the ZooKeeper side to prevent this?

Another problem is that the createEphemeralPathExpectConflictHandleZKBug call is wrapped in a while(true) loop that runs forever until the ephemeral node is created. Would it be better to employ an exponential retry policy with a maximum number of retries, so that the exception has a chance to be rethrown to the caller and handled there in situations like the above?
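To make that second suggestion concrete, here is a minimal sketch of the kind of bounded retry I have in mind: capped exponential backoff that rethrows the last exception once retries are exhausted, so the caller can react (e.g. by failing the rebalance). The helper name and parameters below are made up for illustration; this is not an existing Kafka API.

object BoundedRetry {
  // Retry `op` up to `maxRetries` times with capped exponential backoff.
  // Once retries are exhausted, rethrow the last exception so the caller
  // can decide how to handle it instead of looping forever.
  def retryWithBackoff[T](maxRetries: Int,
                          initialBackoffMs: Long,
                          maxBackoffMs: Long)(op: => T): T = {
    var attempt = 0
    var backoffMs = initialBackoffMs
    while (true) {
      try {
        return op // success: hand the result back to the caller
      } catch {
        case e: Exception =>
          attempt += 1
          if (attempt > maxRetries)
            throw e // retries exhausted: propagate instead of spinning
          Thread.sleep(backoffMs)
          backoffMs = math.min(backoffMs * 2, maxBackoffMs) // double the wait, capped
      }
    }
    throw new IllegalStateException("unreachable") // keeps the compiler happy
  }
}

The body of createEphemeralPathExpectConflictHandleZKBug could then run inside such a wrapper, e.g. retryWithBackoff(maxRetries = 5, initialBackoffMs = 200, maxBackoffMs = 5000) { /* create the ephemeral node */ }, rather than retrying unconditionally.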