Hi team,

I have 10 high-level consumers connecting to Kafka, and one of them kept
complaining about a "conflicted ephemeral node" for about 8 hours. The log
was filled with the exception below:

[2015-07-07 14:03:51,615] INFO conflict in
/consumers/group/ids/test-1435856975563-9a9fdc6c data:
{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}
stored data:
{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"}
(kafka.utils.ZkUtils$)
[2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}]
at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
different session, hence I will backoff for this node to be deleted by
Zookeeper and retry (kafka.utils.ZkUtils$)

In the meantime, ZooKeeper reported the exception below for the same time span:

2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when
processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
= NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

In the end, ZooKeeper timed out the session and the consumers triggered a
rebalance.

I know the "conflicted ephemeral node" warning exists to handle a ZooKeeper
bug where session expiration and ephemeral node deletion are not done
atomically. But as the ZooKeeper log indicates, ZooKeeper never got a chance
to delete the ephemeral node, which makes me think the session had not
actually expired at that time. Yet for some reason ZooKeeper fired a
session-expired event, which subsequently invoked ZKSessionExpireListener.
Has anyone encountered a similar issue before, and is there anything I can
do on the ZooKeeper side to prevent this?
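For clarity, here is my rough understanding of how the conflict handling
behaves, as a minimal Scala sketch. This is NOT the actual
kafka.utils.ZkUtils code; ZkLike, NodeExistsException, backoffMs, and the
timestamp-stripping ownership check are stand-ins I made up from the log
messages above.

// Hypothetical stand-ins for the ZooKeeper client and its "node exists"
// error; not the real zkclient / kafka.utils.ZkUtils types.
class NodeExistsException(path: String) extends RuntimeException(path)

trait ZkLike {
  def createEphemeral(path: String, data: String): Unit  // may throw NodeExistsException
  def readData(path: String): String
}

object EphemeralNodeSketch {
  val backoffMs = 1000L  // assumed fixed backoff between retries

  // The logged payloads differ only in the "timestamp" field, so I assume
  // the ownership check compares data with the timestamp stripped.
  private def stripTimestamp(s: String): String =
    s.replaceAll("\"timestamp\":\"\\d+\"", "")

  def createEphemeralPathExpectConflictHandleZKBug(zk: ZkLike, path: String, data: String): Unit = {
    while (true) {  // today: loops forever until the node is created
      try {
        zk.createEphemeral(path, data)
        return
      } catch {
        case e: NodeExistsException =>
          val stored = zk.readData(path)
          if (stripTimestamp(stored) == stripTimestamp(data)) {
            // We wrote this node in a previous session and ZooKeeper has not
            // yet deleted it, so back off and retry -- exactly the message
            // that filled my consumer log for ~8 hours.
            Thread.sleep(backoffMs)
          } else {
            // A different consumer owns the path: genuine conflict.
            throw e
          }
      }
    }
  }
}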

Another problem is that the createEphemeralPathExpectConflictHandleZKBug
call is wrapped in a while(true) loop that runs forever until the ephemeral
node is created. Would it be better to employ an exponential retry policy
with a maximum number of retries, so that the exception can be re-thrown to
the caller and handled there in situations like the one above?
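Concretely, something like the following is what I have in mind. Again just
a sketch, reusing the hypothetical ZkLike / NodeExistsException stand-ins
from the previous snippet; maxRetries, baseBackoffMs, and the 30s cap are
made-up defaults, not an existing Kafka API.

object BoundedRetrySketch {
  // Bounded exponential backoff instead of while(true); once maxRetries is
  // exhausted, the exception is re-thrown so the caller can decide what to
  // do (e.g. rebuild the session or trigger a rebalance).
  def createEphemeralWithBoundedRetry(zk: ZkLike,
                                      path: String,
                                      data: String,
                                      maxRetries: Int = 10,
                                      baseBackoffMs: Long = 500L): Unit = {
    var attempt = 0
    while (true) {
      try {
        zk.createEphemeral(path, data)
        return
      } catch {
        case e: NodeExistsException =>
          attempt += 1
          if (attempt > maxRetries)
            throw e  // give the caller a chance to handle situations like mine
          // Exponential backoff, capped at 30s, so we neither spin tightly
          // nor block forever on a node ZooKeeper never deletes.
          val sleepMs = math.min(baseBackoffMs * (1L << (attempt - 1)), 30000L)
          Thread.sleep(sleepMs)
      }
    }
  }
}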
