Re: Got conflicted ephemeral node exception for several hours

2015-08-04 Thread Jaikiran Pai


I am on Kafka 0.8.2.1 (Java 8) and have run into this same issue, where 
the KafkaServer (broker) goes into an indefinite while loop writing out 
this message:


[2015-08-04 15:45:12,350] INFO conflict in /brokers/ids/0 data: 
{jmx_port:-1,timestamp:1438661432074,host:foo-bar,version:1,port:9092} 
stored data: 
{jmx_port:-1,timestamp:1438661429589,host:foo-bar,version:1,port:9092} 
(kafka.utils.ZkUtils$)
[2015-08-04 15:45:12,352] INFO I wrote this conflicted ephemeral node 
[{jmx_port:-1,timestamp:1438661432074,host:foo-bar,version:1,port:9092}] 
at /brokers/ids/0 a while back in a different session, hence I will 
backoff for this node to be deleted by Zookeeper and retry 
(kafka.utils.ZkUtils$)


The above two lines have been repeating every few seconds for the past 
20-odd hours on this broker, which has been rendered useless and is 
contributing to high CPU usage.


As a result the consumers have gone into a state where they no longer 
consume messages. Furthermore, this continuous looping has pushed the 
Kafka process to the top of the CPU usage. I understand that bouncing the 
consumer is an option and would probably fix this, but in our real 
production environments we won't be able to bounce the consumers. I 
currently have access to the logs (some of which have been pasted here). 
Is there any chance these logs help in narrowing down the issue and 
fixing the root cause? Can we also please add a maximum retry limit to 
this ZooKeeper node creation logic instead of going into an indefinite 
while loop?
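
Something along the lines of the following is what I have in mind. This is 
only a rough sketch of the idea, not the actual ZkUtils code; the helper 
name, the limits, and the by-name create block are made up here:

import scala.annotation.tailrec

object BoundedRetry {
  // Run `create` (e.g. an ephemeral znode creation) at most `maxRetries + 1`
  // times, backing off exponentially between attempts, and rethrow the last
  // failure instead of looping forever.
  def withBackoff[T](maxRetries: Int, initialBackoffMs: Long)(create: => T): T = {
    @tailrec
    def attempt(retriesLeft: Int, backoffMs: Long): T = {
      val result: Either[Exception, T] =
        try Right(create)
        catch { case e: Exception => Left(e) }
      result match {
        case Right(value) => value
        case Left(_) if retriesLeft > 0 =>
          Thread.sleep(backoffMs)                 // wait before the next attempt
          attempt(retriesLeft - 1, backoffMs * 2) // double the backoff each time
        case Left(e) => throw e                   // give up and let the caller handle it
      }
    }
    attempt(maxRetries, initialBackoffMs)
  }
}

Wrapping the ephemeral node creation in a helper like this would let the 
broker surface the error after a bounded number of attempts instead of 
logging the conflict forever.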


I have maintained the original timestamps in the logs to help narrow down 
the issue. The 1438661432074 (epoch milliseconds) in the log translates to 
Aug 03 2015 21:10:32 (PDT), and 1438661429589 translates to 
Aug 03 2015 21:10:30 (PDT). I have included the relevant part of the log 
snippet from the server.log of the broker (10.95.100.31) below.
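
For anyone double-checking those conversions, they can be reproduced with 
java.time; a small sketch:

import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimestampCheck extends App {
  val pdt = ZoneId.of("America/Los_Angeles")
  val fmt = DateTimeFormatter.ofPattern("MMM dd yyyy HH:mm:ss.SSS zzz")

  // The two timestamps from the conflict log lines above.
  Seq(1438661432074L, 1438661429589L).foreach { ms =>
    println(s"$ms -> ${Instant.ofEpochMilli(ms).atZone(pdt).format(fmt)}")
  }
  // 1438661432074 -> Aug 03 2015 21:10:32.074 PDT
  // 1438661429589 -> Aug 03 2015 21:10:29.589 PDT
}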



[2015-08-03 21:10:29,805] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)

java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:123)
at kafka.network.MultiSend.writeTo(Transmission.scala:101)
at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:231)
at kafka.network.Processor.write(SocketServer.scala:472)
at kafka.network.Processor.run(SocketServer.scala:342)
at java.lang.Thread.run(Thread.java:745)
[2015-08-03 21:10:29,938] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)

java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at kafka.utils.Utils$.read(Utils.scala:380)
at 
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)

at kafka.network.Processor.read(SocketServer.scala:444)
at kafka.network.Processor.run(SocketServer.scala:340)
at java.lang.Thread.run(Thread.java:745)
[2015-08-03 21:10:30,045] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)

java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:123)
at kafka.network.MultiSend.writeTo(Transmission.scala:101)
at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:231)
at kafka.network.Processor.write(SocketServer.scala:472)
at kafka.network.Processor.run(SocketServer.scala:342)
at java.lang.Thread.run(Thread.java:745)

... a lot more similar exceptions ...


[2015-08-03 21:10:31,304] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,397] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,399] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,445] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)


... bunch of similar logs as above ...

[2015-08-03 21:10:31,784] INFO [ReplicaFetcherManager on broker 0] 
Removed fetcher for partitions [partition list] 
(kafka.server.ReplicaFetcherManager)
[2015-08-03 21:10:31,860] INFO Closing socket connection to 
/10.95.100.31. 

Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread Jiangjie Qin
Hi Tao,

We see this error from time to time but did not think of it as a big
issue. Any reason it bothers you so much?
I'm not sure whether throwing an exception to the user in this case is good
handling or not. What are users supposed to do in that case other than
retry?

Thanks,

Jiangjie (Becket) Qin

On 7/12/15, 7:16 PM, tao xiao xiaotao...@gmail.com wrote:

We saw the error again in our cluster.  Anyone has the same issue before?

On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:

 Bump the thread. Any help would be appreciated.

 On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:

 Additional info
 Kafka version: 0.8.2.1
 zookeeper: 3.4.6

 On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:

 Hi team,

 I have 10 high-level consumers connecting to Kafka, and one of them kept
 complaining about a conflicted ephemeral node for about 8 hours. The log
 was filled with the exception below:

 [2015-07-07 14:03:51,615] INFO conflict in
 /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 
{version:1,subscription:{test.*:1},pattern:white_list,timest
amp:1436275631510}
 stored data:
 
{version:1,subscription:{test.*:1},pattern:white_list,timest
amp:1436275558570}
 (kafka.utils.ZkUtils$)
 [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 
[{version:1,subscription:{test.*:1},pattern:white_list,times
tamp:1436275631510}]
 at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
 different session, hence I will backoff for this node to be deleted by
 Zookeeper and retry (kafka.utils.ZkUtils$)

 In the meantime ZooKeeper reported the exception below for the same time
 span:

 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
 when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
 zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
 Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
Error:KeeperErrorCode
 = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

 In the end ZooKeeper timed out the session and the consumers triggered a
 rebalance.

 I know that the conflicted ephemeral node warning is there to handle a
 ZooKeeper bug where session expiration and ephemeral node deletion are not
 done atomically. But as indicated by the ZooKeeper log, ZooKeeper never got
 a chance to delete the ephemeral node, which made me think that the session
 was not expired at that time. Yet for some reason ZooKeeper fired a session
 expiration event, which subsequently invoked ZKSessionExpireListener. I was
 just wondering if anyone has ever encountered a similar issue before, and
 what I can do on the ZooKeeper side to prevent this?

 Another problem is that the createEphemeralPathExpectConflictHandleZKBug
 call is wrapped in a while(true) loop which runs forever until the
 ephemeral node is created. Would it be better to employ an exponential
 retry policy with a maximum number of retries, so that the exception can be
 re-thrown to the caller and the caller can handle situations like the
 above?





Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread Mayuresh Gharat
Bouncing the consumers should solve this issue in most cases.

Thanks,

Mayuresh

On Sun, Jul 12, 2015 at 8:21 PM, Jiangjie Qin j...@linkedin.com.invalid
wrote:

 Hi Tao,

 We see this error from time to time but did not think of this as a big
 issue. Any reason it bothers you much?
 I'm not sure if throwing exception to user on this exception is a good
 handling or not. What are user supposed to do in that case other than
 retry?

 Thanks,

 Jiangjie (Becket) Qin

 On 7/12/15, 7:16 PM, tao xiao xiaotao...@gmail.com wrote:

 We saw the error again in our cluster.  Anyone has the same issue before?
 
 On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:
 
  Bump the thread. Any help would be appreciated.
 
  On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:
 
  Additional info
  Kafka version: 0.8.2.1
  zookeeper: 3.4.6
 
  On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:
 
  Hi team,
 
  I have 10 high level consumers connecting to Kafka and one of them
 kept
  complaining conflicted ephemeral node for about 8 hours. The log was
  filled with below exception
 
  [2015-07-07 14:03:51,615] INFO conflict in
  /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
 amp:1436275631510}
  stored data:
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
 amp:1436275558570}
  (kafka.utils.ZkUtils$)
  [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 
 [{version:1,subscription:{test.*:1},pattern:white_list,times
 tamp:1436275631510}]
  at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
  different session, hence I will backoff for this node to be deleted by
  Zookeeper and retry (kafka.utils.ZkUtils$)
 
  In the meantime zookeeper reported below exception for the same time
 span
 
  2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
  cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
  when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
  zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
  Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
 Error:KeeperErrorCode
  = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
 
  At the end zookeeper timed out the session and consumers triggered
  rebalance.
 
  I know that conflicted ephemeral node warning is to handle a zookeeper
  bug that session expiration and ephemeral node deletion are not done
  atomically but as indicated from zookeeper log the zookeeper never
 got a
  chance to delete the ephemeral node which made me think that the
 session
  was not expired at that time. And for some reason zookeeper fired
 session
  expire event which subsequently invoked ZKSessionExpireListener.  I
 was
  just wondering if anyone have ever encountered similar issue before
 and
  what I can do at zookeeper side to prevent this?
 
  Another problem is that createEphemeralPathExpectConflictHandleZKBug
  call is wrapped in a while(true) loop which runs forever until the
  ephemeral node is created. Would it be better that we can employ an
  exponential retry policy with a max number of retries so that it has a
  chance to re-throw the exception back to caller and let caller handle
 it in
  situation like above?
 
 




-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread tao xiao
Restarting the consumers does fix the issue. But since the zk retry is wrapped
in an infinite loop, it doesn't give the consumer a chance to react until
someone notices and restarts it. The reason I suggest having a maximum retry
policy is that when the max retry count is reached it can invoke a custom
handler, into which I can then inject a restart call so that the consumer can
remedy itself without anyone's attention.
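
Roughly what I have in mind is something like the sketch below. The names 
are made up and this is not existing Kafka code, just an illustration of 
the handler idea:

// Invoked when the retry budget is exhausted; a consumer could register a
// handler here that, for example, triggers its own restart.
trait RetryExhaustedHandler {
  def onExhausted(lastError: Throwable): Unit
}

object RetryWithHandler {
  def run(maxRetries: Int, backoffMs: Long, handler: RetryExhaustedHandler)(op: => Unit): Unit = {
    var attempts = 0
    var lastError: Throwable = null
    var done = false
    while (!done && attempts <= maxRetries) {
      try { op; done = true }
      catch {
        case e: Exception =>
          lastError = e
          attempts += 1
          Thread.sleep(backoffMs * attempts) // back off a little longer after each failure
      }
    }
    // Instead of looping forever, hand control back to the application.
    if (!done) handler.onExhausted(lastError)
  }
}

The consumer could then register a handler whose onExhausted shuts down and 
re-creates its connector, so it recovers without anyone having to notice 
the stuck process.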

On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin j...@linkedin.com.invalid wrote:

 Hi Tao,

 We see this error from time to time but did not think of this as a big
 issue. Any reason it bothers you much?
 I'm not sure if throwing exception to user on this exception is a good
 handling or not. What are user supposed to do in that case other than
 retry?

 Thanks,

 Jiangjie (Becket) Qin

 On 7/12/15, 7:16 PM, tao xiao xiaotao...@gmail.com wrote:

 We saw the error again in our cluster.  Anyone has the same issue before?
 
 On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:
 
  Bump the thread. Any help would be appreciated.
 
  On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:
 
  Additional info
  Kafka version: 0.8.2.1
  zookeeper: 3.4.6
 
  On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:
 
  Hi team,
 
  I have 10 high level consumers connecting to Kafka and one of them
 kept
  complaining conflicted ephemeral node for about 8 hours. The log was
  filled with below exception
 
  [2015-07-07 14:03:51,615] INFO conflict in
  /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
 amp:1436275631510}
  stored data:
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
 amp:1436275558570}
  (kafka.utils.ZkUtils$)
  [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 
 [{version:1,subscription:{test.*:1},pattern:white_list,times
 tamp:1436275631510}]
  at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
  different session, hence I will backoff for this node to be deleted by
  Zookeeper and retry (kafka.utils.ZkUtils$)
 
  In the meantime zookeeper reported below exception for the same time
 span
 
  2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
  cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
  when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
  zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
  Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
 Error:KeeperErrorCode
  = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
 
  At the end zookeeper timed out the session and consumers triggered
  rebalance.
 
  I know that conflicted ephemeral node warning is to handle a zookeeper
  bug that session expiration and ephemeral node deletion are not done
  atomically but as indicated from zookeeper log the zookeeper never
 got a
  chance to delete the ephemeral node which made me think that the
 session
  was not expired at that time. And for some reason zookeeper fired
 session
  expire event which subsequently invoked ZKSessionExpireListener.  I
 was
  just wondering if anyone have ever encountered similar issue before
 and
  what I can do at zookeeper side to prevent this?
 
  Another problem is that createEphemeralPathExpectConflictHandleZKBug
  call is wrapped in a while(true) loop which runs forever until the
  ephemeral node is created. Would it be better that we can employ an
  exponential retry policy with a max number of retries so that it has a
  chance to re-throw the exception back to caller and let caller handle
 it in
  situation like above?
 
 




Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread tao xiao
We saw the error again in our cluster. Has anyone had the same issue before?

On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:

 Bump the thread. Any help would be appreciated.

 On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:

 Additional info
 Kafka version: 0.8.2.1
 zookeeper: 3.4.6

 On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:

 Hi team,

 I have 10 high-level consumers connecting to Kafka, and one of them kept
 complaining about a conflicted ephemeral node for about 8 hours. The log
 was filled with the exception below:

 [2015-07-07 14:03:51,615] INFO conflict in
 /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}
 stored data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275558570}
 (kafka.utils.ZkUtils$)
 [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 [{version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}]
 at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
 different session, hence I will backoff for this node to be deleted by
 Zookeeper and retry (kafka.utils.ZkUtils$)

 In the meantime ZooKeeper reported the exception below for the same time span:

 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
 when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
 zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
 Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
 = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

 In the end ZooKeeper timed out the session and the consumers triggered a
 rebalance.

 I know that the conflicted ephemeral node warning is there to handle a
 ZooKeeper bug where session expiration and ephemeral node deletion are not
 done atomically. But as indicated by the ZooKeeper log, ZooKeeper never got
 a chance to delete the ephemeral node, which made me think that the session
 was not expired at that time. Yet for some reason ZooKeeper fired a session
 expiration event, which subsequently invoked ZKSessionExpireListener. I was
 just wondering if anyone has ever encountered a similar issue before, and
 what I can do on the ZooKeeper side to prevent this?

 Another problem is that the createEphemeralPathExpectConflictHandleZKBug
 call is wrapped in a while(true) loop which runs forever until the
 ephemeral node is created. Would it be better to employ an exponential
 retry policy with a maximum number of retries, so that the exception can be
 re-thrown to the caller and the caller can handle situations like the
 above?




Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread Mayuresh Gharat
That would solve this, but it looks like a workaround. We need to check
why exactly this happens and get to the root cause. What do you think?
Getting to the root cause of this might be really useful.

Thanks,

Mayuresh

On Sun, Jul 12, 2015 at 8:45 PM, tao xiao xiaotao...@gmail.com wrote:

 Restarting the consumers does fix the issue. But since the zk retry is wrapped
 in an infinite loop it doesn't give a chance to consumer to respond it
 until some one notices and restarts. Why I suggest to have a maximum retry
 policy is if max retry is reached it can invoke a custom handler which I
 can then inject a restart call so that it can remedy itself without
 people's attention.

 On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin j...@linkedin.com.invalid
 wrote:

  Hi Tao,
 
  We see this error from time to time but did not think of this as a big
  issue. Any reason it bothers you much?
  I'm not sure if throwing exception to user on this exception is a good
  handling or not. What are user supposed to do in that case other than
  retry?
 
  Thanks,
 
  Jiangjie (Becket) Qin
 
  On 7/12/15, 7:16 PM, tao xiao xiaotao...@gmail.com wrote:
 
  We saw the error again in our cluster.  Anyone has the same issue
 before?
  
  On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:
  
   Bump the thread. Any help would be appreciated.
  
   On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:
  
   Additional info
   Kafka version: 0.8.2.1
   zookeeper: 3.4.6
  
   On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:
  
   Hi team,
  
   I have 10 high level consumers connecting to Kafka and one of them
  kept
   complaining conflicted ephemeral node for about 8 hours. The log
 was
   filled with below exception
  
   [2015-07-07 14:03:51,615] INFO conflict in
   /consumers/group/ids/test-1435856975563-9a9fdc6c data:
  
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
  amp:1436275631510}
   stored data:
  
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
  amp:1436275558570}
   (kafka.utils.ZkUtils$)
   [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral
 node
  
 
 [{version:1,subscription:{test.*:1},pattern:white_list,times
  tamp:1436275631510}]
   at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in
 a
   different session, hence I will backoff for this node to be deleted
 by
   Zookeeper and retry (kafka.utils.ZkUtils$)
  
   In the meantime zookeeper reported below exception for the same time
  span
  
   2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
   cport:-1)::PrepRequestProcessor@645] - Got user-level
 KeeperException
   when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
   zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
   Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
  Error:KeeperErrorCode
   = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
  
   At the end zookeeper timed out the session and consumers triggered
   rebalance.
  
   I know that conflicted ephemeral node warning is to handle a
 zookeeper
   bug that session expiration and ephemeral node deletion are not done
   atomically but as indicated from zookeeper log the zookeeper never
  got a
   chance to delete the ephemeral node which made me think that the
  session
   was not expired at that time. And for some reason zookeeper fired
  session
   expire event which subsequently invoked ZKSessionExpireListener.  I
  was
   just wondering if anyone have ever encountered similar issue before
  and
   what I can do at zookeeper side to prevent this?
  
   Another problem is that createEphemeralPathExpectConflictHandleZKBug
   call is wrapped in a while(true) loop which runs forever until the
   ephemeral node is created. Would it be better that we can employ an
   exponential retry policy with a max number of retries so that it
 has a
   chance to re-throw the exception back to caller and let caller
 handle
  it in
   situation like above?
  
  
 
 




-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: Got conflicted ephemeral node exception for several hours

2015-07-12 Thread tao xiao
Finding the root cause is definitely something we should do, and I hope the
new Java consumer API will not have such issues anymore. Here is my
observation of the issue: the zk retry was caused by a session timeout. In
theory the ephemeral node should be removed after the session is terminated
by zk, but in fact the node was still present after the session was gone. I
saw lots of slow-fsync warnings in the zk log like the one below, but I am
not sure if the warning has anything to do with the error I encountered.

2015-07-12 22:01:21,195 [myid:2] - WARN  [SyncThread:2:FileTxnLog@334] -
fsync-ing the write ahead log in SyncThread:2 took 1023ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide


I know that at LinkedIn ZK is installed on SSDs; does that help with this?
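
As a side note, one way to compare latencies (for example before and after 
moving the ZooKeeper transaction log to faster disks) is the mntr 
four-letter command, which 3.4.6 supports. A rough sketch, assuming the 
server is reachable on localhost:2181:

import java.net.Socket
import scala.io.Source

object ZkLatencyCheck extends App {
  // Hypothetical host/port; point this at one of the ensemble members.
  val socket = new Socket("localhost", 2181)
  try {
    socket.getOutputStream.write("mntr\n".getBytes("UTF-8"))
    socket.getOutputStream.flush()
    val reply = Source.fromInputStream(socket.getInputStream, "UTF-8").mkString
    // Keep only the request-latency counters (zk_min/avg/max_latency).
    reply.split("\n").filter(_.contains("latency")).foreach(println)
  } finally socket.close()
}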

On Mon, 13 Jul 2015 at 12:50 Mayuresh Gharat gharatmayures...@gmail.com
wrote:

 That would solve this. But it looks like a work around. We need to check
 why this happens exactly and get to the root cause. What do you think?
 Getting to the root cause of this might be really useful.

 Thanks,

 Mayuresh

 On Sun, Jul 12, 2015 at 8:45 PM, tao xiao xiaotao...@gmail.com wrote:

  Restarting the consumers does fix the issue. But since the zk retry is
 wrapped
  in an infinite loop it doesn't give a chance to consumer to respond it
  until some one notices and restarts. Why I suggest to have a maximum
 retry
  policy is if max retry is reached it can invoke a custom handler which
 I
  can then inject a restart call so that it can remedy itself without
  people's attention.
 
  On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin j...@linkedin.com.invalid
  wrote:
 
   Hi Tao,
  
   We see this error from time to time but did not think of this as a big
   issue. Any reason it bothers you much?
   I'm not sure if throwing exception to user on this exception is a good
   handling or not. What are user supposed to do in that case other than
   retry?
  
   Thanks,
  
   Jiangjie (Becket) Qin
  
   On 7/12/15, 7:16 PM, tao xiao xiaotao...@gmail.com wrote:
  
   We saw the error again in our cluster.  Anyone has the same issue
  before?
   
   On Fri, 10 Jul 2015 at 13:26 tao xiao xiaotao...@gmail.com wrote:
   
Bump the thread. Any help would be appreciated.
   
On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:
   
Additional info
Kafka version: 0.8.2.1
zookeeper: 3.4.6
   
On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:
   
Hi team,
   
I have 10 high level consumers connecting to Kafka and one of them
   kept
complaining conflicted ephemeral node for about 8 hours. The log
  was
filled with below exception
   
[2015-07-07 14:03:51,615] INFO conflict in
/consumers/group/ids/test-1435856975563-9a9fdc6c data:
   
  
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
   amp:1436275631510}
stored data:
   
  
 
 {version:1,subscription:{test.*:1},pattern:white_list,timest
   amp:1436275558570}
(kafka.utils.ZkUtils$)
[2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral
  node
   
  
 
 [{version:1,subscription:{test.*:1},pattern:white_list,times
   tamp:1436275631510}]
at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back
 in
  a
different session, hence I will backoff for this node to be
 deleted
  by
Zookeeper and retry (kafka.utils.ZkUtils$)
   
In the meantime zookeeper reported below exception for the same
 time
   span
   
2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@645] - Got user-level
  KeeperException
when processing sessionid:0x44e657ff19c0019 type:create
 cxid:0x7a26
zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
   Error:KeeperErrorCode
= NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
   
At the end zookeeper timed out the session and consumers triggered
rebalance.
   
I know that conflicted ephemeral node warning is to handle a
  zookeeper
bug that session expiration and ephemeral node deletion are not
 done
atomically but as indicated from zookeeper log the zookeeper never
   got a
chance to delete the ephemeral node which made me think that the
   session
was not expired at that time. And for some reason zookeeper fired
   session
expire event which subsequently invoked ZKSessionExpireListener.
 I
   was
just wondering if anyone have ever encountered similar issue
 before
   and
what I can do at zookeeper side to prevent this?
   
Another problem is that
 createEphemeralPathExpectConflictHandleZKBug
call is wrapped in a while(true) loop which runs forever until the
ephemeral node is created. Would it be better that we can employ
 an
exponential retry policy with a max number of retries so that it
  has a
chance to re-throw the exception back to caller and let caller
  handle
   it in
situation 

Re: Got conflicted ephemeral node exception for several hours

2015-07-09 Thread tao xiao
Bump the thread. Any help would be appreciated.

On Wed, 8 Jul 2015 at 20:09 tao xiao xiaotao...@gmail.com wrote:

 Additional info
 Kafka version: 0.8.2.1
 zookeeper: 3.4.6

 On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:

 Hi team,

 I have 10 high-level consumers connecting to Kafka, and one of them kept
 complaining about a conflicted ephemeral node for about 8 hours. The log
 was filled with the exception below:

 [2015-07-07 14:03:51,615] INFO conflict in
 /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}
 stored data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275558570}
 (kafka.utils.ZkUtils$)
 [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 [{version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}]
 at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
 different session, hence I will backoff for this node to be deleted by
 Zookeeper and retry (kafka.utils.ZkUtils$)

 In the meantime ZooKeeper reported the exception below for the same time span:

 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
 when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
 zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
 Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
 = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

 In the end ZooKeeper timed out the session and the consumers triggered a
 rebalance.

 I know that the conflicted ephemeral node warning is there to handle a
 ZooKeeper bug where session expiration and ephemeral node deletion are not
 done atomically. But as indicated by the ZooKeeper log, ZooKeeper never got
 a chance to delete the ephemeral node, which made me think that the session
 was not expired at that time. Yet for some reason ZooKeeper fired a session
 expiration event, which subsequently invoked ZKSessionExpireListener. I was
 just wondering if anyone has ever encountered a similar issue before, and
 what I can do on the ZooKeeper side to prevent this?

 Another problem is that the createEphemeralPathExpectConflictHandleZKBug
 call is wrapped in a while(true) loop which runs forever until the
 ephemeral node is created. Would it be better to employ an exponential
 retry policy with a maximum number of retries, so that the exception can be
 re-thrown to the caller and the caller can handle situations like the
 above?




Re: Got conflicted ephemeral node exception for several hours

2015-07-08 Thread tao xiao
Additional info
Kafka version: 0.8.2.1
zookeeper: 3.4.6

On Wed, 8 Jul 2015 at 20:07 tao xiao xiaotao...@gmail.com wrote:

 Hi team,

 I have 10 high-level consumers connecting to Kafka, and one of them kept
 complaining about a conflicted ephemeral node for about 8 hours. The log
 was filled with the exception below:

 [2015-07-07 14:03:51,615] INFO conflict in
 /consumers/group/ids/test-1435856975563-9a9fdc6c data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}
 stored data:
 {version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275558570}
 (kafka.utils.ZkUtils$)
 [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
 [{version:1,subscription:{test.*:1},pattern:white_list,timestamp:1436275631510}]
 at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
 different session, hence I will backoff for this node to be deleted by
 Zookeeper and retry (kafka.utils.ZkUtils$)

 In the meantime ZooKeeper reported the exception below for the same time span:

 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
 when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
 zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
 Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
 = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

 In the end ZooKeeper timed out the session and the consumers triggered a
 rebalance.

 I know that the conflicted ephemeral node warning is there to handle a
 ZooKeeper bug where session expiration and ephemeral node deletion are not
 done atomically. But as indicated by the ZooKeeper log, ZooKeeper never got
 a chance to delete the ephemeral node, which made me think that the session
 was not expired at that time. Yet for some reason ZooKeeper fired a session
 expiration event, which subsequently invoked ZKSessionExpireListener. I was
 just wondering if anyone has ever encountered a similar issue before, and
 what I can do on the ZooKeeper side to prevent this?

 Another problem is that the createEphemeralPathExpectConflictHandleZKBug
 call is wrapped in a while(true) loop which runs forever until the
 ephemeral node is created. Would it be better to employ an exponential
 retry policy with a maximum number of retries, so that the exception can be
 re-thrown to the caller and the caller can handle situations like the
 above?