RE: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-08-05 Thread Hargett, Phil
…l#comment-13729537 From: Hargett, Phil Sent: Friday, August 02, 2013 1:36 PM To: Jun Rao Cc: users@kafka.apache.org Subject: RE: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress) I…

RE: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-08-02 Thread Hargett, Phil
….com] Sent: Wednesday, July 31, 2013 12:16 AM To: Hargett, Phil Cc: users@kafka.apache.org Subject: Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress) Hmm, that's a good theory. My understanding is that you have one thread…

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-30 Thread Jun Rao
Hmm, that's a good theory. My understanding is that you have one thread that first shuts down the consumer connector and then creates new streams on the same connector. Is that right? If so, I don't think the race condition can happen. When we shut down the consumer connector, it waits until the leader…
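[Editor's note: for reference, below is a minimal sketch of the consumer lifecycle being discussed, assuming the Kafka 0.8 high-level consumer Java API (kafka.javaapi.consumer.ConsumerConnector). The topic name, group id, and ZooKeeper address are placeholders, and this is not the poster's actual code; it just illustrates shutting a connector down and then resuming consumption with a fresh connector instead of reusing the shut-down one.]

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ConsumerLifecycle {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zkhost:2181");   // placeholder
        props.put("group.id", "example-group");          // placeholder

        // First connector: create streams, consume, then shut down.
        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("example-topic", 1));
        // ... consume from streams ...
        connector.shutdown();

        // After shutdown, create a *new* connector rather than reusing the old
        // one: shutdown() stops the fetchers and the leader-finder machinery,
        // so a shut-down connector is not expected to hand out usable streams.
        connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        streams = connector.createMessageStreams(Collections.singletonMap("example-topic", 1));
        // ... consume again, and shut down when done ...
        connector.shutdown();
    }
}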

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-30 Thread Hargett, Phil
Hmmm... is there a reason that stopConnections in ConsumerFetcherManager does not grab a lock before shutting down the leaderFinderThread? I don't see what prevents startConnections/stopConnections from racing under certain conditions if they are called on separate threads. Given there are no locks…
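[Editor's note: as a generic illustration of the race being described, and not Kafka's actual ConsumerFetcherManager code, the sketch below shows start/stop of a hypothetical background leader-finder thread guarded by a shared lock. Without the lock in stopConnections, a concurrent startConnections could interleave and leave a thread running. Class and method names are illustrative only.]

import java.util.concurrent.locks.ReentrantLock;

public class LeaderFinderLifecycle {
    private final ReentrantLock lock = new ReentrantLock();
    private Thread leaderFinderThread;   // hypothetical background thread
    private volatile boolean running;

    public void startConnections() {
        lock.lock();
        try {
            if (leaderFinderThread != null) return;  // already started
            running = true;
            leaderFinderThread = new Thread(() -> {
                while (running) {
                    // ... find leaders and add fetchers ...
                    try { Thread.sleep(200); } catch (InterruptedException e) { return; }
                }
            }, "leader-finder");
            leaderFinderThread.start();
        } finally {
            lock.unlock();
        }
    }

    public void stopConnections() {
        lock.lock();   // without this, a concurrent startConnections could race with the teardown below
        try {
            running = false;
            if (leaderFinderThread != null) {
                leaderFinderThread.interrupt();
                try { leaderFinderThread.join(); } catch (InterruptedException ignored) { }
                leaderFinderThread = null;
            }
        } finally {
            lock.unlock();
        }
    }
}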

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-30 Thread Hargett, Phil
Oh, we're building from source multiple times per week, either until 0.8 comes out of beta or we ourselves slide towards production. :) Depending on where the builds were done (Dev vs official), we have commits 76d3905 or b1891e7. Both are more recent than beta 1, I believe. :) On Jul 30, 2013…

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-30 Thread Jun Rao
What's the revision of the 0.8 branch that you used? If that's older than the beta1 release, I recommend that you upgrade. Thanks, Jun On Tue, Jul 30, 2013 at 3:09 AM, Hargett, Phil <phil.harg...@mirror-image.com> wrote: > No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope not)…

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-30 Thread Hargett, Phil
No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope not). I had my consumer open for 90 secs in this case before shutting it down and disposing of it; hence any races caused by fast startup/shutdown should not have been relevant. I build from source off of the 0.8 branch, so i…

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-29 Thread Jun Rao
Hmm, it takes 90 secs to connect to ZK? That seems way too long. Is your ZK healthy? Also, are you on the 0.8 beta1 release? If not, could you try that one? It may not be related, but we did fix some consumer-side deadlock issues there. Thanks, Jun On Mon, Jul 29, 2013 at 9:02 AM, Hargett, Phil…

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-29 Thread Hargett, Phil
Why would a consumer that has been shut down still be rebalancing? Zookeeper session timeout (zookeeper.session.timeout.ms) is 1000 ms and sync time (zookeeper.sync.timeout.ms) is 500 ms. Also, the timeout for the thread that looks for the leader is left at the default 200 milliseconds (refresh.leader.backoff.ms)…
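[Editor's note: below is a hedged sketch of the settings reported in this message, assuming the 0.8 high-level consumer configuration keys. The message writes the sync setting as zookeeper.sync.timeout.ms; it is shown here as zookeeper.sync.time.ms, the name the 0.8 consumer documentation uses, so treat the exact key spelling as an assumption. The ZooKeeper address and group id are placeholders.]

import java.util.Properties;

import kafka.consumer.ConsumerConfig;

public class ReportedConsumerSettings {
    public static ConsumerConfig build() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zkhost:2181");        // placeholder
        props.put("group.id", "example-group");               // placeholder
        props.put("zookeeper.session.timeout.ms", "1000");    // value quoted in the message
        props.put("zookeeper.sync.time.ms", "500");           // "sync time is 500" in the message
        // The leader-refresh backoff is left at its default of 200 ms
        // (refresh.leader.backoff.ms), so it is not set explicitly here.
        return new ConsumerConfig(props);
    }
}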

Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

2013-07-28 Thread Jun Rao
OK. So it seems that the issue is that there are lots of rebalances in the consumer. What did you set the ZK session expiration time to? A typical reason for many rebalances is consumer-side GC. If so, you will see Zookeeper session expirations in the consumer log (grep for Expired). Occasional rebalances…
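[Editor's note: to make the "grep for Expired" suggestion concrete, here is a small illustration of the underlying event using the plain ZooKeeper Java client. The 0.8 consumer manages its own ZooKeeper session internally, so this is not consumer code, just a demonstration of the session-expiration state that shows up as "Expired" in logs; the server address is a placeholder.]

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ExpirationWatchExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // A short session timeout (like the 1000 ms mentioned earlier in the
        // thread) makes a long GC pause more likely to expire the session.
        int sessionTimeoutMs = 1000;
        ZooKeeper zk = new ZooKeeper("zkhost:2181", sessionTimeoutMs, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // A consumer-side GC pause long enough to miss heartbeats
                // produces exactly this state; in the high-level consumer a
                // rebalance follows once the session is re-established.
                if (event.getState() == Watcher.Event.KeeperState.Expired) {
                    System.out.println("ZooKeeper session Expired");
                }
            }
        });
        Thread.sleep(5000);  // keep the process alive briefly to observe events
        zk.close();
    }
}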