We have fixed many bugs since August. Since we are about to release 0.9.0 (with SSL!), you may want to wait a day and go with a released and tested version.
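In the meantime, once the broker that dropped out of the ISR (broker 0 in your log) is back up and caught up, triggering a preferred replica election for the stuck partitions usually gets a leader re-elected. Roughly, with the stock CLI scripts (the ZooKeeper address below is a placeholder; point it at your own ensemble):

  # show partitions with no available leader, and under-replicated partitions
  bin/kafka-topics.sh --describe --zookeeper zk1:2181 --unavailable-partitions
  bin/kafka-topics.sh --describe --zookeeper zk1:2181 --under-replicated-partitions

  # then elect the preferred replica for the affected partitions, e.g. with
  # election.json containing: {"partitions": [{"topic": "userlogs", "partition": 84}]}
  bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 --path-to-json-file election.json

If the preferred replica (broker 0) is still not alive or not back in the ISR, the election will keep failing with the same error, so check that broker's logs first.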
On Mon, Nov 23, 2015 at 3:01 PM, Qi Xu <shkir...@gmail.com> wrote:

> Forgot to mention that the Kafka version we're using is from August's
> trunk branch, which has the SSL support.
>
> Thanks again,
> Qi
>
> On Mon, Nov 23, 2015 at 2:29 PM, Qi Xu <shkir...@gmail.com> wrote:
>
>> Looping in another guy from our team.
>>
>> On Mon, Nov 23, 2015 at 2:26 PM, Qi Xu <shkir...@gmail.com> wrote:
>>
>>> Hi folks,
>>> We have a 10-node cluster with several topics. Each topic has about
>>> 256 partitions with a replication factor of 3. We have now run into an
>>> issue where, in some topics, a few partitions (< 10) have a leader of -1
>>> and each of them has only one in-sync replica.
>>>
>>> From the Kafka manager, here's the snapshot:
>>> [Kafka manager screenshots not included]
>>>
>>> Here's the state log:
>>> [2015-11-23 21:57:58,598] ERROR Controller 1 epoch 435499 initiated
>>> state change for partition [userlogs,84] from OnlinePartition to
>>> OnlinePartition failed (state.change.logger)
>>> kafka.common.StateChangeFailedException: encountered error while
>>> electing leader for partition [userlogs,84] due to: Preferred replica 0 for
>>> partition [userlogs,84] is either not alive or not in the isr. Current
>>> leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}].
>>> Caused by: kafka.common.StateChangeFailedException: Preferred replica 0
>>> for partition [userlogs,84] is either not alive or not in the isr. Current
>>> leader and ISR: [{"leader":-1,"leader_epoch":203,"isr":[1]}]
>>>
>>> My questions are:
>>> 1) How could this happen, and how can I fix it or work around it?
>>> 2) Are 256 partitions too many? We have about 200+ cores for the Spark
>>> streaming job.
>>>
>>> Thanks,
>>> Qi