Re: topics stuck in "Leader: -1" after crash while migrating topics

Ismael Juma Fri, 28 Apr 2017 15:57:10 -0700

There are indeed some known issues in the Controller that require care to
avoid. Onur has recently contributed a PR that simplifies the concurrency
model of the Controller:


https://github.com/apache/kafka/commit/bb663d04febcadd4f120e0ff5c5919ca8bf7e971

This is a good first step and will be part of 0.11.0.0. The next step will
be to fix the session expiration issues. It's a non-trivial amount of work
so the current target is the feature release after 0.11.0.0.

Ismael

On Fri, Apr 28, 2017 at 8:30 PM, Michal Borowiecki <
michal.borowie...@openbet.com> wrote:

> Hi James,
>
> This "Cached zkVersion [x] not equal to that in zookeeper" issue bit us
> once in production and I found these tickets to be relevant:
> KAFKA-2729 <https://issues.apache.org/jira/browse/KAFKA-2729>
> KAFKA-3042 <https://issues.apache.org/jira/browse/KAFKA-3042>
> KAFKA-3083 <https://issues.apache.org/jira/browse/KAFKA-3083>
> Unfortunately, I don't believe there is a fix for it yet, or in the making.
>
> Thanks,
> Michał
>
>
> On 28/04/17 19:26, James Brown wrote:
>
> For what it's worth, shutting down the entire cluster and then restarting
> it did address this issue.
>
> I'd love anyone's thoughts on what the "correct" fix would be here.
>
> On Fri, Apr 28, 2017 at 10:58 AM, James Brown <jbr...@easypost.com> wrote:
>
>
> The following is also appearing in the logs a lot, if anyone has any ideas:
>
> INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jbr...@easypost.com> wrote:
>
>
> We're running 0.10.1.0 on a five-node cluster.
>
> I was in the process of migrating some topics from two replicas to
> three replicas when two of the five machines in this cluster crashed
> (brokers 2 and 3).
>
> After restarting them, all of the topics that were previously assigned to
> them are unavailable and showing "Leader: -1".
>
> Example kafka-topics output:
>
> % kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>
> Note that I wasn't even moving any of the __consumer_offsets partitions,
> so I'm not sure if the fact that a reassignment was in progress is a red
> herring or not.
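
To see which brokers the leaderless partitions have in common, the listing above can be tallied by replica. The following is a minimal sketch, not part of the original thread: it assumes the `kafka-topics.sh --describe` line format shown above, and the helper names `parse_describe` and `leaderless_by_broker` are hypothetical.

```python
import re
from collections import Counter

# Matches one partition line in the format shown in the listing above.
# Assumption: other Kafka versions may print additional fields.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(text):
    """Return one dict per partition line found in text (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip anything that is not a partition line
        rows.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": int(m["leader"]),
            "replicas": [int(b) for b in m["replicas"].split(",")],
            "isr": [int(b) for b in m["isr"].split(",") if b],
        })
    return rows


def leaderless_by_broker(rows):
    """Count, per broker id, the leaderless partitions that list it as a replica."""
    counts = Counter()
    for row in rows:
        if row["leader"] == -1:
            counts.update(row["replicas"])
    return counts


# Three lines copied from the listing above.
sample = """\
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
"""

rows = parse_describe(sample)
counts = leaderless_by_broker(rows)
for broker, n in sorted(counts.items()):
    print(f"broker {broker}: {n} leaderless partition(s)")
```

In the full listing above, broker 2 appears in the replica set of every leaderless partition, which a tally like this makes easy to spot. The sketch only summarizes the output and does not change any cluster state.
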
>
> The logs are full of
>
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
>
> What can I do to fix this? Should I manually reassign every partition
> that was led by broker 2 or 3 so that its replica set contains only the
> third broker from its current replica set? Do I need to temporarily
> enable unclean elections?
>
> I've never seen a cluster fail this way...
>
> --
> James Brown
> Engineer
>
>
>
> --
> Michal Borowiecki
> Senior Software Engineer L4
>
