[jira] [Commented] (KAFKA-7417) Some topics lost / cannot recover their ISR status following broker crash

2018-11-25 Thread Desmond Sindatry (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698170#comment-16698170 ]

Desmond Sindatry commented on KAFKA-7417:
-----------------------------------------

I am seeing the same issue, and it's not possible to add a new broker.

Out-of-sync replicas never come back into sync.

This is what appears in the log:

{code:java}
2018-11-24 22:33:18,008 INFO kafka.server.ReplicaFetcherThread: [ReplicaFetcher replicaId=99, leaderId=98, fetcherId=0] Based on follower's leader epoch, leader replied with an offset 128406503 >= the follower's log end offset 127527919 in prod-raw-events-11. No truncation needed.
2018-11-24 22:33:18,008 INFO kafka.log.Log: [Log partition=prod-raw-events-11, dir=/kafka/data/sdh] Truncating to 127527919 has no effect as the largest offset in the log is 127527918
{code}
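
For reference, the partitions stuck in this state can be listed directly with the topics tool (a minimal sketch, reusing the placeholder ZooKeeper address from the issue description below):

{code:java}
# Lists every partition whose ISR is currently smaller than its full replica set.
kafka-topics --zookeeper 1.2.3.4:8181 --describe --under-replicated-partitions
{code}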

> Some topics lost / cannot recover their ISR status following broker crash
> --------------------------------------------------------------------------
>
> Key: KAFKA-7417
> URL: https://issues.apache.org/jira/browse/KAFKA-7417
> Project: Kafka
> Issue Type: Bug
> Components: replication
> Affects Versions: 1.1.1, 2.0.0
> Reporter: Mikhail Khomenko
> Priority: Major
>
> Hi,
> we have faced the following issue: some replicas cannot become in-sync.
> The distribution of in-sync replicas across topics appears random. For instance:
> {code:java}
> $ kafka-topics --zookeeper 1.2.3.4:8181 --describe --topic TEST
> Topic:TEST PartitionCount:8 ReplicationFactor:3 Configs:
> Topic: TEST Partition: 0 Leader: 2 Replicas: 0,2,1 Isr: 0,1,2
> Topic: TEST Partition: 1 Leader: 1 Replicas: 1,0,2 Isr: 0,1,2
> Topic: TEST Partition: 2 Leader: 2 Replicas: 2,1,0 Isr: 0,1,2
> Topic: TEST Partition: 3 Leader: 2 Replicas: 0,1,2 Isr: 0,1,2
> Topic: TEST Partition: 4 Leader: 1 Replicas: 1,2,0 Isr: 0,1,2
> Topic: TEST Partition: 5 Leader: 2 Replicas: 2,0,1 Isr: 0,1,2
> Topic: TEST Partition: 6 Leader: 0 Replicas: 0,2,1 Isr: 0,1,2
> Topic: TEST Partition: 7 Leader: 0 Replicas: 1,0,2 Isr: 0,2{code}
> The segment files of TEST-7 are identical (same md5sum) on all 3 brokers; they
> were also checked with kafka.tools.DumpLogSegments - the messages are the same.
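> A sketch of that check (the segment base offset below is a placeholder, not a
> real file name):
> {code:java}
> # Dump the same segment on each broker and compare the decoded records.
> kafka-run-class kafka.tools.DumpLogSegments \
>   --files /var/lib/kafka/data/TEST-7/00000000000000000000.log \
>   --print-data-log --deep-iteration
> {code}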
> We have a 3-broker cluster running Confluent Kafka 5.0.0 (i.e. Apache Kafka 2.0.0).
> Each broker has the following configuration:
> {code:java}
> advertised.host.name = null
> advertised.listeners = PLAINTEXT://1.2.3.4:9200
> advertised.port = null
> alter.config.policy.class.name = null
> alter.log.dirs.replication.quota.window.num = 11
> alter.log.dirs.replication.quota.window.size.seconds = 1
> authorizer.class.name = 
> auto.create.topics.enable = true
> auto.leader.rebalance.enable = true
> background.threads = 10
> broker.id = 1
> broker.id.generation.enable = true
> broker.interceptor.class = class org.apache.kafka.server.interceptor.DefaultBrokerInterceptor
> broker.rack = null
> client.quota.callback.class = null
> compression.type = producer
> connections.max.idle.ms = 600000
> controlled.shutdown.enable = true
> controlled.shutdown.max.retries = 3
> controlled.shutdown.retry.backoff.ms = 5000
> controller.socket.timeout.ms = 30000
> create.topic.policy.class.name = null
> default.replication.factor = 3
> delegation.token.expiry.check.interval.ms = 3600000
> delegation.token.expiry.time.ms = 86400000
> delegation.token.master.key = null
> delegation.token.max.lifetime.ms = 604800000
> delete.records.purgatory.purge.interval.requests = 1
> delete.topic.enable = true
> fetch.purgatory.purge.interval.requests = 1000
> group.initial.rebalance.delay.ms = 3000
> group.max.session.timeout.ms = 300000
> group.min.session.timeout.ms = 6000
> host.name = 
> inter.broker.listener.name = null
> inter.broker.protocol.version = 2.0
> leader.imbalance.check.interval.seconds = 300
> leader.imbalance.per.broker.percentage = 10
> listener.security.protocol.map = PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> listeners = PLAINTEXT://0.0.0.0:9200
> log.cleaner.backoff.ms = 15000
> log.cleaner.dedupe.buffer.size = 134217728
> log.cleaner.delete.retention.ms = 86400000
> log.cleaner.enable = true
> log.cleaner.io.buffer.load.factor = 0.9
> log.cleaner.io.buffer.size = 524288
> log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
> log.cleaner.min.cleanable.ratio = 0.5
> log.cleaner.min.compaction.lag.ms = 0
> log.cleaner.threads = 1
> log.cleanup.policy = [delete]
> log.dir = /tmp/kafka-logs
> log.dirs = /var/lib/kafka/data
> log.flush.interval.messages = 9223372036854775807
> log.flush.interval.ms = null
> log.flush.offset.checkpoint.interval.ms = 60000
> log.flush.scheduler.interval.ms = 9223372036854775807
> log.flush.start.offset.checkpoint.interval.ms = 60000
> log.index.interval.bytes = 4096
> log.index.size.max.bytes = 10485760
> log.message.downconversion.enable = true
> log.message.format.version = 2.0
> log.message.timestamp.difference.max.ms = 9223372036854775807
> log.message.timestamp.type = CreateTime
> log.preallocate = false
> log.retention.bytes = -1

[jira] [Commented] (KAFKA-7417) Some topics lost / cannot recover their ISR status following broker crash

2018-09-28 Thread Mikhail Khomenko (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631542#comment-16631542 ]

Mikhail Khomenko commented on KAFKA-7417:
-----------------------------------------

For now, this issue was fixed by adding a new broker to the cluster:
 * a new broker was started
 * all partitions were then manually rebalanced (following this manual: [https://svn.apache.org/repos/asf/kafka/site/082/ops.html]); see the sketch below
 * there are currently NO under-replicated partitions
 * statistics of topics/partitions: topics = 1113, partitions = 8981

Note: during the rebalancing, RF temporarily became 4, then automatically returned to RF=3.
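
The rebalance follows the standard partition-reassignment flow described in that ops guide; a minimal sketch, assuming the new broker got id 3 (the JSON file names, broker ids and ZooKeeper address are placeholders):

{code:java}
# 1. Generate a candidate assignment that spreads the topics listed in
#    topics.json across the old brokers and the newly added one.
kafka-reassign-partitions --zookeeper 1.2.3.4:8181 \
  --topics-to-move-json-file topics.json --broker-list "0,1,2,3" --generate

# 2. Save the proposed assignment as reassignment.json and execute it.
kafka-reassign-partitions --zookeeper 1.2.3.4:8181 \
  --reassignment-json-file reassignment.json --execute

# 3. Verify progress. While a partition is being moved it carries both the old
#    and the new replicas, which is why RF temporarily showed as 4 above.
kafka-reassign-partitions --zookeeper 1.2.3.4:8181 \
  --reassignment-json-file reassignment.json --verify
{code}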
