Re: Unavailable Partitions and Uneven ISR
Have you verified that the old leader of the partition is using the same ID as before? Check /brokers/ids in ZooKeeper to get a list of available brokers. I would use the reassignment tool to move partition 3 to brokers in that list (specifying only 3 brokers). Make sure to include the broker with ID 1, since it was the only ISR in the list before the restart. I would start there. A sketch of the commands is below the quoted message.

On Jun 1, 2016 11:40 AM, "Tushar Agrawal" wrote:
> Hi,
>
> We have 5 brokers running on 0.9.0.1 with 5 ZK. This morning, multiple
> topics were having "unavailable-partitions" (whose leader is not
> available). After looking at multiple logs, forums and Google results, we
> finally restarted all the brokers one by one and the issue seems to be
> resolved.
>
> However, for that particular partition we now have five in-sync replicas
> instead of three. What should we do to fix this issue?
>
> *Before restart*
>
> Topic:topic1 PartitionCount:8 ReplicationFactor:3 Configs:retention.ms=25920
> Topic: topic1 Partition: 0 Leader: 0 Replicas: 0,3,4 Isr: 0,3,4
> Topic: topic1 Partition: 1 Leader: 1 Replicas: 1,4,0 Isr: 0,1,4
> Topic: topic1 Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 1,2,0
> Topic: topic1 Partition: 3 Leader: -1 Replicas: 0,1,2,3,4 Isr: 1
> Topic: topic1 Partition: 4 Leader: 4 Replicas: 4,2,3 Isr: 4,3,2
> Topic: topic1 Partition: 5 Leader: 1 Replicas: 0,4,1 Isr: 1,4,0
> Topic: topic1 Partition: 6 Leader: 2 Replicas: 1,0,2 Isr: 2,0,1
> Topic: topic1 Partition: 7 Leader: 3 Replicas: 2,1,3 Isr: 3,2,1
>
> *After restart*
>
> Topic:topic1 PartitionCount:8 ReplicationFactor:3 Configs:retention.ms=25920
> Topic: topic1 Partition: 0 Leader: 0 Replicas: 0,3,4 Isr: 0,3,4
> Topic: topic1 Partition: 1 Leader: 1 Replicas: 1,4,0 Isr: 0,1,4
> Topic: topic1 Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
> Topic: topic1 Partition: 3 Leader: 0 Replicas: 0,1,2,3,4 Isr: 0,1,2,3,4
> Topic: topic1 Partition: 4 Leader: 4 Replicas: 4,2,3 Isr: 3,4,2
> Topic: topic1 Partition: 5 Leader: 0 Replicas: 0,4,1 Isr: 0,1,4
> Topic: topic1 Partition: 6 Leader: 1 Replicas: 1,0,2 Isr: 0,1,2
> Topic: topic1 Partition: 7 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
>
> Thank you,
> Tushar
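For reference, a minimal sketch of the suggested reassignment, assuming the live broker IDs are the ones listed under /brokers/ids and keeping broker 1 in the replica set. The ZooKeeper hostname and the chosen replica list are placeholders:

    # List live broker IDs registered in ZooKeeper.
    bin/zookeeper-shell.sh zkhost:2181 ls /brokers/ids

    # Reassignment JSON shrinking partition 3 back to three replicas;
    # the replica list is illustrative, but it must include broker 1.
    cat > shrink-partition-3.json <<'EOF'
    {
      "version": 1,
      "partitions": [
        { "topic": "topic1", "partition": 3, "replicas": [1, 2, 3] }
      ]
    }
    EOF

    bin/kafka-reassign-partitions.sh --zookeeper zkhost:2181 \
      --reassignment-json-file shrink-partition-3.json --execute

    # Re-run with --verify until the reassignment reports completion.
    bin/kafka-reassign-partitions.sh --zookeeper zkhost:2181 \
      --reassignment-json-file shrink-partition-3.json --verify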
Re: broker randomly shuts down
What about in dmesg? I have run into this issue and it was the OOM killer. I also ran into a heap issue from using too much JVM direct memory; reducing the fetcher threads helped with that problem.

On Jun 2, 2016 12:19 PM, "allen chan" wrote:
> Hi Tom,
>
> That is one of the first things that I checked. Active memory never goes
> above 50% of overall available. File cache uses the rest of the memory, but
> I do not think that causes the OOM killer.
> Either way, there are no entries in /var/log/messages (CentOS) to show OOM
> is happening.
>
> Thanks
>
> On Thu, Jun 2, 2016 at 5:36 AM, Tom Crayford wrote:
> > That looks like somebody is killing the process. I'd suspect either the
> > Linux OOM killer or something else automatically killing the JVM for
> > some reason.
> >
> > For the OOM killer, assuming you're on Ubuntu, it's pretty easy to find
> > in /var/log/syslog (depending on your setup). I don't know about other
> > operating systems.
> >
> > On Thu, Jun 2, 2016 at 5:54 AM, allen chan wrote:
> > > I have an issue where my brokers randomly shut themselves down.
> > > I turned on debug in log4j.properties but still do not see a reason
> > > why the shutdown is happening.
> > >
> > > Anyone seen this behavior before?
> > >
> > > version 0.10.0
> > > log4j.properties
> > > log4j.rootLogger=DEBUG, kafkaAppender
> > > * I tried TRACE level but I do not see any additional log messages
> > >
> > > snippet of log around shutdown
> > > [2016-06-01 15:11:51,374] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:53,376] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:55,377] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:57,380] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:59,383] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:01,386] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:03,389] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:04,121] INFO [Group Metadata Manager on Broker 2]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> > > [2016-06-01 15:12:04,121] INFO [Group Metadata Manager on Broker 2]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> > > [2016-06-01 15:12:05,390] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:07,393] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:09,396] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:11,399] DEBUG Got ping response for sessionid: 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:13,334] INFO [Kafka Server 2], shutting down (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,334] INFO [Kafka Server 2], shutting down (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,336] INFO [Kafka Server 2], Starting controlled shutdown (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,336] INFO [Kafka Server 2], Starting controlled shutdown (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name connections-closed: (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name connections-created: (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name bytes-sent-received: (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name bytes-sent: (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,339] DEBUG Added sensor with name bytes-received: (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,339] DEBUG Added sensor with name select-time: (org.apache.kafka.common.metrics.Metrics)
> > >
> > > --
> > > Allen Michael Chan
>
> --
> Allen Michael Chan
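A quick way to check for the OOM killer suggested above, run on the broker host shortly after a shutdown (a sketch assuming typical CentOS and Ubuntu log locations):

    # Kernel ring buffer: the OOM killer logs the victim process here.
    dmesg | grep -iE 'killed process|out of memory'

    # Distro syslogs (paths vary by setup).
    grep -i 'oom' /var/log/messages   # CentOS/RHEL
    grep -i 'oom' /var/log/syslog     # Debian/Ubuntu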
mbeans missing in 0.9.0.1?
I am using JMX to gather Kafka metrics. It states here http://docs.confluent.io/1.0/kafka/monitoring.html that they should be there, but when I run a JMX client and list the beans, kafka.consumer and kafka.producer do not exist. Is there something special I have to do to get these metrics? Thanks
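One way to see exactly which MBean domains a given JVM exposes is the JmxTool class bundled with Kafka. Note that the kafka.consumer and kafka.producer domains are registered inside the consumer/producer JVMs themselves, not the broker, so the tool has to point at the client process. The port below is a placeholder (typically set via JMX_PORT when starting the process):

    # Dump all MBeans in the kafka.consumer domain from a JMX-enabled JVM.
    bin/kafka-run-class.sh kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
      --object-name 'kafka.consumer:*'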
Re: ISR shrinking/expanding problem
They did not, even after 1.5 days of waiting. I had to drop everything and start over, because the entire Kafka cluster was stuck in an ISR shrink/expand loop even with larger hardware and fewer replica fetcher threads.

On Mon, May 16, 2016 at 1:05 PM, Alex Loddengaard wrote:
> Hi Russ,
>
> They should eventually catch back up and rejoin the ISR. Did they not?
>
> Alex
>
> On Fri, May 13, 2016 at 6:33 PM, Russ Lavoie wrote:
> > Hello,
> >
> > I moved an entire topic from one set of brokers to another set of
> > brokers. The network throughput was so high that the new replicas fell
> > behind the leaders and dropped out of the ISR set. How can I recover
> > from this?
> >
> > Thanks!
ISR shrinking/expanding problem
Hello, I moved an entire topic from one set of brokers to another set of brokers. The network throughput was so high that the new replicas fell behind the leaders and dropped out of the ISR set. How can I recover from this? Thanks!
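While replicas catch back up, one way to watch recovery is to list the partitions whose ISR is still smaller than the replica set (a sketch; the ZooKeeper hostname is a placeholder):

    # Partitions drop off this list as their replicas rejoin the ISR.
    bin/kafka-topics.sh --zookeeper zkhost:2181 --describe \
      --under-replicated-partitions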
Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with ReplicaFetcherThread OOM errors
Good Afternoon,

I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1 and am running into a problem when starting 0.9.0.1 with the protocol version 0.8.2.1 set in server.properties.

Here is my current Kafka topic setup, data retention and hardware used:

- 3 ZooKeeper nodes
- 5 broker nodes
- Topics have at least 2 replicas
- Topics have no more than 200 partitions
- 4,564 partitions across 61 topics
- 14 day retention
- Each Kafka node has between 2.1T - 2.9T of data
- Hardware is C4.2xlarge AWS instances: 8 core (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz), 14G RAM, 4TB EBS volume (10k IOPS [never gets maxed unless I up the num.io.threads])

Here is my running broker configuration for 0.9.0.1:

[2016-05-11 11:43:58,172] INFO KafkaConfig values:
    advertised.host.name = server.domain
    metric.reporters = []
    quota.producer.default = 9223372036854775807
    offsets.topic.num.partitions = 150
    log.flush.interval.messages = 9223372036854775807
    auto.create.topics.enable = false
    controller.socket.timeout.ms = 3
    log.flush.interval.ms = 1000
    principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
    replica.socket.receive.buffer.bytes = 65536
    min.insync.replicas = 1
    replica.fetch.wait.max.ms = 500
    num.recovery.threads.per.data.dir = 1
    ssl.keystore.type = JKS
    default.replication.factor = 3
    ssl.truststore.password = null
    log.preallocate = false
    sasl.kerberos.principal.to.local.rules = [DEFAULT]
    fetch.purgatory.purge.interval.requests = 1000
    ssl.endpoint.identification.algorithm = null
    replica.socket.timeout.ms = 3
    message.max.bytes = 10485760
    num.io.threads = 8
    offsets.commit.required.acks = -1
    log.flush.offset.checkpoint.interval.ms = 6
    delete.topic.enable = true
    quota.window.size.seconds = 1
    ssl.truststore.type = JKS
    offsets.commit.timeout.ms = 5000
    quota.window.num = 11
    zookeeper.connect = zkserver:2181/kafka
    authorizer.class.name =
    num.replica.fetchers = 8
    log.retention.ms = null
    log.roll.jitter.hours = 0
    log.cleaner.enable = false
    offsets.load.buffer.size = 5242880
    log.cleaner.delete.retention.ms = 8640
    ssl.client.auth = none
    controlled.shutdown.max.retries = 3
    queued.max.requests = 500
    offsets.topic.replication.factor = 3
    log.cleaner.threads = 1
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    socket.request.max.bytes = 104857600
    ssl.trustmanager.algorithm = PKIX
    zookeeper.session.timeout.ms = 6000
    log.retention.bytes = -1
    sasl.kerberos.min.time.before.relogin = 6
    zookeeper.set.acl = false
    connections.max.idle.ms = 60
    offsets.retention.minutes = 1440
    replica.fetch.backoff.ms = 1000
    inter.broker.protocol.version = 0.8.2.1
    log.retention.hours = 168
    num.partitions = 16
    broker.id.generation.enable = false
    listeners = null
    ssl.provider = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    log.roll.ms = null
    log.flush.scheduler.interval.ms = 9223372036854775807
    ssl.cipher.suites = null
    log.index.size.max.bytes = 10485760
    ssl.keymanager.algorithm = SunX509
    security.inter.broker.protocol = PLAINTEXT
    replica.fetch.max.bytes = 104857600
    advertised.port = null
    log.cleaner.dedupe.buffer.size = 134217728
    replica.high.watermark.checkpoint.interval.ms = 5000
    log.cleaner.io.buffer.size = 524288
    sasl.kerberos.ticket.renew.window.factor = 0.8
    zookeeper.connection.timeout.ms = 6000
    controlled.shutdown.retry.backoff.ms = 5000
    log.roll.hours = 168
    log.cleanup.policy = delete
    host.name =
    log.roll.jitter.ms = null
    max.connections.per.ip = 2147483647
    offsets.topic.segment.bytes = 104857600
    background.threads = 10
    quota.consumer.default = 9223372036854775807
    request.timeout.ms = 3
    log.index.interval.bytes = 4096
    log.dir = /tmp/kafka-logs
    log.segment.bytes = 268435456
    log.cleaner.backoff.ms = 15000
    offset.metadata.max.bytes = 4096
    ssl.truststore.location = null
    group.max.session.timeout.ms = 3
    ssl.keystore.password = null
    zookeeper.sync.time.ms = 2000
    port = 9092
    log.retention.minutes = null
    log.segment.delete.delay.ms = 6
    log.dirs = /mnt/kafka/data
    controlled.shutdown.enable = true
    compression.type = producer
    max.connections.per.ip.overrides =
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
    auto.leader.rebalance.enable = true
    leader.imbalance.check.interval.seconds = 300
    log.cleaner.min.cleanable.ratio = 0.5
    replica.lag.time.max.ms = 1
    num.network.threads = 8
    ssl.key.password = null
    reserved.broker.max.id = 1000
    metrics.num.samples = 2
    socket.send.buffer.bytes = 2097152
    ssl.protocol = TLS
    socket.receive.buffer.bytes = 2097152
    ssl.keystore.location = null
    replica.fetch.min.bytes = 1
    unclean.leader.election.enable = false
    group.min.session.timeout.ms = 6000
    log.cleaner.io.buffer.load.factor = 0.9
    offsets.retention.check.interval.ms = 60
    producer.purgatory.purge.interval.requests = 1000
    metrics.sample.window.ms = 3
    broker.id = 2
    offsets.topic.compression.codec = 0
    log.retention.check.interval.ms = 30
    advertised.listeners = null
    leader.imbalance.per.broker.percenta
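A back-of-envelope check that may explain ReplicaFetcherThread OOMs under this configuration, assuming as a rough model of the 0.9 fetcher that every partition in a single fetch response can be buffered up to replica.fetch.max.bytes:

    # 4,564 partitions with 2-3 replicas is roughly 10,000 replica instances
    # across 5 brokers, i.e. on the order of 1,000 follower partitions per
    # broker; split over num.replica.fetchers=8 that is ~125 partitions per
    # fetcher thread. Worst case each one returns replica.fetch.max.bytes
    # (104857600 bytes = 100 MB) in the same response:
    echo $(( 125 * 104857600 / 1024**3 ))   # ~12 GB buffered vs. 14G of RAM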