Re: Unavailable Partitions and Uneven ISR

2016-06-02 Thread Russ Lavoie
Have you verified that the old leader of the partition is using the same
ID as before?  Check /brokers/ids in ZooKeeper to get the list of available
brokers.  I would use the reassignment tool to move partition 3 onto brokers
from that list (specifying only 3 brokers).  Make sure to include the broker
with ID 1, since it was the only ISR in the list before the restart.  I would
start there.
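
Something along these lines is what I mean (the target broker IDs and the
ZooKeeper address below are placeholders; pick three live IDs from
/brokers/ids and keep broker 1 among them):

cat > reassign-partition-3.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "topic1", "partition": 3, "replicas": [1, 0, 2] }
  ]
}
EOF

bin/kafka-reassign-partitions.sh --zookeeper zkhost:2181 \
  --reassignment-json-file reassign-partition-3.json --execute

# re-run with --verify until it reports the reassignment as completed
bin/kafka-reassign-partitions.sh --zookeeper zkhost:2181 \
  --reassignment-json-file reassign-partition-3.json --verify
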
On Jun 1, 2016 11:40 AM, "Tushar Agrawal"  wrote:

> Hi,
>
> We have 5 brokers running 0.9.0.1 with 5 ZooKeeper nodes. This morning,
> multiple topics had "unavailable-partitions" (partitions whose leader is not
> available). After going through the logs, forums, and Google results, we
> finally restarted all the brokers one by one, and the issue seems to be
> resolved.
>
> However, that particular partition now has five replicas in its ISR instead
> of 3.  What should we do to fix this issue?
>
>
> *Before restart*
>
> Topic:topic1 PartitionCount:8 ReplicationFactor:3 Configs:retention.ms=25920
> Topic: topic1 Partition: 0 Leader: 0 Replicas: 0,3,4 Isr: 0,3,4
> Topic: topic1 Partition: 1 Leader: 1 Replicas: 1,4,0 Isr: 0,1,4
> Topic: topic1 Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 1,2,0
> Topic: topic1 Partition: 3 Leader: -1 Replicas: 0,1,2,3,4 Isr: 1
> Topic: topic1 Partition: 4 Leader: 4 Replicas: 4,2,3 Isr: 4,3,2
> Topic: topic1 Partition: 5 Leader: 1 Replicas: 0,4,1 Isr: 1,4,0
> Topic: topic1 Partition: 6 Leader: 2 Replicas: 1,0,2 Isr: 2,0,1
> Topic: topic1 Partition: 7 Leader: 3 Replicas: 2,1,3 Isr: 3,2,1
>
> *After restart*
>
> Topic:topic1 PartitionCount:8 ReplicationFactor:3 Configs:retention.ms=25920
> Topic: topic1 Partition: 0 Leader: 0 Replicas: 0,3,4 Isr: 0,3,4
> Topic: topic1 Partition: 1 Leader: 1 Replicas: 1,4,0 Isr: 0,1,4
> Topic: topic1 Partition: 2 Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
> Topic: topic1 Partition: 3 Leader: 0 Replicas: 0,1,2,3,4 Isr: 0,1,2,3,4
> Topic: topic1 Partition: 4 Leader: 4 Replicas: 4,2,3 Isr: 3,4,2
> Topic: topic1 Partition: 5 Leader: 0 Replicas: 0,4,1 Isr: 0,1,4
> Topic: topic1 Partition: 6 Leader: 1 Replicas: 1,0,2 Isr: 0,1,2
> Topic: topic1 Partition: 7 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
>
> Thank you,
> Tushar
>


Re: broker randomly shuts down

2016-06-02 Thread Russ Lavoie
What about in dmesg?  I have run into this issue and it was the OOM
killer.  I also ran into a heap issue where the JVM was using too much direct
memory.  Reducing the fetcher threads helped with that problem.
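
For example, something like this usually shows whether the kernel killed the
process (log paths vary by distro):

dmesg | grep -i -E 'killed process|out of memory'
grep -i 'killed process' /var/log/messages   # CentOS/RHEL
grep -i 'killed process' /var/log/syslog     # Debian/Ubuntu
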
On Jun 2, 2016 12:19 PM, "allen chan"  wrote:

> Hi Tom,
>
> That is one of the first things that I checked. Active memory never goes
> above 50% of what is available. The file cache uses the rest of the memory,
> but I do not think that triggers the OOM killer.
> Either way, there are no entries in /var/log/messages (CentOS) to show that
> the OOM killer is running.
>
> Thanks
>
> On Thu, Jun 2, 2016 at 5:36 AM, Tom Crayford  wrote:
>
> > That looks like somebody is killing the process. I'd suspect either the
> > Linux OOM killer or something else automatically killing the JVM for some
> > reason.
> >
> > For the OOM killer, assuming you're on Ubuntu, it's pretty easy to find in
> > /var/log/syslog (depending on your setup). I don't know about other
> > operating systems.
> >
> > > On Thu, Jun 2, 2016 at 5:54 AM, allen chan wrote:
> >
> > > I have an issue where my brokers randomly shut themselves down.
> > > I turned on DEBUG in log4j.properties but still do not see a reason why
> > > the shutdown is happening.
> > >
> > > Anyone seen this behavior before?
> > >
> > > version 0.10.0
> > > log4j.properties
> > > log4j.rootLogger=DEBUG, kafkaAppender
> > > * I tried TRACE level but I do not see any additional log messages
> > >
> > > snippet of log around shutdown
> > > [2016-06-01 15:11:51,374] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:53,376] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:55,377] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:57,380] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:11:59,383] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:01,386] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:03,389] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:04,121] INFO [Group Metadata Manager on Broker 2]:
> > > Removed 0 expired offsets in 0 milliseconds.
> > > (kafka.coordinator.GroupMetadataManager)
> > > [2016-06-01 15:12:04,121] INFO [Group Metadata Manager on Broker 2]:
> > > Removed 0 expired offsets in 0 milliseconds.
> > > (kafka.coordinator.GroupMetadataManager)
> > > [2016-06-01 15:12:05,390] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:07,393] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:09,396] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:11,399] DEBUG Got ping response for sessionid:
> > > 0x2550a693b470001 after 1ms (org.apache.zookeeper.ClientCnxn)
> > > [2016-06-01 15:12:13,334] INFO [Kafka Server 2], shutting down
> > > (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,334] INFO [Kafka Server 2], shutting down
> > > (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,336] INFO [Kafka Server 2], Starting controlled
> > > shutdown (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,336] INFO [Kafka Server 2], Starting controlled
> > > shutdown (kafka.server.KafkaServer)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name connections-closed:
> > > (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name connections-created:
> > > (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name bytes-sent-received:
> > > (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,338] DEBUG Added sensor with name bytes-sent:
> > > (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,339] DEBUG Added sensor with name bytes-received:
> > > (org.apache.kafka.common.metrics.Metrics)
> > > [2016-06-01 15:12:13,339] DEBUG Added sensor with name select-time:
> > > (org.apache.kafka.common.metrics.Metrics)
> > >
> > > --
> > > Allen Michael Chan
> > >
> >
>
>
>
> --
> Allen Michael Chan
>


mbeans missing in 0.9.0.1?

2016-05-16 Thread Russ Lavoie
I am using JMX to gather Kafka metrics.  The documentation at
http://docs.confluent.io/1.0/kafka/monitoring.html states that these metrics
should be there, but when I run a JMX client and list the beans, kafka.consumer
and kafka.producer do not exist.  Is there something special I have to do to
get these metrics?
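
For reference, the kind of lookup I am trying can be reproduced with the
JmxTool that ships with Kafka (broker-host and the JMX port 9999 below are
placeholders):

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec'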

Thanks


Re: ISR shrinking/expanding problem

2016-05-16 Thread Russ Lavoie
They did not, even after 1.5 days of waiting.  I had to drop everything and
start over, because the entire Kafka cluster was stuck in an ISR shrink/expand
loop even with larger hardware and fewer replica fetcher threads.
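
For context, the broker settings involved are along these lines (the values
here are purely illustrative, not a recommendation):

# server.properties
num.replica.fetchers=4            # replica fetcher threads per source broker
replica.lag.time.max.ms=30000     # how long a follower may lag before it is dropped from the ISR
replica.fetch.max.bytes=10485760  # max bytes fetched per partition per request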

On Mon, May 16, 2016 at 1:05 PM, Alex Loddengaard  wrote:

> Hi Russ,
>
> They should eventually catch back up and rejoin the ISR. Did they not?
>
> Alex
>
> On Fri, May 13, 2016 at 6:33 PM, Russ Lavoie  wrote:
>
> > Hello,
> >
> > I moved an entire topic from one set of brokers to another set of
> > brokers.  The network throughput was so high that the replicas fell behind
> > the leaders and dropped out of the ISR set.  How can I recover from this?
> >
> > Thanks!
> >
>


ISR shrinking/expanding problem

2016-05-13 Thread Russ Lavoie
Hello,

I moved an entire topic from one set of brokers to another set of brokers.
The network throughput was so high that the replicas fell behind the leaders
and dropped out of the ISR set.  How can I recover from this?
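
For what it's worth, the affected partitions can be listed like this (the
topic name and ZooKeeper address are placeholders):

bin/kafka-topics.sh --describe --zookeeper zkhost:2181 \
  --topic mytopic --under-replicated-partitions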

Thanks!


Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors

2016-05-11 Thread Russ Lavoie
Good Afternoon,

I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1
and am running into a problem when starting 0.9.0.1 with protocol version
0.8.2.1 set in server.properties.
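
Concretely, during the rolling restart each 0.9.0.1 broker has the protocol
pinned in server.properties like this (the commented second step is the
standard follow-up from the upgrade procedure):

# server.properties while brokers are being upgraded one by one
inter.broker.protocol.version=0.8.2.1

# once every broker is running 0.9.0.1, bump this to 0.9.0.0 and restart
# the brokers one more time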

Here is my current Kafka topic setup, data retention and hardware used:

3 Zookeeper nodes
5 Broker nodes
Topics have at least 2 replicas
Topics have no more than 200 partitions
4,564 partitions across 61 topics
14 day retention
Each Kafka node has between 2.1T and 2.9T of data
Hardware is C4.2xlarge AWS instances
 - 8 Core (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz)
 - 14G Ram
 - 4TB EBS volume (10k IOPS [never gets maxed unless I up the
num.io.threads])

Here is my running broker configuration for 0.9.0.1:

[2016-05-11 11:43:58,172] INFO KafkaConfig values:
advertised.host.name = server.domain
metric.reporters = []
quota.producer.default = 9223372036854775807
offsets.topic.num.partitions = 150
log.flush.interval.messages = 9223372036854775807
auto.create.topics.enable = false
controller.socket.timeout.ms = 3
log.flush.interval.ms = 1000
principal.builder.class = class
org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
replica.socket.receive.buffer.bytes = 65536
min.insync.replicas = 1
replica.fetch.wait.max.ms = 500
num.recovery.threads.per.data.dir = 1
ssl.keystore.type = JKS
default.replication.factor = 3
ssl.truststore.password = null
log.preallocate = false
sasl.kerberos.principal.to.local.rules = [DEFAULT]
fetch.purgatory.purge.interval.requests = 1000
ssl.endpoint.identification.algorithm = null
replica.socket.timeout.ms = 3
message.max.bytes = 10485760
num.io.threads =8
offsets.commit.required.acks = -1
log.flush.offset.checkpoint.interval.ms = 6
delete.topic.enable = true
quota.window.size.seconds = 1
ssl.truststore.type = JKS
offsets.commit.timeout.ms = 5000
quota.window.num = 11
zookeeper.connect = zkserver:2181/kafka
authorizer.class.name =
num.replica.fetchers = 8
log.retention.ms = null
log.roll.jitter.hours = 0
log.cleaner.enable = false
offsets.load.buffer.size = 5242880
log.cleaner.delete.retention.ms = 8640
ssl.client.auth = none
controlled.shutdown.max.retries = 3
queued.max.requests = 500
offsets.topic.replication.factor = 3
log.cleaner.threads = 1
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
socket.request.max.bytes = 104857600
ssl.trustmanager.algorithm = PKIX
zookeeper.session.timeout.ms = 6000
log.retention.bytes = -1
sasl.kerberos.min.time.before.relogin = 6
zookeeper.set.acl = false
connections.max.idle.ms = 60
offsets.retention.minutes = 1440
replica.fetch.backoff.ms = 1000
inter.broker.protocol.version = 0.8.2.1
log.retention.hours = 168
num.partitions = 16
broker.id.generation.enable = false
listeners = null
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
log.roll.ms = null
log.flush.scheduler.interval.ms = 9223372036854775807
ssl.cipher.suites = null
log.index.size.max.bytes = 10485760
ssl.keymanager.algorithm = SunX509
security.inter.broker.protocol = PLAINTEXT
replica.fetch.max.bytes = 104857600
advertised.port = null
log.cleaner.dedupe.buffer.size = 134217728
replica.high.watermark.checkpoint.interval.ms = 5000
log.cleaner.io.buffer.size = 524288
sasl.kerberos.ticket.renew.window.factor = 0.8
zookeeper.connection.timeout.ms = 6000
controlled.shutdown.retry.backoff.ms = 5000
log.roll.hours = 168
log.cleanup.policy = delete
host.name =
log.roll.jitter.ms = null
max.connections.per.ip = 2147483647
offsets.topic.segment.bytes = 104857600
background.threads = 10
quota.consumer.default = 9223372036854775807
request.timeout.ms = 3
log.index.interval.bytes = 4096
log.dir = /tmp/kafka-logs
log.segment.bytes = 268435456
log.cleaner.backoff.ms = 15000
offset.metadata.max.bytes = 4096
ssl.truststore.location = null
group.max.session.timeout.ms = 3
ssl.keystore.password = null
zookeeper.sync.time.ms = 2000
port = 9092
log.retention.minutes = null
log.segment.delete.delay.ms = 6
log.dirs = /mnt/kafka/data
controlled.shutdown.enable = true
compression.type = producer
max.connections.per.ip.overrides =
sasl.kerberos.kinit.cmd = /usr/bin/kinit
log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
auto.leader.rebalance.enable = true
leader.imbalance.check.interval.seconds = 300
log.cleaner.min.cleanable.ratio = 0.5
replica.lag.time.max.ms = 1
num.network.threads =8
ssl.key.password = null
reserved.broker.max.id = 1000
metrics.num.samples = 2
socket.send.buffer.bytes = 2097152
ssl.protocol = TLS
socket.receive.buffer.bytes = 2097152
ssl.keystore.location = null
replica.fetch.min.bytes = 1
unclean.leader.election.enable = false
group.min.session.timeout.ms = 6000
log.cleaner.io.buffer.load.factor = 0.9
offsets.retention.check.interval.ms = 60
producer.purgatory.purge.interval.requests = 1000
metrics.sample.window.ms = 3
broker.id = 2
offsets.topic.compression.codec = 0
log.retention.check.interval.ms = 30
advertised.listeners = null
leader.imbalance.per.broker.percenta