Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors

2016-07-11 Thread Hao Xu
Oh, I am talking about another memory leak: the off-heap memory leak we
experienced, which involves direct buffer memory. The call stack is below.
 ReplicaFetcherThread.warn - [ReplicaFetcherThread-4-1463989770], Error in
fetch kafka.server.ReplicaFetcherThread$FetchRequest@7f4c1657. Possible
cause: java.lang.OutOfMemoryError: Direct buffer memory
 ERROR kafka-network-thread-75737866-PLAINTEXT-26 Processor.error -
Processor got uncaught exception.
local3.err java.lang.OutOfMemoryError: - Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
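
For context, here is a minimal, self-contained sketch (an illustration only,
not Kafka's code path) of how this particular error arises: direct allocations
are accounted for in java.nio.Bits.reserveMemory, and once the cap set by
-XX:MaxDirectMemorySize (which defaults to roughly the max heap size) is
exhausted, the JVM throws java.lang.OutOfMemoryError: Direct buffer memory.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustration only, not Kafka's code path. Run with a small cap, e.g.
// -XX:MaxDirectMemorySize=64m, to hit the error quickly.
public class DirectBufferOom {
    public static void main(String[] args) {
        List<ByteBuffer> pinned = new ArrayList<>();
        while (true) {
            // Each call reserves 1 MiB of off-heap (direct) memory via
            // java.nio.Bits.reserveMemory; keeping the references prevents
            // the GC from ever freeing the buffers.
            pinned.add(ByteBuffer.allocateDirect(1024 * 1024));
        }
    }
}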

On Mon, Jul 11, 2016 at 10:54 AM, Tom Crayford  wrote:

> Hi (I'm the author of that ticket):
>
> From my understanding, limiting MaxDirectMemory won't work around this memory
> leak. The leak is inside the JVM's implementation, not in normal direct
> buffers. On one of our brokers with this issue, we're seeing the JVM report
> 1.2GB of heap, and 128MB of offheap memory, yet the actual process memory
> is more like 10GB.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka
>



-- 
-hao


Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors

2016-07-11 Thread Tom Crayford
Hi (I'm the author of that ticket):

From my understanding, limiting MaxDirectMemory won't work around this memory
leak. The leak is inside the JVM's implementation, not in normal direct
buffers. On one of our brokers with this issue, we're seeing the JVM report
1.2GB of heap, and 128MB of offheap memory, yet the actual process memory
is more like 10GB.
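
For anyone trying to reproduce this comparison, one way to get the JVM-side
numbers is the platform MXBeans, as in the sketch below (a diagnostic snippet,
not something from the ticket). It only reports what the JVM itself accounts
for, heap plus the direct and mapped buffer pools, which is exactly why it can
disagree so badly with the resident set size that ps or top shows for the
broker process.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Prints the memory the JVM knows about; compare with the broker's RSS
// from ps/top to see how much is held outside the JVM's own accounting.
public class JvmMemoryReport {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.printf("heap used:     %,d bytes%n",
                mem.getHeapMemoryUsage().getUsed());
        System.out.printf("non-heap used: %,d bytes%n",
                mem.getNonHeapMemoryUsage().getUsed());
        // Typically two pools are exposed: "direct" and "mapped".
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %-6s used: %,d bytes (capacity %,d)%n",
                    pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}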

Thanks

Tom Crayford
Heroku Kafka


Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors

2016-07-11 Thread feifei hsu
Please refer to KAFKA-3933.
A workaround is -XX:MaxDirectMemorySize=1024m if your call stack shows direct
buffer issues (effectively off-heap memory).
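
For what it's worth, here is a rough sketch of one way to check over JMX that
a broker's direct-buffer usage is staying under that cap. It assumes remote
JMX is enabled on the broker (for example by exporting JMX_PORT before
starting it); the host and port below are placeholders.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Reads the "direct" BufferPool MBean of a remote broker JVM and prints
// how much direct memory is in use versus the pool's total capacity.
public class DirectBufferCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; point this at the broker's JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName directPool =
                    new ObjectName("java.nio:type=BufferPool,name=direct");
            long used = (Long) conn.getAttribute(directPool, "MemoryUsed");
            long capacity = (Long) conn.getAttribute(directPool, "TotalCapacity");
            System.out.printf("direct buffers: %,d bytes used, %,d bytes capacity%n",
                    used, capacity);
        }
    }
}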

On Wed, May 11, 2016 at 9:50 AM, Russ Lavoie  wrote:

> Good Afternoon,
>
> I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1
> and am running into a problem when starting 0.9.0.1 with the protocol
> version 0.8.2.1 set in the server.properties.
>
> Here is my current Kafka topic setup, data retention and hardware used:
>
> 3 Zookeeper nodes
> 5 Broker nodes
> Topics have at least 2 replicas
> Topics have no more than 200 partitions
> 4,564 partitions across 61 topics
> 14 day retention
> Each Kafka node has between 2.1T - 2.9T of data
> Hardware is C4.2xlarge AWS instances
>  - 8 Core (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz)
>  - 14G Ram
>  - 4TB EBS volume (10k IOPS [never gets maxed unless I up the
> num.io.threads])
>

Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors

2016-05-11 Thread Russ Lavoie
Good Afternoon,

I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1
and am running into a problem when starting 0.9.0.1 with the protocol
version 0.8.2.1 set in the server.properties.

Here is my current Kafka topic setup, data retention and hardware used:

3 Zookeeper nodes
5 Broker nodes
Topics have at least 2 replicas
Topics have no more than 200 partitions
4,564 partitions across 61 topics
14 day retention
Each Kafka node has between 2.1T - 2.9T of data
Hardware is C4.2xlarge AWS instances
 - 8 Core (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz)
 - 14G Ram
 - 4TB EBS volume (10k IOPS [never gets maxed unless I up the
num.io.threads])

Here is my running broker configuration for 0.9.0.1:

[2016-05-11 11:43:58,172] INFO KafkaConfig values:
advertised.host.name = server.domain
metric.reporters = []
quota.producer.default = 9223372036854775807
offsets.topic.num.partitions = 150
log.flush.interval.messages = 9223372036854775807
auto.create.topics.enable = false
controller.socket.timeout.ms = 3
log.flush.interval.ms = 1000
principal.builder.class = class
org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
replica.socket.receive.buffer.bytes = 65536
min.insync.replicas = 1
replica.fetch.wait.max.ms = 500
num.recovery.threads.per.data.dir = 1
ssl.keystore.type = JKS
default.replication.factor = 3
ssl.truststore.password = null
log.preallocate = false
sasl.kerberos.principal.to.local.rules = [DEFAULT]
fetch.purgatory.purge.interval.requests = 1000
ssl.endpoint.identification.algorithm = null
replica.socket.timeout.ms = 3
message.max.bytes = 10485760
num.io.threads =8
offsets.commit.required.acks = -1
log.flush.offset.checkpoint.interval.ms = 6
delete.topic.enable = true
quota.window.size.seconds = 1
ssl.truststore.type = JKS
offsets.commit.timeout.ms = 5000
quota.window.num = 11
zookeeper.connect = zkserver:2181/kafka
authorizer.class.name =
num.replica.fetchers = 8
log.retention.ms = null
log.roll.jitter.hours = 0
log.cleaner.enable = false
offsets.load.buffer.size = 5242880
log.cleaner.delete.retention.ms = 8640
ssl.client.auth = none
controlled.shutdown.max.retries = 3
queued.max.requests = 500
offsets.topic.replication.factor = 3
log.cleaner.threads = 1
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
socket.request.max.bytes = 104857600
ssl.trustmanager.algorithm = PKIX
zookeeper.session.timeout.ms = 6000
log.retention.bytes = -1
sasl.kerberos.min.time.before.relogin = 6
zookeeper.set.acl = false
connections.max.idle.ms = 60
offsets.retention.minutes = 1440
replica.fetch.backoff.ms = 1000
inter.broker.protocol.version = 0.8.2.1
log.retention.hours = 168
num.partitions = 16
broker.id.generation.enable = false
listeners = null
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
log.roll.ms = null
log.flush.scheduler.interval.ms = 9223372036854775807
ssl.cipher.suites = null
log.index.size.max.bytes = 10485760
ssl.keymanager.algorithm = SunX509
security.inter.broker.protocol = PLAINTEXT
replica.fetch.max.bytes = 104857600
advertised.port = null
log.cleaner.dedupe.buffer.size = 134217728
replica.high.watermark.checkpoint.interval.ms = 5000
log.cleaner.io.buffer.size = 524288
sasl.kerberos.ticket.renew.window.factor = 0.8
zookeeper.connection.timeout.ms = 6000
controlled.shutdown.retry.backoff.ms = 5000
log.roll.hours = 168
log.cleanup.policy = delete
host.name =
log.roll.jitter.ms = null
max.connections.per.ip = 2147483647
offsets.topic.segment.bytes = 104857600
background.threads = 10
quota.consumer.default = 9223372036854775807
request.timeout.ms = 3
log.index.interval.bytes = 4096
log.dir = /tmp/kafka-logs
log.segment.bytes = 268435456
log.cleaner.backoff.ms = 15000
offset.metadata.max.bytes = 4096
ssl.truststore.location = null
group.max.session.timeout.ms = 3
ssl.keystore.password = null
zookeeper.sync.time.ms = 2000
port = 9092
log.retention.minutes = null
log.segment.delete.delay.ms = 6
log.dirs = /mnt/kafka/data
controlled.shutdown.enable = true
compression.type = producer
max.connections.per.ip.overrides =
sasl.kerberos.kinit.cmd = /usr/bin/kinit
log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
auto.leader.rebalance.enable = true
leader.imbalance.check.interval.seconds = 300
log.cleaner.min.cleanable.ratio = 0.5
replica.lag.time.max.ms = 1
num.network.threads =8
ssl.key.password = null
reserved.broker.max.id = 1000
metrics.num.samples = 2
socket.send.buffer.bytes = 2097152
ssl.protocol = TLS
socket.receive.buffer.bytes = 2097152
ssl.keystore.location = null
replica.fetch.min.bytes = 1
unclean.leader.election.enable = false
group.min.session.timeout.ms = 6000
log.cleaner.io.buffer.load.factor = 0.9
offsets.retention.check.interval.ms = 60
producer.purgatory.purge.interval.requests = 1000
metrics.sample.window.ms = 3
broker.id = 2
offsets.topic.compression.codec = 0
log.retention.check.interval.ms = 30
advertised.listeners = null
leader.imbalance.per.broker.percenta