[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nitay Kufert updated KAFKA-9824:
--------------------------------
Description:
Hello,
Around two weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from
2.3.1), and we started noticing a troubling behavior that we didn't see
before: for no apparent reason, a specific partition on a specific consumer
loses its offset and starts re-consuming the entire partition from the
beginning (i.e., from the earliest offset still covered by retention).
Messages appearing on the consumer (client), newest first:
{quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer
clientId=consumer-fireAttributionConsumerGroup4-2,
groupId=fireAttributionConsumerGroup4] Resetting offset for partition
trackingSolutionAttribution-48 to offset 1216430527.
{quote}
{quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer
clientId=consumer-fireAttributionConsumerGroup4-2,
groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range
for partition trackingSolutionAttribution-48
{quote}
These are the broker logs from the same time, newest first (a search for
"trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4"):
{quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset
1222791065
Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in
0 ms.
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive
groups
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive
groups
{quote}
Another way to see the same thing, from our monitoring (DD) of the partition
offset:
!image-2020-04-06-13-14-47-014.png!
The recovery you are seeing is after I ran the partition offset reset manually
(using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic
trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4
--reset-offsets --to-datetime 'SOME DATE'); a programmatic sketch of the same
reset follows.
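For reference, the same reset can also be done with the plain Java client. This is an illustrative sketch, not what we actually ran; the topic, group, and partition are taken from the CLI command above, and the timestamp argument stands in for the elided 'SOME DATE' as epoch millis. It assumes the group's consumers are stopped, since offset commits from outside a live group are rejected by the coordinator:
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResetOffsetToTime {
    public static void main(String[] args) {
        long resetMillis = Long.parseLong(args[0]); // the 'SOME DATE' timestamp, as epoch millis

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("trackingSolutionAttribution", 57);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Find the earliest offset whose timestamp is >= resetMillis...
            Map<TopicPartition, OffsetAndTimestamp> byTime =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, resetMillis));
            OffsetAndTimestamp found = byTime.get(tp);
            if (found != null) {
                // ...and commit it for the group, mirroring --reset-offsets --to-datetime.
                consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(found.offset())));
            }
        }
    }
}
{code}
Note that the CLI variant only prints the planned reset unless --execute is also passed.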
Any idea what could be causing this? It has happened to us at least 5 times
since the upgrade, and before that I don't remember it ever happening.
Topic config is set to default, except the retention, which is manually set to
43200000 ms (12 hours).
The topic has 60 partitions & a replication factor of 2.
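For completeness, here is a sketch of how a retention override like this can be applied with the AdminClient that ships with 2.4 (illustrative only; how we actually set the override isn't part of this report):
{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "trackingSolutionAttribution");
            // 43200000 ms = 12 hours; an incremental SET touches only this one key (KIP-339).
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Collections.singletonMap(
                    topic,
                    Collections.singletonList(
                            new AlterConfigOp(new ConfigEntry("retention.ms", "43200000"),
                                              AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
{code}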
Consumer config:
{code:java}
ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [..........]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = fireAttributionConsumerGroup4
group.instance.id = null
heartbeat.interval.ms = 10000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class
org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class
org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 30000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class
org.apache.kafka.common.serialization.StringDeserializer
{code}
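One interaction worth spelling out: with auto.offset.reset = earliest, an out-of-range fetch (the "Fetch offset ... is out of range" line above) makes the client silently reset that partition to the log start offset, which is exactly the re-consume-from-the-beginning behavior we observe. So the puzzle is why offset 1222791071 was considered out of range in the first place; the rewind itself is just the configured reset policy. A minimal consumer sketch with the relevant settings (illustrative, not our production code):
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FireAttributionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fireAttributionConsumerGroup4");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");
        // On an out-of-range fetch, "earliest" silently rewinds to the log start offset.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<byte[], String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("trackingSolutionAttribution"));
            while (true) {
                ConsumerRecords<byte[], String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> { /* process */ });
            }
        }
    }
}
{code}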
Broker config (server.properties) - we have 9 brokers:
{code:java}
num.io.threads=4
num.network.threads=64
socket.send.buffer.bytes=33554432
socket.receive.buffer.bytes=33554432
socket.request.max.bytes=104857600
num.partitions=2
log.retention.hours=1
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
zookeeper.connection.timeout.ms=1000000
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.1
log.cleaner.threads=2
log.cleanup.policy=delete
log.cleaner.delete.retention.ms=86400000
log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=9223372036854775807
message.max.bytes=10000000
replica.fetch.max.bytes=10000000
default.replication.factor=2
delete.topic.enable=true
unclean.leader.election.enable=false
compression.type=snappy
inter.broker.protocol.version=2.4
{code}
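One more thing worth double-checking on our side (a hypothetical diagnostic, not something from the logs above): the broker default here is log.retention.hours=1, while the topic override is 12 hours, so it's worth confirming the override is actually in effect on the topic:
{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "trackingSolutionAttribution");
            Config cfg = admin.describeConfigs(Collections.singletonList(topic))
                              .all().get().get(topic);
            // Expect "43200000" (12 h) if the override is in place; otherwise the
            // broker-level default (log.retention.hours=1) applies.
            System.out.println("retention.ms = " + cfg.get("retention.ms").value());
        }
    }
}
{code}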
Let me know if you need more data from me.
Thanks
> Consumer loses partition offset and resets post 2.4.1 version upgrade
> ---------------------------------------------------------------------
>
> Key: KAFKA-9824
> URL: https://issues.apache.org/jira/browse/KAFKA-9824
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Nitay Kufert
> Priority: Major
> Attachments: image-2020-04-06-13-14-47-014.png