Hi,
I'm currently trying to debug an error that happens in some of our services
that I do not understand. To give some context:
- We are using compacted topics for communication between multiple
microservices with one topic per business entity.
- The services are spring boot based and we are using the
spring-cloud-stream-kafka-binder.
- Since we want to have exactly once semantic and also need to coordinate the
kafka transactions with database transactions we are using the
"ChainedTransactionManager". This should (in theory, take care of all the messy
transaction handling).
Here is an excerpt of the logs when the error happens, I will comment it a bit
to make it easier to understand (logs lines are intended, my comments are on
the top level.
2019-10-10T08:59:19.945+00:00 DEBUG
customer.remoteuser.event.UserEventAdapter : Received event user.updated for
subject 9b3388e7-db34-47c6-b6ae-370e1899baef
The service received a new event. To my understanding it will write the read
offset to __consumer_offsets. Since we configured it transactional it will now
create a transactional producer so the write to __consumer_offsets will be in
the same transaction as any other messages the might be written out (i.e. to
other topics).
2019-10-10T08:59:20.104+00:00 INFO
org.apache.kafka.clients.producer.ProducerConfig : ProducerConfig values:
acks = 1
batch.size = 16384
bootstrap.servers = [hidden]
buffer.memory = 33554432
client.id =
compression.type = none
connections.max.idle.ms = 540000
enable.idempotence = true
interceptor.classes = []
key.serializer = class
org.apache.kafka.common.serialization.ByteArraySerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class
org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 0
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = [hidden]
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = [hidden]
ssl.keystore.password = [hidden]
ssl.keystore.type = JKS
ssl.protocol = SSL
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = [hidden]
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = c-customerservice-0c-customerservice.dev-users-v1.0
value.serializer = class
org.apache.kafka.common.serialization.ByteArraySerializer
I noticed that the transaction.timeout is a full minute. Other than that it
does not look special to me.
2019-10-10T08:59:20.106+00:00 INFO
org.apache.kafka.clients.producer.KafkaProducer : [Producer
clientId=producer-31,
transactionalId=c-customerservice-0c-customerservice.dev-users-v1.0]
Instantiated a transactional producer.
2019-10-10T08:59:20.106+00:00 INFO
org.apache.kafka.clients.producer.KafkaProducer : [Producer
clientId=producer-31,
transactionalId=c-customerservice-0c-customerservice.dev-users-v1.0] Overriding
the default retries config to the recommended value of 2147483647 since the
idempotent producer is enabled.
2019-10-10T08:59:20.107+00:00 INFO
org.apache.kafka.clients.producer.KafkaProducer : [Producer
clientId=producer-31,
transactionalId=c-customerservice-0c-customerservice.dev-users-v1.0] Overriding
the default acks to all since idempotence is enabled.
2019-10-10T08:59:20.842+00:00 INFO
org.apache.kafka.common.utils.AppInfoParser : Kafka version : 2.0.1
2019-10-10T08:59:20.842+00:00 INFO
org.apache.kafka.common.utils.AppInfoParser : Kafka commitId : fa14705e51bd2ce5
2019-10-10T08:59:20.843+00:00 INFO
org.apache.kafka.clients.producer.internals.TransactionManager : [Producer
clientId=producer-31,
transactionalId=c-customerservice-0c-customerservice.dev-users-v1.0] ProducerId
set to -1 with epoch -1
2019-10-10T08:59:21.115+00:00 INFO
org.apache.kafka.clients.producer.internals.TransactionManager : [Producer
clientId=producer-31,
transactionalId=c-customerservice-0c-customerservice.dev-users-v1.0] ProducerId
set to 17778 with epoch 60
This is interesting to me, why the double output? I *suspect* the producer is
created "empty" and then actually communicates with the broker and received a
proper ProducerId and gets assigned an epoch. Is this assumption correct?
2019-10-10T08:59:21.146+00:00 DEBUG customer.CustomerService :
createOrUpdate updated existing customer 9b3388e7-db34-47c6-b6ae-370e1899baef
2019-10-10T08:59:21.315+00:00 ERROR
org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer
: Send offsets to transaction failed
org.apache.kafka.common.errors.ProducerFencedException: Producer
attempted an operation with an old epoch. Either there is a newer producer with
the same transactionalId, or the producer's transaction has been expired by the
broker.
And this it where it ends (being Java there are more exception but this is
always the root cause) and I do not understand whats happening. There are to
possible options listed:
A) "there is a newer producer with the same transactionalId" - I am certain
that there is not. There is exactly one instance of the service running and the
transaction id contains the service name. Since spring kafka also creates a
transactionId per group->topic->partition there really should not be any other
producer with this combination of attributes.
B) "transaction has been expired by the broker" - Well, the transaction timeout
is 60000ms and this error happens only 1500ms after the producer creation. So
this option is also out of the window.
Is there any other explanation possible? Do you need more information?
Regards
Martin