[ https://issues.apache.org/jira/browse/KAFKA-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963303#comment-16963303 ]
amuthan Ganeshan commented on KAFKA-9073: ----------------------------------------- Okay, Today morning I got the same error in my staging environment, luckily I got INFO level logging this time, I just attached the log for you guys to review, please let me know if you find any potential root cause. Thanks. > Kafka Streams State stuck in rebalancing after one of the StreamThread > encounters java.lang.IllegalStateException: No current assignment for > partition > ------------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: KAFKA-9073 > URL: https://issues.apache.org/jira/browse/KAFKA-9073 > Project: Kafka > Issue Type: Bug > Components: streams > Affects Versions: 2.3.0 > Reporter: amuthan Ganeshan > Priority: Major > Attachments: KAFKA-9073.log > > > I have a Kafka stream application that stores the incoming messages into a > state store, and later during the punctuation period, we store them into a > big data persistent store after processing the messages. > The application consumes from 120 partitions distributed across 40 instances. > The application has been running fine without any problem for months, but all > of a sudden some of the instances failed because of a stream thread exception > saying > ```java.lang.IllegalStateException: No current assignment for partition > <app_name>-<store_name>-changelog-98``` > > And other instances are stuck in the REBALANCING state, and never comes out > of it. Here is the full stack trace, I just masked the application-specific > app name and store name in the stack trace due to NDA. > > ``` > 2019-10-21 13:27:13,481 ERROR > [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] > [org.apache.kafka.streams.processor.internals.StreamThread] [] stream-thread > [application.id-a2c06c51-bfc7-449a-a094-d8b770caee92-StreamThread-3] > Encountered the following error during processing: > java.lang.IllegalStateException: No current assignment for partition > application.id-store_name-changelog-98 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:319) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.requestFailed(SubscriptionState.java:618) > at > org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:709) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30) > at > org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177) > at > org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:574) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:388) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:294) > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201) > at > org.apache.kafka.streams.processor.internals.StreamThread.maybeUpdateStandbyTasks(StreamThread.java:1126) > at > org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:923) > at > org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:805) > at > org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:774) > ``` > > Now I checked the state sore disk usage; it is less than 40% of the total > disk space available. Restarting the application solves the problem for a > short amount of time, but the error popping up randomly on some other > instances quickly. I tried to change the retry and retry.backoff.ms > configuration but not helpful at all > ``` > retries = 2147483647 > retry.backoff.ms > ``` > After googling for some time I found there was a similar bug reported to the > Kafka team in the past, and also notice my stack trace is exactly matching > with the stack trace of the reported bug. > Here is the link for the bug reported on a comparable basis a year ago. > https://issues.apache.org/jira/browse/KAFKA-7181 > > Now I am wondering is there a workaround for this bug though configuration > changes, or is there something wrong the way I set up the application, the > following are the configuration I have for my stream application. > > ``` > consumer.session.timeout.ms=30000 > metric.reporters=org.apache.kafka.common.metrics.JmxReporter > replication.factor=3 > metadata.max.age.ms=30000 > max.partition.fetch.bytes=2000000 > producer.retries=2147483647 > bootstrap.servers= <bootstrap server list goes here> > metrics.recording.level=DEBUG > producer.retry.backoff.ms=60000 > consumer.auto.offset.reset=latest > application.server=0.0.0.0:6063 > num.standby.replicas=1 > max.poll.records=2 > group.initial.rebalance.delay.ms=30000 > state.dir= <state dir path goes here> > heartbeat.interval.ms=10000 > max.poll.interval.ms=300000 > num.stream.threads=10 > application.id= <application id goes here> > ``` > Note: The original bug reported a year back got a conclusion that it is > related to https://issues.apache.org/jira/browse/KAFKA-7657 and reported > solved in version 2.2.0, but I am using the latest 2.3.0 version. > I appreciate your help concerning this bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)