Hello Siva,

To better understand your situation, I need to ask a few more questions:

1) What triggers your REBALANCING event?
2) Does your application contain any state stores? If yes, how are they configured (persistent or in-memory, is logging enabled, etc.)?
3) What is your commit interval, as configured via "commit.interval.ms"?

To get better insight into what is happening, you can:

1) Set a StateRestoreListener via KafkaStreams#setGlobalStateRestoreListener (details can be found here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-167%3A+Add+interface+for+the+state+store+restoration+process) to see how much data is being restored during the task resuming process.
2) Monitor the state store restoration metrics (https://kafka.apache.org/documentation/#kafka_streams_store_monitoring), such as "restore-latency-avg" and "restore-rate".
3) Look into your log4j logs for "partition revocation took" and "partition assignment took" entries and check their time difference.

Guozhang

On Sun, Jul 29, 2018 at 10:37 AM, Siva Ram <sivaraman...@gmail.com> wrote:

> Hi,
>
> Kafka version 1.0.0 (can't upgrade to another version yet due to legacy
> dependency)
>
> The stream application uses the low-level Processor API and maintains state. A
> topic is set up with 30 partitions and I have split it across 2 stream application
> instances consuming the same topic, each with 15 threads. The application
> starts fine and runs well until REBALANCING occurs. When it does, the
> application takes a long time to move back to RUNNING status by itself. During
> this time no exception and no additional logging occurs in the application.
>
> 1) Could this behavior be due to an issue on the Kafka broker, OR is this
> related to the stream application?
>
> 2) What logging can we increase to get additional insight into what causes
> this paused state for a significant period of time (this is impacting the
> throughput)?
>
> FYI, we have set the REQUEST TIMEOUT to max integer value to avoid
> timeout.
> In the event we have a single application with 30 threads, I
> don't see this long pause, but that means we need to increase the number of
> threads and memory, which is vertical scaling and not feasible for handling
> a topic with significant volume.
>
> *Instance 1:*
>
> 2018-07-29 01:45:43 INFO StreamStateListener22 - Stream application moved from RUNNING to REBALANCING
> 2018-07-29 02:15:59 INFO StreamStateListener22 - Stream application moved from REBALANCING to RUNNING
>
> 2018-07-29 05:19:18 INFO StreamStateListener22 - Stream application moved from RUNNING to REBALANCING
> 2018-07-29 05:54:00 INFO StreamStateListener22 - Stream application moved from REBALANCING to RUNNING
>
> *Instance 2:*
>
> 2018-07-29 01:45:58 INFO StreamStateListener22 - Stream application moved from RUNNING to REBALANCING
> 2018-07-29 02:41:22 INFO StreamStateListener22 - Stream application moved from REBALANCING to RUNNING
>
> 2018-07-29 05:19:33 INFO StreamStateListener22 - Stream application moved from RUNNING to REBALANCING
> 2018-07-29 05:54:14 INFO StreamStateListener22 - Stream application moved from REBALANCING to RUNNING
>
> Thanks,
> Siva

--
-- Guozhang
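
For reference, the length of each REBALANCING pause in the quoted logs can be worked out from the listener timestamps, in the same way one would compare the "partition revocation took" and "partition assignment took" entries. The following is only a minimal sketch in plain Java: the class name RebalancePause and the method pause are illustrative and not part of Kafka, and the timestamps are copied from the log lines in the thread.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class RebalancePause {
    // Timestamp layout used by the StreamStateListener22 log lines in the thread.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Time between the "RUNNING to REBALANCING" entry and the
    // following "REBALANCING to RUNNING" entry.
    static Duration pause(String rebalancingAt, String runningAt) {
        return Duration.between(
                LocalDateTime.parse(rebalancingAt, FMT),
                LocalDateTime.parse(runningAt, FMT));
    }

    public static void main(String[] args) {
        // {label, entered REBALANCING, back to RUNNING}, taken from the quoted logs.
        String[][] transitions = {
            {"Instance 1", "2018-07-29 01:45:43", "2018-07-29 02:15:59"},
            {"Instance 1", "2018-07-29 05:19:18", "2018-07-29 05:54:00"},
            {"Instance 2", "2018-07-29 01:45:58", "2018-07-29 02:41:22"},
            {"Instance 2", "2018-07-29 05:19:33", "2018-07-29 05:54:14"},
        };
        for (String[] t : transitions) {
            Duration d = pause(t[1], t[2]);
            System.out.printf("%s: %d min %d s%n",
                    t[0], d.toMinutes(), d.getSeconds() % 60);
        }
    }
}
```

On the four transitions above this gives pauses of roughly 30 to 55 minutes per rebalance, which is the window to line up against the restoration metrics and the revocation/assignment log entries.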