Hi all,

I've got a Kafka Streams application running in a Kubernetes environment. The topology has two aggregations (and therefore two KTables), both of which get fairly large: the first is around 200 GB and the second around 500 GB.

As with any K8s platform, pods occasionally get rescheduled or go down, which of course causes my application to rebalance. What I'm seeing, though, is that the application can spend hours rebalancing, with no errors thrown and no other obvious cause for the repeated rebalances. All I can see in the logs is that an instance will be restoring a state store from its changelog topic, then suddenly have its partitions revoked and begin the join-group process all over again. (I'm running 10 pods/instances of the app and see the same pattern on every instance.)

In some cases it never recovers from this rebalance cycle, even after 12 hours or more, and I've had to scale the application down completely and start over by purging the application state and re-consuming the source topic from earliest. Interestingly, after purging and starting from scratch, the application recovers from rebalances fairly easily.
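In case it helps, the topology is shaped roughly like this. This is a simplified sketch, not my actual code: the topic name, types, and aggregation logic are placeholders, but the structure (two large aggregations, each materialized as a RocksDB-backed KTable with a changelog topic) matches what I described above.

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class TopologySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Source stream; topic name and types are placeholders.
        KStream<String, String> events = builder.stream("source-topic");

        // First aggregation: RocksDB store plus changelog topic.
        // This store grows to roughly 200 GB in my case.
        KTable<String, Long> firstAgg = events
            .groupByKey()
            .aggregate(
                () -> 0L,
                (key, value, agg) -> agg + 1L,
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("agg1-store"));

        // Second aggregation downstream of the first: another RocksDB
        // store plus changelog, roughly 500 GB in my case.
        KTable<String, Long> secondAgg = firstAgg
            .toStream()
            .groupByKey()
            .aggregate(
                () -> 0L,
                (key, value, agg) -> agg + value,
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("agg2-store"));

        builder.build(); // topology handed to KafkaStreams elsewhere
    }
}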
The storage I'm using is a NAS device, which admittedly is not particularly fast (it uses spinning disks and is shared with other tenants). As an experiment, I tried switching the K8s storage to an in-memory option (at the K8s layer only; the application still uses the same RocksDB stores) to see if that helps. As it turns out, I never hit the rebalance problem with the in-memory persistence layer: if a pod goes down, the application spends around 10 to 15 minutes rebalancing and is then back to processing data.

At this point my main question is: when I'm using the NAS storage and the state stores are this large, could I be hitting a timeout somewhere that doesn't allow the restore process to complete, which then triggers another rebalance? In other words, is the restore simply taking too long given the amount of data to restore and the slow storage?

I'm currently using Kafka 2.4.1, but I saw the same behavior on 2.3. I am using a custom RocksDB config setter to limit off-heap memory, but removing it made no difference to the rebalance problem. Again, there are no errors that I can see, and nothing else in the logs indicates why the application can never finish rebalancing. I've turned on DEBUG logging, but I'm having a tough time sifting through the volume of messages; I'm still looking.

If anyone has any ideas I would appreciate it, thanks!

Alex C
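P.S. For clarity, these are the consumer timeouts I had in mind when asking about "a timeout somewhere." I believe the values below are the 2.4 defaults (please correct me if these aren't the relevant knobs):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class RebalanceTimeouts {
    // Consumer timeouts that can trigger a rebalance, with what I believe
    // are the 2.4 defaults. Since KIP-442 (shipped in 2.3), Streams no
    // longer overrides max.poll.interval.ms to Integer.MAX_VALUE, so the
    // plain consumer default of 5 minutes applies.
    public static Properties defaults() {
        Properties props = new Properties();
        // Max time between poll() calls before the member is kicked out.
        props.put(StreamsConfig.consumerPrefix(
            ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
        // Max time without heartbeats before the member is declared dead.
        props.put(StreamsConfig.consumerPrefix(
            ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 10000);
        // How often the background thread heartbeats to the coordinator.
        props.put(StreamsConfig.consumerPrefix(
            ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
        return props;
    }
}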
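P.P.S. For reference, my RocksDB config setter follows the bounded-memory pattern from the Streams docs and looks roughly like this (the sizes here are placeholders, not my real values):

import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class BoundedMemoryConfigSetter implements RocksDBConfigSetter {

    static {
        RocksDB.loadLibrary();
    }

    // Shared across all stores on the instance; sizes are placeholders.
    private static final Cache CACHE = new LRUCache(512 * 1024 * 1024L);
    private static final WriteBufferManager WRITE_BUFFER_MANAGER =
        new WriteBufferManager(128 * 1024 * 1024L, CACHE);

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig =
            (BlockBasedTableConfig) options.tableFormatConfig();
        // Route block-cache and memtable memory through the shared cache
        // so total off-heap usage stays bounded.
        tableConfig.setBlockCache(CACHE);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(WRITE_BUFFER_MANAGER);
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // The cache and write-buffer manager are shared by every store,
        // so they are intentionally not closed per store.
    }
}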