Hi, we have an integrated Kafka and Spark Streaming app: it listens to Twitter and pushes the tweets to Kafka, and a Spark app later consumes them.
We constantly see one Kafka partition holding more data than the others, and we have not been able to zero in on the root cause. We use the tweet id as the message key, and we partition on that key as well. We established that tweet ids (snowflake) are very evenly distributed: we checked them against even, prime, and odd numbers of partitions and saw no issues with the distribution. Still, partition 3 has more data, and its offset range is always larger than the offset ranges of the other partitions.

Any suggestions or directions for debugging this further would be much appreciated.

Thank you,
Gurupraveen
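
For anyone who wants to reproduce the kind of distribution check described above, here is a minimal sketch. It assumes the partition is simply the key modulo the partition count; the partitioner actually used by the producer may hash the serialized key instead, so `partition_fn` is left pluggable. The names `partition_counts`, `skew_ratio`, and `partition_fn` are hypothetical helpers, not part of Kafka or Spark.

```python
def partition_counts(keys, num_partitions, partition_fn=None):
    """Count how many keys land in each partition.

    partition_fn(key, num_partitions) -> partition index.
    Defaults to plain modulo, which is an assumption and may not
    match what the real producer does.
    """
    if partition_fn is None:
        partition_fn = lambda key, n: key % n
    counts = {p: 0 for p in range(num_partitions)}
    for key in keys:
        counts[partition_fn(key, num_partitions)] += 1
    return counts


def skew_ratio(counts):
    """Ratio of the fullest partition to the mean partition size.

    A value close to 1.0 means the keys are spread evenly.
    """
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean if mean else 0.0
```

Running this over a sample of real tweet ids for several partition counts shows whether the keys themselves are skewed. If the ratio stays near 1.0 here but partition 3 is still larger in Kafka, the difference is more likely to come from how the producer maps keys to partitions than from the ids.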