Hi,

We have an application that integrates Kafka with Spark Streaming: it listens
to Twitter and pushes the tweets to Kafka, where they are later consumed by a
Spark app.

We consistently see one of the Kafka partitions holding more data than the
other partitions, and we have not been able to zero in on the root cause.

We use the tweet id as the message key and also partition on it. We have
established that tweet ids (snowflake) are very evenly distributed: taking the
id modulo an even, a prime, or an odd number of partitions shows no skew.
Still, partition 3 holds more data, and its offset range is always larger than
the offset ranges of the other partitions.
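
To make the distribution check above concrete, here is a minimal sketch of
that kind of test. It assumes the partition is chosen as tweet id modulo the
number of partitions, which may not match what the producer actually does. The
function names and the sample ids are hypothetical placeholders, not our real
code or data.

```python
from collections import Counter


def partition_histogram(keys, num_partitions):
    """Count how many keys land on each partition under key % num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    # Include partitions that received no keys so gaps are visible.
    return {p: counts.get(p, 0) for p in range(num_partitions)}


def skew_ratio(histogram):
    """Largest partition count divided by the mean count (1.0 means even)."""
    mean = sum(histogram.values()) / len(histogram)
    return max(histogram.values()) / mean if mean else 0.0


if __name__ == "__main__":
    # Synthetic placeholder ids, not real tweet ids.
    sample_ids = range(1_000_000, 1_010_000)
    for n in (8, 7, 9):  # an even, a prime, and an odd partition count
        hist = partition_histogram(sample_ids, n)
        print(n, hist, round(skew_ratio(hist), 3))
```

Running the same count over the keys actually read back from each partition,
rather than over the ids before they are produced, would show whether the skew
comes from the keys themselves or from somewhere else in the pipeline.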

Any suggestions or directions to debug this further would be much appreciated.

Thank you.
Gurupraveen