Hi Karthick,

The choice has to be yours depending on what you want to achieve. I
understand you want to achieve even distribution of messages across your
partitions. This depends on the following factors:

   - The frequency of each key
   - The hashing logic itself

What you can control is the hashing logic. One option is to hardcode a
mapping from each key to a partition number in your partitioning logic
(this assumes you have a small, known pool of distinct keys). That
guarantees your algorithm is not 'biased' when returning the partition
number. For example:

key1 : partition 0
key2 : partition 1
key3 : partition 2
key4 : partition 3
key5 : partition 4
key6 : partition 0
.
.
.
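A minimal Python sketch of the mapping above, assuming the full list of keys
is known up front. The function name build_partition_map is made up for
illustration; in a real producer this lookup would sit inside your custom
partitioner or wherever you choose the partition before sending.

```python
from collections import Counter

def build_partition_map(keys, num_partitions):
    """Assign each distinct key to a partition in round-robin order.

    The order of `keys` determines the mapping, so keep that order stable
    (for example, load it from a config file) across producer restarts.
    """
    return {key: i % num_partitions for i, key in enumerate(keys)}

# The six-key, five-partition example from above.
small_map = build_partition_map([f"key{n}" for n in range(1, 7)], 5)
print(small_map)  # key1 -> 0, key2 -> 1, ..., key5 -> 4, key6 -> 0

# The numbers from this thread: 500 distinct keys over 96 partitions.
partition_map = build_partition_map([f"key{n}" for n in range(1, 501)], 96)
keys_per_partition = Counter(partition_map.values())
print(min(keys_per_partition.values()), max(keys_per_partition.values()))  # 5 6
```

With 500 keys and 96 partitions, every partition ends up with either 5 or 6
keys, which is as even as the key count allows. This only balances the
number of keys per partition, not the number of messages.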

However, if a few specific keys account for most of your messages, skew
cannot be entirely avoided. For example, if key1 and key2 are produced
most of the time, partitions 0 and 1 will carry more load than the other
partitions.
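To make this concrete, here is a small Python sketch that counts messages
per partition for the six-key mapping above. The traffic numbers are made
up purely for illustration, and partition_load is a hypothetical helper,
not part of any Kafka client.

```python
from collections import Counter

def partition_load(message_keys, partition_map, num_partitions):
    """Return the number of messages that land on each partition."""
    load = Counter()
    for key in message_keys:
        load[partition_map[key]] += 1
    return [load[p] for p in range(num_partitions)]

# Perfectly even key-to-partition mapping (at most one extra key anywhere).
partition_map = {"key1": 0, "key2": 1, "key3": 2,
                 "key4": 3, "key5": 4, "key6": 0}

# Hypothetical traffic in which key1 and key2 dominate.
messages = (["key1"] * 400 + ["key2"] * 400 + ["key3"] * 50
            + ["key4"] * 50 + ["key5"] * 50 + ["key6"] * 50)

print(partition_load(messages, partition_map, 5))  # [450, 400, 50, 50, 50]
```

Even though the keys are spread evenly, partitions 0 and 1 receive 85% of
the messages, because ordering per key forces all of a key's messages onto
one partition.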

You need to identify the reason for the skew. Is it the hashing algorithm
or the frequency of the keys themselves? If it is the frequency of keys,
there is not much that can be done within a single topic. In that case you
will have to get creative with your topic design. For example, you could
have separate topics for certain high-frequency keys.
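As a rough Python sketch of that idea: count messages per key, flag the
keys whose share of traffic crosses a threshold, and route those to their
own topic. The 10% threshold, the base topic name "events" and the helper
names are all assumptions for illustration, and the counts are invented.

```python
from collections import Counter

def find_hot_keys(key_counts, threshold=0.10):
    """Return the keys whose share of all messages exceeds `threshold`."""
    total = sum(key_counts.values())
    return {key for key, count in key_counts.items() if count / total > threshold}

def topic_for(key, hot_keys, base_topic="events"):
    """Send hot keys to a dedicated topic; all other keys share the base topic."""
    return f"{base_topic}.{key}" if key in hot_keys else base_topic

# Hypothetical per-key message counts, e.g. sampled from your producer.
key_counts = Counter({"key1": 400, "key2": 400, "key3": 50,
                      "key4": 50, "key5": 50, "key6": 50})

hot_keys = find_hot_keys(key_counts)
print(sorted(hot_keys))               # ['key1', 'key2']
print(topic_for("key1", hot_keys))    # events.key1
print(topic_for("key3", hot_keys))    # events
```

If no key stands out in a count like this but the partitions are still
uneven, the hashing logic is the more likely cause.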

Also, you should first assess why you have 96 partitions. In my
experience that is way too high.

Thanks

On Tue, Aug 20, 2024 at 4:36 PM Karthick <ibmkarthickma...@gmail.com> wrote:

> Hi Akash Jain
> Thanks for the reply seeking help for the same to choose hashing logics.
> Please refer/suggest any.
>
> On Sat, Aug 17, 2024 at 10:21 AM Akash Jain <akashjain0...@gmail.com>
> wrote:
>
> > Hi Karthick. You could implement your own custom partitioner.
> >
> > On Saturday, August 17, 2024, Karthick <ibmkarthickma...@gmail.com>
> wrote:
> >
> > > Hi Team,
> > >
> > > I'm using Kafka partitioning to maintain field-based ordering across
> > > partitions, but I'm experiencing data skewness among the partitions. I
> > have
> > > 96 partitions, and I'm sending data with 500 distinct keys that are
> used
> > > for partitioning. While monitoring the Kafka cluster, I noticed that a
> > few
> > > partitions are underutilized while others are overutilized.
> > >
> > > This seems to be a hashing problem. Can anyone suggest a better hashing
> > > technique or partitioning strategy to balance the load more
> effectively?
> > >
> > > Thanks in advance for your help.
> > >
> >
>
