[
https://issues.apache.org/jira/browse/KAFKA-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabriel Reid updated KAFKA-2223:
--------------------------------
Status: Patch Available (was: Open)
> Improve distribution of data when using hash-based partitioning
> ---------------------------------------------------------------
>
> Key: KAFKA-2223
> URL: https://issues.apache.org/jira/browse/KAFKA-2223
> Project: Kafka
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Attachments: KAFKA-2223.patch
>
>
> Both the DefaultPartitioner and ByteArrayPartitioner base themselves on the
> hash code of keys modulo the number of partitions, along the lines of
> {code}partition = key.hashCode() % numPartitions{code} (converting to
> absolute value is ommitted here)
> This approach is entirely dependent on the _lower bits_ of the hash code
> being uniformly distributed in order to get good distribution of records over
> multiple partitions. If the lower bits of the key hash code are not uniformly
> distributed, the key space will not be uniformly distributed over the
> partitions.
> It can be surprisingly easy to get a very poor distribution. As a simple
> example, if the keys are integer values and are all divisible by 2, then only
> half of the partitions will receive data (as the hash code of an integer is
> the integer value itself).
> This can even be a problem in situations where you would really not expect
> it. For example, taking the 8-byte big-endian byte-array representation of
> longs for each timestamp of each second over a period of 24 hours (at
> millisecond granularity) and partitioning it over 50 partitions results in 34
> of the 50 partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that
> applies a supplementary hash function to the return value of the key's
> hashCode method. This same approach is taken in java.util.HashMap for the
> exact same reason.
> One potential issue for a change like this to the default partitioner could
> be backward compatibility, if there is some kind of logic expecting that a
> given key would be sent to a given partition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)