[ 
https://issues.apache.org/jira/browse/KAFKA-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Reid updated KAFKA-2223:
--------------------------------
    Status: Patch Available  (was: Open)

> Improve distribution of data when using hash-based partitioning
> ---------------------------------------------------------------
>
>                 Key: KAFKA-2223
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2223
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: KAFKA-2223.patch
>
>
> Both the DefaultPartitioner and ByteArrayPartitioner choose a partition from 
> the hash code of the key modulo the number of partitions, along the lines of 
> {code}partition = key.hashCode() % numPartitions{code} (conversion to an 
> absolute value is omitted here).
> This approach is entirely dependent on the _lower bits_ of the hash code 
> being uniformly distributed in order to get good distribution of records over 
> multiple partitions. If the lower bits of the key hash code are not uniformly 
> distributed, the key space will not be uniformly distributed over the 
> partitions.
> It can be surprisingly easy to get a very poor distribution. As a simple 
> example, if the keys are integer values that are all divisible by 2 and the 
> number of partitions is even, then only half of the partitions will receive 
> data (as the hash code of an integer is the integer value itself).
> This can even be a problem in situations where you would really not expect 
> it. For example, taking the millisecond timestamp (a long) of each second 
> over a period of 24 hours, encoding it as an 8-byte big-endian byte array, 
> and partitioning it over 50 partitions results in 34 of the 50 partitions 
> not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that 
> applies a supplementary hash function to the return value of the key's 
> hashCode method. This same approach is taken in java.util.HashMap for the 
> exact same reason.
> One potential issue for a change like this to the default partitioner could 
> be backward compatibility, if there is some kind of logic expecting that a 
> given key would be sent to a given partition.
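
The even-integer case from the description, and the effect of a supplementary
hash, can be sketched as follows. This is only an illustration, not the
contents of KAFKA-2223.patch: the class and method names are hypothetical, the
spread function is modelled on the supplemental hash used by older
java.util.HashMap implementations, and masking with 0x7fffffff is just one way
of making the hash non-negative before taking the modulo.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of Kafka or of the attached patch.
public class PartitionSpreadDemo {

    // Scheme from the description: (non-negative) hash code modulo partition count.
    static int naivePartition(int hashCode, int numPartitions) {
        return (hashCode & 0x7fffffff) % numPartitions;
    }

    // Supplementary hash in the style of older java.util.HashMap versions:
    // folds higher bits into the lower bits so the modulo sees more of the hash.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same modulo scheme, applied after the supplementary hash.
    static int spreadPartition(int hashCode, int numPartitions) {
        return (spread(hashCode) & 0x7fffffff) % numPartitions;
    }

    // Counts how many distinct partitions receive at least one of the
    // even integer keys 0, 2, 4, ..., 2 * (numKeys - 1).
    static int usedPartitions(boolean useSpread, int numKeys, int numPartitions) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < numKeys; i++) {
            int hash = Integer.valueOf(2 * i).hashCode(); // equals the int value itself
            used.add(useSpread
                    ? spreadPartition(hash, numPartitions)
                    : naivePartition(hash, numPartitions));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("naive:  " + usedPartitions(false, 1000, 10) + " of 10 partitions used");
        System.out.println("spread: " + usedPartitions(true, 1000, 10) + " of 10 partitions used");
    }
}
```

With 1000 even keys and 10 partitions, the naive scheme only ever reaches the
5 even-numbered partitions, while the spread variant reaches all 10. As noted
in the description, a change of this kind also changes which partition a given
key maps to.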



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
