[jira] [Updated] (SAMZA-2502) Byte array keys be partitioned based on array contents in InMemorySystemProducer

Yixing Zhang (Jira) Thu, 02 Apr 2020 13:34:05 -0700


     [ 
https://issues.apache.org/jira/browse/SAMZA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yixing Zhang updated SAMZA-2502:
--------------------------------
    Description: 
InMemorySystemProducer uses the hashCode of the partition key to decide to 
which partition the message goes. This works well when the key is an object 
whose hashCode method can be override. But in the case when the partition key 
is serialized as a byte[], the message can go to any partition. It turns out 
that the hash code of a byte array is based on the address in memory but not 
the content. Therefore, even though two messages may have same key, they can be 
sent to different partitions after their keys are serialized into byte[] whose 
hash code is kind of random.

 

We want to be able to partition messages based on the contents of the partition 
keys. An easy fix would be: in the case of byte array, we calculate the hash 
code with Arrays.hashCode(byte[] input). This allows us to calculate the hash 
code of the byte array by its contents.

  was:
InMemorySystemProducer uses the hashCode of the partition key to decide to 
which partition the message goes. This works well when the key is an object 
whose hashCode method can be override. But in the case when the partition key 
is serialized as a byte[], the message can go to any partition. It turns out 
that the hash code of a byte array is based on the address in memory but not 
the content. Therefore, even though two messages may have same key, they can be 
sent to different partitions after their keys are serialized into byte[] whose 
hash code is kind of random.

 

We want to be able to partition messages based on the contents of the partition 
keys. An easy fix would be: in the case of byte array, we calculate the hash 
code with Arrays.hashCode(byte[] input). This allows us to calculate the hash 
code of the byte array by it's contents.


> Byte array keys be partitioned based on array contents in 
> InMemorySystemProducer
> --------------------------------------------------------------------------------
>
>                 Key: SAMZA-2502
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2502
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Yixing Zhang
>            Assignee: Yixing Zhang
>            Priority: Major
>
> InMemorySystemProducer uses the hashCode of the partition key to decide to 
> which partition the message goes. This works well when the key is an object 
> whose hashCode method can be override. But in the case when the partition key 
> is serialized as a byte[], the message can go to any partition. It turns out 
> that the hash code of a byte array is based on the address in memory but not 
> the content. Therefore, even though two messages may have same key, they can 
> be sent to different partitions after their keys are serialized into byte[] 
> whose hash code is kind of random.
>  
> We want to be able to partition messages based on the contents of the 
> partition keys. An easy fix would be: in the case of byte array, we calculate 
> the hash code with Arrays.hashCode(byte[] input). This allows us to calculate 
> the hash code of the byte array by its contents.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (SAMZA-2502) Byte array keys be partitioned based on array contents in InMemorySystemProducer

Reply via email to