[ 
https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157418#comment-17157418
 ] 

sivabalan narayanan commented on HUDI-1082:
-------------------------------------------

I ran some simulations. By and large, the distribution looks fine: the actual 
counts stay within 10% of the expected counts.

Weights for partition1: [0.2, 0.3, 0.5]

Testing for entry. Partition1 total inserts 924. Total inserts 10354
Actual dist [0=995, 1=1488, 2=2517]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10378
Actual dist [0=1008, 1=1481, 2=2511]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10393
Actual dist [0=979, 1=1502, 2=2519]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 924. Total inserts 10401
Actual dist [0=956, 1=1560, 2=2484]
Expected dist [0=1030, 1=1513, 2=2457]
 Testing for entry. Partition1 total inserts 990. Total inserts 10350
Actual dist [0=1030, 1=1518, 2=2452]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 12350
Actual dist [0=983, 1=1524, 2=2493]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 14350
Actual dist [0=977, 1=1505, 2=2518]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 16800
Actual dist [0=1065, 1=1479, 2=2456]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 990. Total inserts 21000
Actual dist [0=1015, 1=1503, 2=2482]
Expected dist [0=998, 1=1478, 2=2524]
 Testing for entry. Partition1 total inserts 1210. Total inserts 25170
Actual dist [0=974, 1=1489, 2=2537]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41368
Actual dist [0=981, 1=1583, 2=2436]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41246
Actual dist [0=1027, 1=1464, 2=2509]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 41930
Actual dist [0=1024, 1=1449, 2=2527]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 94320
Actual dist [0=1033, 1=1459, 2=2508]
Expected dist [0=1036, 1=1489, 2=2475]
 Testing for entry. Partition1 total inserts 1210. Total inserts 90000
Actual dist [0=996, 1=1510, 2=2494]
Expected dist [0=1036, 1=1489, 2=2475]
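
A simulation of this kind could be sketched roughly as below. This is not the 
code that produced the numbers above; the class name, the use of 
java.util.Random as a stand-in for record hashes, and the seed are all 
assumptions made for illustration. It draws random hashes, normalises each one 
by the per-partition insert count, and counts how many land in each bucket.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not Hudi code. */
class BucketDistributionSim {

    /**
     * Assigns 'records' random hashes to buckets using the per-partition
     * insert count as the normalising denominator and returns bucket counts.
     */
    static int[] simulate(double[] weights, long partitionInserts, int records, long seed) {
        Random rnd = new Random(seed); // stand-in for real record-key hashes
        int[] counts = new int[weights.length];
        for (int n = 0; n < records; n++) {
            long hash = rnd.nextLong();
            // Map the hash onto [0, 1) using the partition's own insert count.
            double r = (double) Math.floorMod(hash, partitionInserts) / partitionInserts;
            double cumulative = 0.0;
            int bucket = weights.length - 1; // fallback for floating-point round-off
            for (int i = 0; i < weights.length; i++) {
                cumulative += weights[i];
                if (r < cumulative) {
                    bucket = i;
                    break;
                }
            }
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.3, 0.5};
        int records = 5000;
        int[] actual = simulate(weights, 924, records, 42L);
        long[] expected = new long[weights.length];
        for (int i = 0; i < weights.length; i++) {
            expected[i] = Math.round(weights[i] * records);
        }
        System.out.println("Actual dist   " + Arrays.toString(actual));
        System.out.println("Expected dist " + Arrays.toString(expected));
    }
}
```

With a uniform hash, each bucket's share should approach its weight as the 
number of records grows, which is the property the runs above are checking.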

> Bug in deciding the upsert/insert buckets
> -----------------------------------------
>
>                 Key: HUDI-1082
>                 URL: https://issues.apache.org/jira/browse/HUDI-1082
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Writer Core
>    Affects Versions: 0.6.0
>            Reporter: sivabalan narayanan
>            Assignee: Hong Shen
>            Priority: Major
>             Fix For: 0.6.1
>
>
> In 
> [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java],
> when getPartition(Object key) is called, the logic that determines which 
> bucket a record is inserted into relies on globalInsertCounts, whereas it 
> should use perPartitionInsertCount.
>  
> This matters because the weights for all target insert buckets are computed 
> from the total inserts going into the partition of interest (check line 
> 200), whereas getPartition(key) uses the global insert count to pick the 
> bucket.
>  
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3; total records to be 
> inserted: 100.
> P2: 4 insert buckets with weights 0.1, 0.8, 0.05 and 0.05; total records to 
> be inserted: 10025.
> So, ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 501, B3 -> 502
>  
> The bucket returned by getPartition() for a key is determined as follows:
> mod(hash value, inserts) / inserts.
> Here, inserts should be the insert count for the partition of interest, but 
> currently we use the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 21, 30, 90, 500, 1490, 10019.
>
> record hash | expected bucket in P1 | actual bucket in P1
> 5           | B0                    | B0
> 14          | B0                    | B0
> 21          | B1                    | B0
> 30          | B1                    | B0
> 90          | B2                    | B0
> 500         | B0                    | B0
> 1490        | B2                    | B1
> 10019       | B0                    | B3
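
The effect of the denominator can be seen in a small sketch of the bucket 
choice described above. Everything here is hypothetical: the class and method 
names are illustrative and are not Hudi's actual code, and raw numbers are used 
directly as hash values. With the per-partition count (100) it reproduces the 
"expected" column for P1. With the global count (100 + 10025) small hashes 
collapse into B0. The "actual" values for the two largest hashes depend on 
details the description does not spell out, so this sketch may not match them.

```java
/** Hypothetical sketch; names are illustrative, not Hudi's actual API. */
class BucketPickSketch {

    /**
     * Maps the hash onto [0, 1) by taking it modulo 'denominator' and dividing
     * by 'denominator', then walks the cumulative weights to pick a bucket.
     */
    static int pickBucket(long hash, long denominator, double[] weights) {
        double r = (double) Math.floorMod(hash, denominator) / denominator;
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (r < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point round-off
    }

    public static void main(String[] args) {
        double[] p1Weights = {0.2, 0.5, 0.3};
        long p1Inserts = 100;            // inserts going into P1
        long globalInserts = 100 + 10025; // inserts across P1 and P2
        long[] hashes = {5, 14, 21, 30, 90, 500, 1490, 10019};
        for (long h : hashes) {
            System.out.printf("hash=%d perPartition=B%d global=B%d%n",
                h,
                pickBucket(h, p1Inserts, p1Weights),
                pickBucket(h, globalInserts, p1Weights));
        }
    }
}
```

Passing the per-partition count as the denominator keeps the normalised value 
on the same scale as the weights, which were themselves computed per partition.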



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
