[ https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157418#comment-17157418 ]
sivabalan narayanan commented on HUDI-1082:
-------------------------------------------

I ran some simulations. By and large, the distribution seems fine, staying within about 10% of the expected counts.

Weights for partition1: [0.2, 0.3, 0.5]

Testing for entry. Partition1 total inserts 924. Total inserts 10354
Actual dist [0=995, 1=1488, 2=2517]
Expected dist [0=1030, 1=1513, 2=2457]
Testing for entry. Partition1 total inserts 924. Total inserts 10378
Actual dist [0=1008, 1=1481, 2=2511]
Expected dist [0=1030, 1=1513, 2=2457]
Testing for entry. Partition1 total inserts 924. Total inserts 10393
Actual dist [0=979, 1=1502, 2=2519]
Expected dist [0=1030, 1=1513, 2=2457]
Testing for entry. Partition1 total inserts 924. Total inserts 10401
Actual dist [0=956, 1=1560, 2=2484]
Expected dist [0=1030, 1=1513, 2=2457]
Testing for entry. Partition1 total inserts 990. Total inserts 10350
Actual dist [0=1030, 1=1518, 2=2452]
Expected dist [0=998, 1=1478, 2=2524]
Testing for entry. Partition1 total inserts 990. Total inserts 12350
Actual dist [0=983, 1=1524, 2=2493]
Expected dist [0=998, 1=1478, 2=2524]
Testing for entry. Partition1 total inserts 990. Total inserts 14350
Actual dist [0=977, 1=1505, 2=2518]
Expected dist [0=998, 1=1478, 2=2524]
Testing for entry. Partition1 total inserts 990. Total inserts 16800
Actual dist [0=1065, 1=1479, 2=2456]
Expected dist [0=998, 1=1478, 2=2524]
Testing for entry. Partition1 total inserts 990. Total inserts 21000
Actual dist [0=1015, 1=1503, 2=2482]
Expected dist [0=998, 1=1478, 2=2524]
Testing for entry. Partition1 total inserts 1210. Total inserts 25170
Actual dist [0=974, 1=1489, 2=2537]
Expected dist [0=1036, 1=1489, 2=2475]
Testing for entry. Partition1 total inserts 1210. Total inserts 41368
Actual dist [0=981, 1=1583, 2=2436]
Expected dist [0=1036, 1=1489, 2=2475]
Testing for entry. Partition1 total inserts 1210. Total inserts 41246
Actual dist [0=1027, 1=1464, 2=2509]
Expected dist [0=1036, 1=1489, 2=2475]
Testing for entry. Partition1 total inserts 1210. Total inserts 41930
Actual dist [0=1024, 1=1449, 2=2527]
Expected dist [0=1036, 1=1489, 2=2475]
Testing for entry. Partition1 total inserts 1210. Total inserts 94320
Actual dist [0=1033, 1=1459, 2=2508]
Expected dist [0=1036, 1=1489, 2=2475]
Testing for entry. Partition1 total inserts 1210. Total inserts 90000
Actual dist [0=996, 1=1510, 2=2494]
Expected dist [0=1036, 1=1489, 2=2475]

> Bug in deciding the upsert/insert buckets
> -----------------------------------------
>
>                 Key: HUDI-1082
>                 URL: https://issues.apache.org/jira/browse/HUDI-1082
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Writer Core
>    Affects Versions: 0.6.0
>            Reporter: sivabalan narayanan
>            Assignee: Hong Shen
>            Priority: Major
>             Fix For: 0.6.1
>
>
> In
> [UpsertPartitioner|[https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]],
> when getPartition(Object key) is called, the logic that determines which
> bucket the record is inserted into relies on globalInsertCounts, whereas it
> should rely on perPartitionInsertCount.
>
> This is because the weights for all target insert buckets are determined
> from the total inserts going into the partition of interest. // check like 200.
> When getPartition(key) is called, however, we use the global insert count to
> determine the right bucket.
>
> For instance,
> P1: 3 insert buckets with weights 0.2, 0.5 and 0.3, with 100 total records to
> be inserted.
> P2: 4 insert buckets with weights 0.1, 0.8, 0.05 and 0.05, with 10025 total
> records to be inserted.
> So, the ideal allocation is
> P1: B0 -> 20, B1 -> 50, B2 -> 30
> P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
>
> The bucket returned by getPartition() for a key is determined by the following:
> mod(hash value, inserts) / inserts.
> Instead of using the insert count for the partition of interest, we currently
> use the global insert count.
> Let's say these are the hash values for insert records in P1:
> 5, 14, 20, 25, 90, 500, 1001, 5180.
> record hash | expected bucket in P1 | actual bucket in P1 |
> 5           | B0                    | B0
> 14          | B0                    | B0
> 21          | B1                    | B0
> 30          | B1                    | B0
> 90          | B2                    | B0
> 500         | B0                    | B0
> 1490        | B2                    | B1
> 10019       | B0                    | B3
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
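
The bucket-selection rule quoted in the issue, mod(hash value, inserts) / inserts compared against the bucket weights, can be sketched as follows. This is only an illustration in Python, not the UpsertPartitioner code: the function name `pick_bucket`, the use of cumulative weights, and the tie-breaking at bucket boundaries are all assumptions. With the P1 insert count (100) it reproduces the "expected bucket" column of the table for the listed hashes; with the global count (100 + 10025 = 10125) nearly every record collapses into B0, which is the skew the issue describes. The global-count results are not guaranteed to match the table's "actual bucket" column row for row, since the sketch follows only the formula as stated.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Sequence


def pick_bucket(record_hash: int, insert_count: int, weights: Sequence[float]) -> int:
    """Return the index of the insert bucket for one record (hypothetical helper).

    r = (hash mod inserts) / inserts lies in [0, 1); the record goes to the
    first bucket whose cumulative weight is greater than r.
    """
    r = (record_hash % insert_count) / insert_count
    cumulative = list(accumulate(weights))
    # Clamp in case floating-point rounding leaves the last cumulative
    # weight slightly below 1.0.
    return min(bisect_right(cumulative, r), len(weights) - 1)


if __name__ == "__main__":
    p1_weights = [0.2, 0.5, 0.3]      # P1 bucket weights from the issue
    p1_inserts = 100                  # inserts going into P1
    global_inserts = 100 + 10025      # inserts across P1 and P2

    print("hash   per-partition  global")
    for h in [5, 14, 21, 30, 90, 500, 1490, 10019]:
        per_partition = pick_bucket(h, p1_inserts, p1_weights)
        with_global = pick_bucket(h, global_inserts, p1_weights)
        print(f"{h:<6} B{per_partition:<13} B{with_global}")
```

Dividing by the per-partition count keeps r uniformly spread over [0, 1) for that partition's records, so the weights are honoured; dividing by a much larger global count compresses r toward 0 for small hash values.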