[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r465428805 ## File path: hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java ## @@ -252,8 +250,27 @@ public void testUpsertPartitionerWithSmallInsertHandling() throws Exception { assertEquals(BucketType.INSERT, partitioner.getBucketInfo(2).bucketType, "Bucket 2 is INSERT"); assertEquals(3, insertBuckets.size(), "Total of 3 insert buckets"); -assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket must be same as update bucket"); -assertEquals(0.5, insertBuckets.get(0).weight, 0.01, "First insert bucket should have weight 0.5"); + +assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, +"First insert bucket must be same as update bucket"); Review comment: Got it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r465428805 ## File path: hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java ## @@ -252,8 +250,27 @@ public void testUpsertPartitionerWithSmallInsertHandling() throws Exception { assertEquals(BucketType.INSERT, partitioner.getBucketInfo(2).bucketType, "Bucket 2 is INSERT"); assertEquals(3, insertBuckets.size(), "Total of 3 insert buckets"); -assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket must be same as update bucket"); -assertEquals(0.5, insertBuckets.get(0).weight, 0.01, "First insert bucket should have weight 0.5"); + +assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, +"First insert bucket must be same as update bucket"); Review comment: got it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r464775753 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java ## @@ -270,20 +273,24 @@ public int getPartition(Object key) { return updateLocationToBucket.get(location.getFileId()); } else { String partitionPath = keyLocation._1().getPartitionPath(); - List targetBuckets = partitionPathToInsertBuckets.get(partitionPath); + List targetBuckets = partitionPathToInsertBucketInfos.get(partitionPath); // pick the target bucket to use based on the weights. - double totalWeight = 0.0; final long totalInserts = Math.max(1, profile.getWorkloadStat(partitionPath).getNumInserts()); final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", keyLocation._1().getRecordKey()); final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / totalInserts; - for (InsertBucket insertBucket : targetBuckets) { -totalWeight += insertBucket.weight; -if (r <= totalWeight) { - return insertBucket.bucketNumber; -} + + int index = Collections.binarySearch(targetBuckets, new InsertBucketCumulativeWeightPair(new InsertBucket(), r)); + + if (index >= 0) { Review comment: Below add a testcase for Collections.binarySearch. ``` @Test public void test() { List cumulativeWeighs = new ArrayList<>(); cumulativeWeighs.add(0.1); cumulativeWeighs.add(0.5); cumulativeWeighs.add(1.0); assertEquals (0, Collections.binarySearch(cumulativeWeighs, 0.1)); assertEquals (1, Collections.binarySearch(cumulativeWeighs, 0.5)); assertEquals (2, Collections.binarySearch(cumulativeWeighs, 1.0)); assertEquals (-1, Collections.binarySearch(cumulativeWeighs, 0.01)); assertEquals (-2, Collections.binarySearch(cumulativeWeighs, 0.2)); assertEquals (-3, Collections.binarySearch(cumulativeWeighs, 0.6)); assertEquals (-4, Collections.binarySearch(cumulativeWeighs, 1.1)); } ``` If the the search key is bigger than all the elements in the list, it will return (-(insertion point) - 1)= -3 -1 = -4 . So I have to add the following judge: ``` if ((-1 * index - 1) < targetBuckets.size()) { return targetBuckets.get((-1 * index - 1)).getKey().bucketNumber; } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r464768984 ## File path: hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java ## @@ -269,8 +267,8 @@ public void testUpsertPartitionerWithSmallInsertHandling() throws Exception { assertEquals(BucketType.INSERT, partitioner.getBucketInfo(3).bucketType, "Bucket 3 is INSERT"); assertEquals(4, insertBuckets.size(), "Total of 4 insert buckets"); -assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket must be same as update bucket"); -assertEquals(200.0 / 2400, insertBuckets.get(0).weight, 0.01, "First insert bucket should have weight 0.5"); +assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, "First insert bucket must be same as update bucket"); +assertEquals(200.0 / 2400, insertBuckets.get(0).getKey().weight, 0.01, "First insert bucket should have weight 0.5"); Review comment: Sure, I will add for all buckets. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r464014872 ## File path: hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java ## @@ -269,8 +267,8 @@ public void testUpsertPartitionerWithSmallInsertHandling() throws Exception { assertEquals(BucketType.INSERT, partitioner.getBucketInfo(3).bucketType, "Bucket 3 is INSERT"); assertEquals(4, insertBuckets.size(), "Total of 4 insert buckets"); -assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket must be same as update bucket"); -assertEquals(200.0 / 2400, insertBuckets.get(0).weight, 0.01, "First insert bucket should have weight 0.5"); +assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, "First insert bucket must be same as update bucket"); +assertEquals(200.0 / 2400, insertBuckets.get(0).getKey().weight, 0.01, "First insert bucket should have weight 0.5"); Review comment: Yes, it should be 0.08, the testcase is not add by me, I am also curious. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r464014566 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java ## @@ -270,20 +273,24 @@ public int getPartition(Object key) { return updateLocationToBucket.get(location.getFileId()); } else { String partitionPath = keyLocation._1().getPartitionPath(); - List targetBuckets = partitionPathToInsertBuckets.get(partitionPath); + List targetBuckets = partitionPathToInsertBucketInfos.get(partitionPath); // pick the target bucket to use based on the weights. - double totalWeight = 0.0; final long totalInserts = Math.max(1, profile.getWorkloadStat(partitionPath).getNumInserts()); final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", keyLocation._1().getRecordKey()); final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / totalInserts; - for (InsertBucket insertBucket : targetBuckets) { -totalWeight += insertBucket.weight; -if (r <= totalWeight) { - return insertBucket.bucketNumber; -} + + int index = Collections.binarySearch(targetBuckets, new InsertBucketCumulativeWeightPair(new InsertBucket(), r)); + + if (index >= 0) { Review comment: The last buckets cumulative weight should be 1, and the search entry should not greater than the last entry. Even if the search entry greater than all entries, it will return the last bucketNumber. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r464013661 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java ## @@ -99,7 +100,7 @@ public UpsertPartitioner(WorkloadProfile profile, JavaSparkContext jsc, HoodieTa assignInserts(profile, jsc); LOG.info("Total Buckets :" + totalBuckets + ", buckets info => " + bucketInfoMap + ", \n" -+ "Partition to insert buckets => " + partitionPathToInsertBuckets + ", \n" ++ "Partition to insert buckets => " + partitionPathToInsertBucketInfos + ", \n" Review comment: The parent class Pair has toString() method: ``` @Override public String toString() { return new StringBuilder().append('(').append(getLeft()).append(',').append(getRight()).append(')').toString(); } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key
shenh062326 commented on a change in pull request #1868: URL: https://github.com/apache/hudi/pull/1868#discussion_r463932989 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java ## @@ -272,21 +275,44 @@ public int getPartition(Object key) { String partitionPath = keyLocation._1().getPartitionPath(); List targetBuckets = partitionPathToInsertBuckets.get(partitionPath); // pick the target bucket to use based on the weights. - double totalWeight = 0.0; final long totalInserts = Math.max(1, profile.getWorkloadStat(partitionPath).getNumInserts()); final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", keyLocation._1().getRecordKey()); final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / totalInserts; - for (InsertBucket insertBucket : targetBuckets) { -totalWeight += insertBucket.weight; -if (r <= totalWeight) { - return insertBucket.bucketNumber; -} + + int index = binarySearch(targetBuckets, r); + if (index >= 0) { +return targetBuckets.get(index).bucketNumber; + } + + if (-1 * index - 1 < targetBuckets.size()) { Review comment: Sure, I will fix the comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org