[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-04 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r465428805



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java
##
@@ -252,8 +250,27 @@ public void testUpsertPartitionerWithSmallInsertHandling() 
throws Exception {
 assertEquals(BucketType.INSERT, partitioner.getBucketInfo(2).bucketType,
 "Bucket 2 is INSERT");
 assertEquals(3, insertBuckets.size(), "Total of 3 insert buckets");
-assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket 
must be same as update bucket");
-assertEquals(0.5, insertBuckets.get(0).weight, 0.01, "First insert bucket 
should have weight 0.5");
+
+assertEquals(0, insertBuckets.get(0).getKey().bucketNumber,
+"First insert bucket must be same as update bucket");

Review comment:
   Got it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-04 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r465428805



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java
##
@@ -252,8 +250,27 @@ public void testUpsertPartitionerWithSmallInsertHandling() 
throws Exception {
 assertEquals(BucketType.INSERT, partitioner.getBucketInfo(2).bucketType,
 "Bucket 2 is INSERT");
 assertEquals(3, insertBuckets.size(), "Total of 3 insert buckets");
-assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket 
must be same as update bucket");
-assertEquals(0.5, insertBuckets.get(0).weight, 0.01, "First insert bucket 
should have weight 0.5");
+
+assertEquals(0, insertBuckets.get(0).getKey().bucketNumber,
+"First insert bucket must be same as update bucket");

Review comment:
   got it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-03 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r464775753



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##
@@ -270,20 +273,24 @@ public int getPartition(Object key) {
   return updateLocationToBucket.get(location.getFileId());
 } else {
   String partitionPath = keyLocation._1().getPartitionPath();
-  List targetBuckets = 
partitionPathToInsertBuckets.get(partitionPath);
+  List targetBuckets = 
partitionPathToInsertBucketInfos.get(partitionPath);
   // pick the target bucket to use based on the weights.
-  double totalWeight = 0.0;
   final long totalInserts = Math.max(1, 
profile.getWorkloadStat(partitionPath).getNumInserts());
   final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", 
keyLocation._1().getRecordKey());
   final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / 
totalInserts;
-  for (InsertBucket insertBucket : targetBuckets) {
-totalWeight += insertBucket.weight;
-if (r <= totalWeight) {
-  return insertBucket.bucketNumber;
-}
+
+  int index = Collections.binarySearch(targetBuckets, new 
InsertBucketCumulativeWeightPair(new InsertBucket(), r));
+
+  if (index >= 0) {

Review comment:
   Below add a testcase for Collections.binarySearch.
   ```
 @Test
 public void test() {
   List cumulativeWeighs = new ArrayList<>();
   cumulativeWeighs.add(0.1);
   cumulativeWeighs.add(0.5);
   cumulativeWeighs.add(1.0);
   
   assertEquals (0, Collections.binarySearch(cumulativeWeighs, 0.1));
   assertEquals (1, Collections.binarySearch(cumulativeWeighs, 0.5));
   assertEquals (2, Collections.binarySearch(cumulativeWeighs, 1.0));
   
   assertEquals (-1, Collections.binarySearch(cumulativeWeighs, 0.01));
   assertEquals (-2, Collections.binarySearch(cumulativeWeighs, 0.2));
   assertEquals (-3, Collections.binarySearch(cumulativeWeighs, 0.6));
   assertEquals (-4, Collections.binarySearch(cumulativeWeighs, 1.1));
 }
   ```
   
   If the the search key is bigger than all the elements in the list, it will 
return (-(insertion point) - 1)= -3 -1 = -4 .
   So I have to add the following judge:
   ```
 if ((-1 * index - 1) < targetBuckets.size()) {
   return targetBuckets.get((-1 * index - 1)).getKey().bucketNumber;
 }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-03 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r464768984



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java
##
@@ -269,8 +267,8 @@ public void testUpsertPartitionerWithSmallInsertHandling() 
throws Exception {
 assertEquals(BucketType.INSERT, partitioner.getBucketInfo(3).bucketType,
 "Bucket 3 is INSERT");
 assertEquals(4, insertBuckets.size(), "Total of 4 insert buckets");
-assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket 
must be same as update bucket");
-assertEquals(200.0 / 2400, insertBuckets.get(0).weight, 0.01, "First 
insert bucket should have weight 0.5");
+assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, "First insert 
bucket must be same as update bucket");
+assertEquals(200.0 / 2400, insertBuckets.get(0).getKey().weight, 0.01, 
"First insert bucket should have weight 0.5");

Review comment:
   Sure, I will add for all buckets.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-01 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r464014872



##
File path: 
hudi-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java
##
@@ -269,8 +267,8 @@ public void testUpsertPartitionerWithSmallInsertHandling() 
throws Exception {
 assertEquals(BucketType.INSERT, partitioner.getBucketInfo(3).bucketType,
 "Bucket 3 is INSERT");
 assertEquals(4, insertBuckets.size(), "Total of 4 insert buckets");
-assertEquals(0, insertBuckets.get(0).bucketNumber, "First insert bucket 
must be same as update bucket");
-assertEquals(200.0 / 2400, insertBuckets.get(0).weight, 0.01, "First 
insert bucket should have weight 0.5");
+assertEquals(0, insertBuckets.get(0).getKey().bucketNumber, "First insert 
bucket must be same as update bucket");
+assertEquals(200.0 / 2400, insertBuckets.get(0).getKey().weight, 0.01, 
"First insert bucket should have weight 0.5");

Review comment:
   Yes, it should be 0.08, the testcase is not add by me, I am also curious.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-01 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r464014566



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##
@@ -270,20 +273,24 @@ public int getPartition(Object key) {
   return updateLocationToBucket.get(location.getFileId());
 } else {
   String partitionPath = keyLocation._1().getPartitionPath();
-  List targetBuckets = 
partitionPathToInsertBuckets.get(partitionPath);
+  List targetBuckets = 
partitionPathToInsertBucketInfos.get(partitionPath);
   // pick the target bucket to use based on the weights.
-  double totalWeight = 0.0;
   final long totalInserts = Math.max(1, 
profile.getWorkloadStat(partitionPath).getNumInserts());
   final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", 
keyLocation._1().getRecordKey());
   final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / 
totalInserts;
-  for (InsertBucket insertBucket : targetBuckets) {
-totalWeight += insertBucket.weight;
-if (r <= totalWeight) {
-  return insertBucket.bucketNumber;
-}
+
+  int index = Collections.binarySearch(targetBuckets, new 
InsertBucketCumulativeWeightPair(new InsertBucket(), r));
+
+  if (index >= 0) {

Review comment:
   The last buckets cumulative weight should be 1, and the search entry 
should not greater than the last entry.
   Even if the search entry greater than all entries, it will return the last 
bucketNumber.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-01 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r464013661



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##
@@ -99,7 +100,7 @@ public UpsertPartitioner(WorkloadProfile profile, 
JavaSparkContext jsc, HoodieTa
 assignInserts(profile, jsc);
 
 LOG.info("Total Buckets :" + totalBuckets + ", buckets info => " + 
bucketInfoMap + ", \n"
-+ "Partition to insert buckets => " + partitionPathToInsertBuckets + 
", \n"
++ "Partition to insert buckets => " + partitionPathToInsertBucketInfos 
+ ", \n"

Review comment:
   The parent class Pair has toString() method:
   ```
 @Override
 public String toString() {
   return new 
StringBuilder().append('(').append(getLeft()).append(',').append(getRight()).append(')').toString();
 }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-01 Thread GitBox


shenh062326 commented on a change in pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#discussion_r463932989



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##
@@ -272,21 +275,44 @@ public int getPartition(Object key) {
   String partitionPath = keyLocation._1().getPartitionPath();
   List targetBuckets = 
partitionPathToInsertBuckets.get(partitionPath);
   // pick the target bucket to use based on the weights.
-  double totalWeight = 0.0;
   final long totalInserts = Math.max(1, 
profile.getWorkloadStat(partitionPath).getNumInserts());
   final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", 
keyLocation._1().getRecordKey());
   final double r = 1.0 * Math.floorMod(hashOfKey, totalInserts) / 
totalInserts;
-  for (InsertBucket insertBucket : targetBuckets) {
-totalWeight += insertBucket.weight;
-if (r <= totalWeight) {
-  return insertBucket.bucketNumber;
-}
+
+  int index = binarySearch(targetBuckets, r);
+  if (index >= 0) {
+return targetBuckets.get(index).bucketNumber;
+  }
+
+  if (-1 * index - 1 < targetBuckets.size()) {

Review comment:
   Sure, I will fix the comments.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org