weimingdiit commented on code in PR #7362:
URL: https://github.com/apache/hudi/pull/7362#discussion_r1040556114


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "record size estimate compute dynamically based on commit metadata. "
           + " This is critical in computing the insert parallelism and bin-packing inserts into small files.");
 
+  public static final ConfigProperty<String> COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
+          .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")
+          .defaultValue(String.valueOf(100))
+          .withDocumentation("Although dynamic sampling is adopted, if the record size assumed by the user is unreasonable during the first write execution, "
+                  + "files that are too large or too small will be generated. Therefore, sampling is conducted from the data set during the first write operation. "
+                  + "In order to ensure performance, this parameter controls the absolute value of sampling.");

Review Comment:
   On the first write, it is difficult to choose a reasonable default record
size. When the first write is large, any error in the record size assumed by
the user is amplified by the data volume and produces many small files. For
that reason I think it is more reasonable to give the user two sampling
parameters for the first write: an absolute number and a sampling ratio.
   If [hoodie.copyonwrite.record.dynamic.sample.maxnum] is removed and only
[hoodie.copyonwrite.record.dynamic.sample.ratio] is set, then for a large first
write even a very small ratio can still require sampling a very large number of
records, which is unnecessary.
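
   To make the argument concrete, here is a minimal sketch of how the two
parameters could interact. It assumes the sample count is the ratio-based count
capped by the absolute maximum; the class and method names are hypothetical and
are not taken from the PR or from Hudi.

```java
public class SampleSizeSketch {

    /**
     * Hypothetical helper (not Hudi code): number of records to sample on the
     * first write, assuming the ratio-based count is capped by an absolute maximum.
     *
     * @param totalRecords number of incoming records in the first write
     * @param sampleRatio  stands in for hoodie.copyonwrite.record.dynamic.sample.ratio
     * @param sampleMaxNum stands in for hoodie.copyonwrite.record.dynamic.sample.maxnum
     */
    public static long sampleSize(long totalRecords, double sampleRatio, long sampleMaxNum) {
        // Count implied by the ratio alone; grows linearly with the input size.
        long byRatio = (long) Math.ceil(totalRecords * sampleRatio);
        // The absolute cap keeps the sampling cost bounded for very large inputs.
        return Math.min(byRatio, sampleMaxNum);
    }

    public static void main(String[] args) {
        long total = 1_000_000_000L; // a large first write
        double ratio = 0.001;        // a "very small" ratio

        // Ratio only (no effective cap): the sample still scales with the input.
        System.out.println("ratio only:  " + sampleSize(total, ratio, Long.MAX_VALUE));
        // Ratio plus cap: the sample is bounded by the maxnum default of 100.
        System.out.println("ratio + cap: " + sampleSize(total, ratio, 100L));
    }
}
```

   With only the ratio, the sample grows in proportion to the input; with the
cap, it stays bounded no matter how large the first write is.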
   


