aokolnychyi commented on a change in pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#discussion_r649370371



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -162,4 +163,27 @@ private void validateOptions() {
         "Cannot set %s is less than 1. All values less than 1 have the same 
effect as 1. %d < 1",
         MIN_INPUT_FILES, minInputFiles);
   }
+
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Ideally every Spark Task that is generated will be less than or equal to our target size but
+   * in practice this is not the case. When we actually write our files, they may exceed the target
+   * size and end up being split. This would end up producing 2 files out of one task, one target sized
+   * and one very small file. Since the output file can vary in size, it is better to
+   * use a slightly larger (but still within threshold) size for actually writing the tasks out.
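
For reference, a minimal sketch of how such a slightly-larger write size could be derived. The `writeMaxFileSize()` name, the `maxFileSize` field, and the halfway-point heuristic are assumptions for illustration, not necessarily what this PR implements:

    // Hypothetical sketch: pick a write size between the target and the
    // max-file-size threshold so a small overshoot does not force a split.
    protected long writeMaxFileSize() {
      // maxFileSize is an assumed field backed by a max-file-size option.
      return (long) (targetFileSize + ((maxFileSize - targetFileSize) * 0.5));
    }

With a 512 MB target and, say, a 640 MB threshold, this yields a 576 MB write size, so a task that produces 514 MB is written as one file instead of being split.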

Review comment:
   We have seen this quite a lot when tiny Parquet files are compacted into larger ones, as compaction changes the encoding on many columns. In most cases, the actual file size is bigger than what we estimated.
   
   I am not sure about Avro. Cases where the estimation is precise enough should work as expected. The main cases we are trying to avoid are splitting a 514 MB output into 512 MB and 2 MB files, and writing 1 GB files when the target is 512 MB.
   
   The ideal solution is to know how much the remaining rows are going to cost us.
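   
   To make the 514 MB case concrete, here is a toy, self-contained sketch of the split behavior (the `splitSizes` helper and the 576 MB write size are illustrative assumptions, not code from this PR):
   
   public class SplitExample {
     // Splits a task's output at a fixed cutoff (illustrative only).
     static long[] splitSizes(long outputBytes, long cutoffBytes) {
       return outputBytes <= cutoffBytes
           ? new long[] {outputBytes}
           : new long[] {cutoffBytes, outputBytes - cutoffBytes};
     }
   
     public static void main(String[] args) {
       long mb = 1024L * 1024;
       // Cutting at the 512 MB target: [512 MB, 2 MB] -> one tiny remainder file.
       System.out.println(java.util.Arrays.toString(splitSizes(514 * mb, 512 * mb)));
       // Cutting at a 576 MB write size: [514 MB] -> a single file.
       System.out.println(java.util.Arrays.toString(splitSizes(514 * mb, 576 * mb)));
     }
   }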



