rdblue commented on code in PR #7731:
URL: https://github.com/apache/iceberg/pull/7731#discussion_r1209481290


##########
core/src/main/java/org/apache/iceberg/util/TableScanUtil.java:
##########
@@ -35,16 +38,22 @@
 import org.apache.iceberg.ScanTaskGroup;
 import org.apache.iceberg.SplittableScanTask;
 import org.apache.iceberg.StructLike;
+import org.apache.iceberg.io.CloseableGroup;
 import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.CloseableIterator;
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
 import org.apache.iceberg.types.Types;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class TableScanUtil {
+  private static final Logger LOG = 
LoggerFactory.getLogger(TableScanUtil.class);
+  private static final long MIN_SPLIT_SIZE = 16 * 1024 * 1024; // 16 MB

Review Comment:
   What do you think it should be instead? 8 MB?
   
   This seems pretty reasonable to me. My rationale is that this min size 
should balance task overhead. We want to process enough data that we don't 
waste time in overhead, but not so much that we have a delay because tasks are 
too large. 16 MB will be processed fairly quickly and I doubt that it would be 
a very noticeable bottleneck. On the other hand, our file open cost is 4 MB. I 
would say that we want to be higher than the file open cost since that's a 
rough measure of overhead. So 8 or 16 seems about right to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to