rdblue commented on code in PR #10691:
URL: https://github.com/apache/iceberg/pull/10691#discussion_r1687083709
##########
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##########
@@ -20,84 +20,117 @@
import java.io.Closeable;
import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
+import java.util.Optional;
+import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Future;
-import org.apache.iceberg.exceptions.RuntimeIOException;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Supplier;
import org.apache.iceberg.io.CloseableGroup;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.CloseableIterator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.io.Closer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 public class ParallelIterable<T> extends CloseableGroup implements CloseableIterable<T> {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ParallelIterable.class);
+
+  // Logic behind default value: ParallelIterable is often used for file planning.
+  // Assuming that a DataFile or DeleteFile is about 500 bytes, a 30k limit uses 14.3 MB of memory.
+  private static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;
Review Comment:
I disagree with "the lowest number possible that doesn't affect performance negatively" because we don't know what the effect on performance is. Until we do, we should be more conservative.
Being conservative means a higher limit, hopefully high enough that most use
cases with an active consumer never hit that number of data files in a single
scan. At the same time, we do want to balance memory consumption in cases when
the consumer does pause.
I think that 30k is a good limit because it is higher than the number of files produced by most scans, while keeping memory consumption within a reasonable bound.
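For context, the back-pressure idea being sized here can be sketched as follows. This is a simplified illustration of the mechanism, not the PR's actual implementation; the class name and loop structure are hypothetical. A producer stops enqueueing once the shared queue reaches the configured maximum, so at most DEFAULT_MAX_QUEUE_SIZE entries (about 30,000 x 500 bytes, roughly 14.3 MB) are ever buffered while the consumer is paused:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class BoundedProducerSketch {
  // Default discussed in the review: 30k entries at ~500 bytes each
  // is about 15 MB (~14.3 MiB) of queued file metadata.
  static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;

  public static void main(String[] args) {
    ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
    int produced = 0;
    // Producer loop: stop submitting once the queue is "full". In the
    // real iterable, the producer task would yield and be resumed once
    // the consumer drains the queue, rather than breaking out.
    for (int i = 0; i < 100_000; i++) {
      if (queue.size() >= DEFAULT_MAX_QUEUE_SIZE) {
        break; // back-pressure: pause production at the limit
      }
      queue.add(i);
      produced++;
    }
    System.out.println(produced);
  }
}
```

With an active consumer the limit is rarely hit; it only caps buffering when the consumer stalls, which is the trade-off the comment above is weighing.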
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]