findepi commented on code in PR #10691:
URL: https://github.com/apache/iceberg/pull/10691#discussion_r1686408402
##########
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##########
@@ -20,84 +20,117 @@
import java.io.Closeable;
import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
+import java.util.Optional;
+import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Future;
-import org.apache.iceberg.exceptions.RuntimeIOException;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Supplier;
import org.apache.iceberg.io.CloseableGroup;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.CloseableIterator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.io.Closer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
public class ParallelIterable<T> extends CloseableGroup implements
CloseableIterable<T> {
+
+ private static final Logger LOG =
LoggerFactory.getLogger(ParallelIterable.class);
+
+ // Logic behind default value: ParallelIterable is often used for file
planning.
+ // Assuming that a DataFile or DeleteFile is about 500 bytes, a 30k limit
uses 14.3 MB of memory.
+ private static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;
Review Comment:
Here we should use "the lowest number possible that doesn't affect
performance negatively", rather than "the highest number possible that we
believe will not blow up the memory". 50 MB is indeed very little. But
uncontrolled 50 MB allocation _per worker thread_ with eg 128 threads doing
some work is 6.25 GB. Some applications like Trino initialize their worker pool
size to # cores or twice # cores, so 128 threads is not an absurd number. Of
course, a machine with 64 cores will have a lot more memory than this mere ~6
GB. Yet, ~6 GB of uncontrolled (and potentially unnecessary) allocation may
lead to problems. Of course, same can be said about 30k size limit (3.75 GB
allocation with 128 threads). I am not saying this particular number is the
best one. But I would rather spend more energy in making the number _smaller_.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]