rdblue commented on code in PR #10691:
URL: https://github.com/apache/iceberg/pull/10691#discussion_r1687083709
##########
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##########
@@ -20,84 +20,117 @@
import java.io.Closeable;
import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
+import java.util.Optional;
+import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Future;
-import org.apache.iceberg.exceptions.RuntimeIOException;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Supplier;
import org.apache.iceberg.io.CloseableGroup;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.CloseableIterator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.io.Closer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 public class ParallelIterable<T> extends CloseableGroup implements CloseableIterable<T> {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ParallelIterable.class);
+
+  // Logic behind default value: ParallelIterable is often used for file planning.
+  // Assuming that a DataFile or DeleteFile is about 500 bytes, a 30k limit uses 14.3 MB of memory.
+  private static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;
Review Comment:
I disagree with "the lowest number possible that doesn't affect performance negatively" because we don't know what the effect on performance is. Until we do, we should be more conservative.
Being conservative means a higher limit, hopefully high enough that most use
cases with an active consumer never hit that number of data files in a single
scan. At the same time, we do want to balance memory consumption in cases when
the consumer does pause.
I think that 30k is a good limit because it is higher than the number of files produced by most scans, while keeping memory consumption within a reasonable bound.
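For context, the back-pressure idea being sized here can be sketched as follows. This is a simplified illustration of the mechanism, not the PR's actual implementation; the class name and loop structure are hypothetical. A producer stops enqueueing once the shared queue reaches the configured maximum, so at most DEFAULT_MAX_QUEUE_SIZE entries (about 30,000 x 500 bytes, roughly 14.3 MB) are ever buffered while the consumer is paused:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class BoundedProducerSketch {
  // Default discussed in the review: 30k entries at ~500 bytes each
  // is about 15 MB (~14.3 MiB) of queued file metadata.
  static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;

  public static void main(String[] args) {
    ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
    int produced = 0;
    // Producer loop: stop submitting once the queue is "full". In the
    // real iterable, the producer task would yield and be resumed once
    // the consumer drains the queue, rather than breaking out.
    for (int i = 0; i < 100_000; i++) {
      if (queue.size() >= DEFAULT_MAX_QUEUE_SIZE) {
        break; // back-pressure: pause production at the limit
      }
      queue.add(i);
      produced++;
    }
    System.out.println(produced);
  }
}
```

With an active consumer the limit is rarely hit; it only caps buffering when the consumer stalls, which is the trade-off the comment above is weighing.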
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]