stevenzwu commented on code in PR #10691:
URL: https://github.com/apache/iceberg/pull/10691#discussion_r1685542742
##########
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##########
@@ -20,84 +20,117 @@
import java.io.Closeable;
import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
+import java.util.Optional;
+import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Future;
-import org.apache.iceberg.exceptions.RuntimeIOException;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.function.Supplier;
import org.apache.iceberg.io.CloseableGroup;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.CloseableIterator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.io.Closer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
public class ParallelIterable<T> extends CloseableGroup implements CloseableIterable<T> {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ParallelIterable.class);
+
+  // Logic behind default value: ParallelIterable is often used for file planning.
+  // Assuming that a DataFile or DeleteFile is about 500 bytes, a 30k limit uses 14.3 MB of memory.
+  private static final int DEFAULT_MAX_QUEUE_SIZE = 30_000;
Review Comment:
Finding a good default here is a bit tricky, as it depends on two variables:
1) the consumer speed, which is hard to predict
2) the `Thread.sleep(10)` in the `checkTasks` while loop of the `hasNext` method. Half of the queue size should be large enough to avoid starving the consumer.

Anyway, I am good with the default here since I don't know how to come up with a better number. I would also be fine going a little higher, like 50K: even assuming 1 KB per item, that is 50 MB, which is pretty small on a modern computer. Since we are changing from unbounded to some bound, a higher value technically cannot make the problem worse than it was before.
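
To make the interplay above concrete, here is a minimal, self-contained sketch (not the code from this PR): a few producer tasks back off once a shared queue reaches a cap, while a consumer polls and sleeps 10 ms when the queue is empty. The class name, the 4-thread pool, and the `MAX_QUEUE_SIZE`/`TOTAL_ITEMS` constants are made up for illustration; only the 30k cap and the 10 ms sleep come from the discussion.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedQueueSketch {
  // Hypothetical cap mirroring DEFAULT_MAX_QUEUE_SIZE from the PR.
  private static final int MAX_QUEUE_SIZE = 30_000;
  private static final int TOTAL_ITEMS = 1_000_000;

  public static void main(String[] args) throws InterruptedException {
    ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
    AtomicInteger produced = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(4);

    // Producers back off once the queue reaches the cap, so memory stays bounded.
    // (Note: ConcurrentLinkedQueue.size() is O(n); this is only for illustration.)
    for (int t = 0; t < 4; t++) {
      pool.submit(() -> {
        while (produced.get() < TOTAL_ITEMS) {
          if (queue.size() >= MAX_QUEUE_SIZE) {
            Thread.yield(); // a real task would yield its slot and be resumed later
            continue;
          }
          queue.offer(produced.incrementAndGet());
        }
      });
    }

    // Consumer: roughly models hasNext() polling, sleeping 10 ms when the
    // queue is momentarily empty. If producers cannot refill the drained
    // part of the queue within that window, the consumer starves.
    long consumed = 0;
    while (consumed < TOTAL_ITEMS) {
      Integer item = queue.poll();
      if (item == null) {
        Thread.sleep(10); // the sleep the comment above refers to
        continue;
      }
      consumed++;
    }

    pool.shutdownNow();
    System.out.println("consumed " + consumed + " items");
  }
}
```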
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]