varun-lakhyani commented on PR #15341:
URL: https://github.com/apache/iceberg/pull/15341#issuecomment-4027799942

   Updated 
   
   > This is really great initial testing. I've only taken a small look, but I
   > would recommend you try out using ParallelIterable for parallelism rather than
   > the current implementation. You could use this in conjunction with
   > ThreadPools.getWorkerPool as a default (which auto-scales with the reported CPU
   > stats), although I think leaving it configurable is also interesting (and good
   > for testing).
   > 
   > ```java
   > Iterable<Iterable<T>> taskIterables = tasks.stream()
   >     .map(task -> (Iterable<T>) () -> open(task))
   >     .collect(toList());
   > ParallelIterable<T> parallel =
   >     new ParallelIterable<>(taskIterables, ThreadPools.getWorkerPool());
   > ```
   > 
   > Could you try that out with your same benchmark? I know you are using only
   > 10 files, but I'd really be interested in the scale of improvement all the way
   > up to 10 threads (one per file). My hunch is we can basically get an order of
   > magnitude at least.
   
   Updated the code and ran the benchmark.
   
   For 1000 files, 15-20 KB each:

   Overhead (ms) | Async (s) | Sync (s) (existing) | % Improvement
   -- | -- | -- | --
   No manual overhead | 0.855 | 0.842 | -1.5%
   1 | 0.863 | 2.674 | 67.7%
   5 | 1.172 | 9.270 | 87.4%
   10 | 1.670 | 15.629 | 89.3%
   15 | 2.145 | 21.100 | 89.8%
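
The "% Improvement" column appears to be the relative reduction in wall-clock time, (sync - async) / sync. A minimal sketch that reproduces the figures from the timings above; the class and method names are hypothetical and not part of the PR:

```java
class ImprovementCalc {
  // Relative time saved by async compared with sync, as a percentage.
  // A negative result means async was slower than sync.
  static double improvementPercent(double asyncSeconds, double syncSeconds) {
    return (syncSeconds - asyncSeconds) / syncSeconds * 100.0;
  }

  public static void main(String[] args) {
    // Timings copied from the benchmark table: {async, sync} in seconds.
    double[][] rows = {
      {0.855, 0.842}, {0.863, 2.674}, {1.172, 9.270}, {1.670, 15.629}, {2.145, 21.100}
    };
    for (double[] row : rows) {
      System.out.printf("async=%.3f sync=%.3f improvement=%.1f%%%n",
          row[0], row[1], improvementPercent(row[0], row[1]));
    }
  }
}
```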
   
   One concern: with no manual overhead, async is slightly slower than sync
(about 1.5%), presumably because of the cost of managing threads in this
implementation. As the per-file overhead increases (which is common for cloud
connections), the improvement approaches 90%.
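
To illustrate why the gain grows with per-file overhead, here is a rough sketch using only the JDK's `ExecutorService`. It is not the PR's code and does not use Iceberg's `ParallelIterable`; `openFile` is a hypothetical stand-in that sleeps to mimic the manual overhead, and the file and thread counts are arbitrary assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OverheadOverlapDemo {
  // Hypothetical stand-in for opening one file: sleeps to mimic per-file latency.
  static int openFile(long overheadMs) {
    try {
      Thread.sleep(overheadMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return 1;
  }

  // Opens files one after another, so the overheads add up.
  static long sequentialMillis(int files, long overheadMs) {
    long start = System.nanoTime();
    for (int i = 0; i < files; i++) {
      openFile(overheadMs);
    }
    return (System.nanoTime() - start) / 1_000_000;
  }

  // Opens files on a fixed-size pool, so up to `threads` overheads overlap.
  static long parallelMillis(int files, long overheadMs, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      long start = System.nanoTime();
      List<Future<Integer>> futures = new ArrayList<>();
      for (int i = 0; i < files; i++) {
        Callable<Integer> task = () -> openFile(overheadMs);
        futures.add(pool.submit(task));
      }
      for (Future<Integer> future : futures) {
        future.get(); // wait for every file to finish
      }
      return (System.nanoTime() - start) / 1_000_000;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("sequential ms: " + sequentialMillis(20, 10));
    System.out.println("parallel ms:   " + parallelMillis(20, 10, 10));
  }
}
```

With zero overhead the pool setup and task hand-off are pure cost, which is consistent with the small regression in the first row of the table; once each open blocks for a few milliseconds, the waits overlap and the parallel run finishes much sooner.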


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

