varun-lakhyani commented on PR #15341: URL: https://github.com/apache/iceberg/pull/15341#issuecomment-4027799942
> This is really great initial testing. I've only taken a quick look, but I would recommend you try using ParallelIterable for parallelism rather than the current implementation. You could use this in conjunction with ThreadPools.getWorkerPool as a default (which auto-scales with the reported CPU stats), although I think leaving it configurable is also interesting and good for testing.
>
> ```java
> Iterable<Iterable<T>> taskIterables = tasks.stream()
>     .map(task -> (Iterable<T>) () -> open(task))
>     .collect(toList());
> ParallelIterable<T> parallel = new ParallelIterable<>(taskIterables, ThreadPools.getWorkerPool());
> ```
>
> Could you try that out with your same benchmark? I know you are using only 10 files, but I'd be really interested in the scale of improvement all the way up to 10 threads (one per file). My hunch is we can get at least an order of magnitude.

Updated the code and ran the benchmark.

**For 1000 files, 15-20 KB each**

Overhead (ms) | Async (s) | Sync (s) (existing) | % Improvement
-- | -- | -- | --
No manual overhead | 0.855 | 0.842 | -1.5%
1 | 0.863 | 2.674 | 67.7%
5 | 1.172 | 9.270 | 87.4%
10 | 1.670 | 15.629 | 89.3%
15 | 2.145 | 21.100 | 89.8%

One concern: with no manual overhead, async actually takes slightly longer (about 1.5%), because this implementation spends extra time managing threads. As the per-file overhead increases, which is common for cloud connections, the improvement approaches 90%.
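For reference, the % Improvement column is consistent with (sync - async) / sync, rounded to one decimal place. Here is a minimal sketch that recomputes the column from the reported timings, assuming that formula. `ImprovementCheck` and `improvementPercent` are illustrative names, not part of the PR.

```java
import java.util.Locale;

public class ImprovementCheck {

    // Relative time saved by the async path compared to the sync path, in percent.
    // Assumed formula: (sync - async) / sync * 100. Negative means async was slower.
    public static double improvementPercent(double asyncSeconds, double syncSeconds) {
        return (syncSeconds - asyncSeconds) / syncSeconds * 100.0;
    }

    public static void main(String[] args) {
        // Timings copied from the benchmark table above (1000 files).
        String[] overhead = {"none", "1 ms", "5 ms", "10 ms", "15 ms"};
        double[] asyncSeconds = {0.855, 0.863, 1.172, 1.670, 2.145};
        double[] syncSeconds = {0.842, 2.674, 9.270, 15.629, 21.100};

        for (int i = 0; i < overhead.length; i++) {
            double pct = improvementPercent(asyncSeconds[i], syncSeconds[i]);
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            System.out.println(String.format(Locale.ROOT, "%-6s %6.1f%%", overhead[i], pct));
        }
    }
}
```

The negative value in the first row is the thread-management cost; it is small in absolute terms (13 ms over 1000 files).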
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
