clintropolis commented on issue #8578: parallel broker merges on fork join pool URL: https://github.com/apache/incubator-druid/pull/8578#issuecomment-548152735 ### worst case The other really important part to highlight is how this approach degrades under increasing concurrency, by looking at the absolute worst case scenario for this approach: the concurrent load spike. I consider this the nemesis of the current parallelism chooser, since it can dramatically under-estimate the available CPU capacity. For reference, I included a measurement where the `ParallelMergeCombiningSequence` level of parallelism was limited to 1. This is what the computed level of parallelism trends towards when load is gradually increased and sustained, and so it shows how continued overloading is expected to behave. From the left of the following line plots, the vertical dotted lines mark the number of physical cores and the number of hyper-threads of the benchmark machine. ![concurrency-m5 8xl-burst](https://user-images.githubusercontent.com/1577461/67905082-8c73a500-fb2d-11e9-8f7c-baecc07fc8e9.gif) This looks pretty bad for parallelism under huge load spikes, but keep in mind that this measurement is the time to complete *all* threads, not the average per-thread timing (so at the far right of the plot, it is the time to simultaneously complete 4 times as many threads as there are processors). Adding a 10ms offset before starting each concurrent thread, to help even out the load, shrinks the performance degradation significantly: ![concurrency-m5 8xl-10ms-offset-burst](https://user-images.githubusercontent.com/1577461/67904913-0e170300-fb2d-11e9-9e0b-24c8c076aff8.gif) and supports tying the poor load spike performance to the greedy parallelism chooser.
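To illustrate why a greedy chooser trends toward a parallelism of 1 under a simultaneous burst, here is a hypothetical sketch (not Druid's actual implementation; the class and method names are invented for illustration): each incoming merge reserves threads based on how much CPU currently appears free, so later arrivals in a spike see earlier reservations and get starved down to the minimum.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical greedy parallelism chooser, sketched to show the failure
 * mode described above: during a concurrent load spike, each query
 * reserves threads from what appears free, so queries arriving after the
 * pool looks fully reserved degrade to a parallelism of 1.
 */
public class GreedyParallelismChooser {
  private final int totalThreads;
  private final AtomicInteger reserved = new AtomicInteger(0);

  public GreedyParallelismChooser(int totalThreads) {
    this.totalThreads = totalThreads;
  }

  /** Grant up to 'requested' threads, capped by apparently-free capacity, floor of 1. */
  public int acquire(int requested) {
    while (true) {
      int current = reserved.get();
      int free = Math.max(totalThreads - current, 1);
      int grant = Math.max(Math.min(requested, free), 1);
      if (reserved.compareAndSet(current, current + grant)) {
        return grant;
      }
    }
  }

  /** Return a grant to the pool when a merge finishes. */
  public void release(int grant) {
    reserved.addAndGet(-grant);
  }

  public static void main(String[] args) {
    // 16 hyper-threads; a burst of 8 simultaneous queries each wanting 4 threads.
    GreedyParallelismChooser chooser = new GreedyParallelismChooser(16);
    int[] grants = new int[8];
    for (int i = 0; i < 8; i++) {
      grants[i] = chooser.acquire(4);
    }
    // First four queries get full parallelism; once 16 threads are reserved,
    // the rest of the burst collapses to sequential merges.
    assert grants[0] == 4;
    assert grants[3] == 4;
    assert grants[4] == 1;
    assert grants[7] == 1;
  }
}
```

Under sustained load the reservations never fully drain, which is why the computed parallelism trends toward the parallelism-of-1 reference line in the plots above.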
#### blocking Looking at the `ParallelMergeCombiningSequenceJmhThreadsBenchmark` output shows similar behavior, but the apparent degradation doesn't look as bad because the measurement is not entirely driven by the slowest thread. ![concurrency-m5 8xl-non-blocking-threaded](https://user-images.githubusercontent.com/1577461/67905376-5c78d180-fb2e-11e9-9e06-153180c8d738.gif) This is because many of the concurrent threads don't see particularly worse performance, and the worst performing threads are averaged into the overall time. It also produces a plot that grows in the same linear fashion as the same-thread approach currently in use. Introducing blocking input sequences caused some erratic behavior at the 10ms target task run time that I found surprising, but it was repeatable. As far as I have been able to determine, it is caused by the increased overall number of tasks that need to run on the `ForkJoinPool`, and by additional contention since there are more operations adding to and taking from a `BlockingQueue`. ![concurrency-m5 8xl-initially-blocking-100-500ms](https://user-images.githubusercontent.com/1577461/67904793-afea2000-fb2c-11e9-821a-4a7a2a23bdd1.gif) The larger task run time produces the fewest tasks and cooperative blocking operations, which is why it was ultimately chosen as the default. Longer delays and smaller result sets also shrink the apparent difference between the approach in this PR and the current approach, even without delays between threads: ![concurrency-m5 8xl-initially-blocking-4000-5000ms](https://user-images.githubusercontent.com/1577461/67905953-5a177700-fb30-11e9-8e87-e7f1999161d9.gif)
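For readers unfamiliar with how blocking interacts with a `ForkJoinPool`, the standard JDK mechanism for the "cooperative blocking operations" mentioned above is `ForkJoinPool.ManagedBlocker`: it tells the pool a worker is about to block on something like a `BlockingQueue`, so the pool can compensate with a spare worker instead of stalling. The sketch below is illustrative only (the class and helper names are invented, and it is not the PR's actual code); it shows why each blocking take adds both pool bookkeeping and queue contention.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

/**
 * Illustrative sketch of cooperative blocking on a ForkJoinPool using the
 * JDK's ManagedBlocker API. Each blocking take is wrapped so the pool knows
 * a worker may stall; this keeps the pool live, but every such operation
 * adds overhead and queue contention, matching the erratic behavior seen
 * with small target task run times.
 */
public class CooperativeBlockingExample {
  /** Take from the queue while notifying the ForkJoinPool we may block. */
  static <T> T cooperativeTake(BlockingQueue<T> queue) throws InterruptedException {
    final Object[] result = new Object[1];
    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
      @Override
      public boolean block() throws InterruptedException {
        if (result[0] == null) {
          result[0] = queue.take(); // actually block for an element
        }
        return true;
      }

      @Override
      public boolean isReleasable() {
        // Try a non-blocking poll first so we avoid blocking entirely
        // when an element is already available.
        if (result[0] == null) {
          result[0] = queue.poll();
        }
        return result[0] != null;
      }
    });
    @SuppressWarnings("unchecked")
    T value = (T) result[0];
    return value;
  }

  public static void main(String[] args) throws Exception {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(8);
    ForkJoinPool pool = ForkJoinPool.commonPool();
    // Producer task standing in for a slow, initially-blocking input sequence.
    pool.execute(() -> {
      try {
        Thread.sleep(50);
        queue.put(42);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Integer taken = pool.submit(() -> cooperativeTake(queue)).get();
    System.out.println("took " + taken);
  }
}
```

More blocking takes mean more `ManagedBlocker` invocations and more threads contending on the queue's lock, which is consistent with larger target task run times (fewer, bigger tasks) behaving the most predictably.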