I'll comment more later, but to briefly address the last points: memory usage wasn't affected as much as I expected once this got into the wild. I think that's partly because merges are threaded in as soon as they are ready from the nodes, so if the skew in query times across your nodes is comparable to the broker's parallel merging throughput, not much accumulates on the brokers. For example, if a broker can merge 100 node results in 0.5 s, and node response times on your historicals swing between 0.25 s and 0.75 s, then by the time the 0.25 s responses have been merged in, the 0.75 s ones are only just starting to stack up for merging. I don't have data to prove this hypothesis yet, though.
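To make the merge-as-they-arrive idea concrete, here's a minimal, hypothetical sketch (not Druid's actual code) using a `CompletionService`: partial results are folded in as soon as each "node" responds, so the broker only buffers whatever outruns its merge throughput. The latencies are made-up numbers matching the example above.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IncrementalMerge {
    public static void main(String[] args) throws Exception {
        ExecutorService nodes = Executors.newFixedThreadPool(4);
        CompletionService<Integer> completion = new ExecutorCompletionService<>(nodes);

        // Hypothetical "nodes" with skewed response times (ms).
        int[] latencies = {250, 400, 600, 750};
        for (int latency : latencies) {
            completion.submit(() -> {
                Thread.sleep(latency);
                return latency; // stand-in for a partial result
            });
        }

        // Merge each partial result as soon as it is ready, instead of
        // buffering everything first: memory held on the "broker" stays
        // small while merge throughput keeps up with the arrival rate.
        int merged = 0;
        for (int i = 0; i < latencies.length; i++) {
            merged += completion.take().get();
        }
        nodes.shutdown();
        System.out.println(merged); // 2000
    }
}
```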
Starvation of the mergeFJP is a very real concern; that is why `ForkJoinPool.managedBlock` is needed. When too many of its threads are blocked, the FJP detects this and compensates by activating spare threads. I don't have strong examples of group-by queries on brokers using this approach, so I haven't exercised that code path yet, but the topN code path seems to be working as expected without hitting thread starvation. I'll definitely check out the `ChainedExecutionQueryRunner` approach. [ Full content available at: https://github.com/apache/incubator-druid/pull/5913 ]
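For readers unfamiliar with `ForkJoinPool.managedBlock`, here is a small self-contained sketch of the pattern (not the PR's code): a `ManagedBlocker` wraps a blocking queue take, so the pool knows a worker is about to block and can add a compensation thread instead of starving.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

public class ManagedBlockExample {
    // Wraps a blocking take() so the FJP can compensate with a spare thread.
    static final class QueueBlocker<T> implements ForkJoinPool.ManagedBlocker {
        private final BlockingQueue<T> queue;
        private T item;

        QueueBlocker(BlockingQueue<T> queue) { this.queue = queue; }

        @Override
        public boolean block() throws InterruptedException {
            if (item == null) {
                item = queue.take(); // may block; the pool is aware of it
            }
            return true; // done blocking
        }

        @Override
        public boolean isReleasable() {
            // True if a result is already available, so no blocking is needed.
            return item != null || (item = queue.poll()) != null;
        }

        T get() { return item; }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> results = new ArrayBlockingQueue<>(4);
        ForkJoinPool pool = ForkJoinPool.commonPool();

        pool.submit(() -> {
            try {
                Thread.sleep(100);          // simulate a slow historical node
                results.put("node-result");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        String merged = pool.submit(() -> {
            QueueBlocker<String> blocker = new QueueBlocker<>(results);
            ForkJoinPool.managedBlock(blocker); // pool compensates while we wait
            return blocker.get();
        }).get();

        System.out.println(merged); // node-result
    }
}
```

The key design point is that a plain `queue.take()` inside an FJP task would silently pin a worker thread; routing the wait through `managedBlock` is what lets the pool "adjust its thread pool appropriately" as described above.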
