I'll comment more later, but briefly addressing the last points:

There wasn't as much of an effect on memory usage as I expected once it got 
out into the wild here. I think this is partly because the merges get 
threaded in as soon as results are ready from the nodes, and if the skew in 
query times across your nodes is comparable to the broker's parallel merging 
throughput, then not much accumulates on the brokers. For example, if a 
broker can merge 100 node responses in 0.5 seconds, and node response times 
on your historicals swing between 0.25s and 0.75s, then by the time the 
0.25s responses have been merged in, the 0.75s ones are only just starting 
to stack up to be merged. I don't have anything to prove this hypothesis, 
though.
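The hypothesis above can be sketched with a toy backlog model. This is purely illustrative, not measured broker behavior: `maxBacklog`, the 5 ms per-merge cost (100 merges in 0.5 s), and the arrival windows are all assumptions taken from the example numbers in the paragraph.

```java
import java.util.Arrays;

public class MergeBacklogSketch {
    /**
     * Max number of node responses waiting to be merged, given sorted
     * arrival times (ms) and a fixed per-response merge cost (ms).
     * Models a single merging thread draining responses in order.
     */
    static int maxBacklog(int[] arrivalsMs, int mergeCostMs) {
        Arrays.sort(arrivalsMs);
        int mergerFreeAt = 0;
        int max = 0;
        for (int i = 0; i < arrivalsMs.length; i++) {
            // merge of response i starts when it has arrived AND the merger is free
            int start = Math.max(arrivalsMs[i], mergerFreeAt);
            // responses already arrived but not yet picked up at that instant
            int waiting = 0;
            for (int j = i; j < arrivalsMs.length; j++) {
                if (arrivalsMs[j] <= start) waiting++;
            }
            max = Math.max(max, waiting);
            mergerFreeAt = start + mergeCostMs;
        }
        return max;
    }

    public static void main(String[] args) {
        int nodes = 100;
        int mergeCostMs = 5; // 100 merges in 0.5 s

        // responses spread evenly across [250 ms, 750 ms]
        int[] spread = new int[nodes];
        for (int i = 0; i < nodes; i++) spread[i] = 250 + 5 * i;

        // all responses landing at 250 ms at once
        int[] burst = new int[nodes];
        Arrays.fill(burst, 250);

        System.out.println("spread backlog = " + maxBacklog(spread, mergeCostMs));
        System.out.println("burst backlog  = " + maxBacklog(burst, mergeCostMs));
    }
}
```

When the skew matches the merge throughput (the `spread` case), the backlog stays at 1; when everything lands at once (the `burst` case), the full 100 responses pile up before draining, which is where the memory accumulation would show.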

Starvation of the `mergeFJP` is a very real concern; that is why 
`ForkJoinPool.managedBlock` is needed. When workers block inside a 
`ManagedBlocker`, the FJP detects that too many threads are blocked and 
compensates by adjusting its thread pool. I don't have strong examples of 
group-by queries on brokers using this approach, so I haven't exercised that 
code path yet, but the topN code path seems to be working as expected with 
this approach without encountering thread starvation.
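For context, here is a minimal sketch of the `ForkJoinPool.managedBlock` pattern being described, assuming a blocking handoff queue between node responses and the merging task. The class and the `takeManaged` helper are hypothetical names for illustration, not Druid code; only `ForkJoinPool.managedBlock` and `ManagedBlocker` are the real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.LinkedBlockingQueue;

public class ManagedBlockDemo {
    /**
     * Take from a queue inside a ManagedBlocker so the ForkJoinPool can see
     * the worker is blocked and spin up a compensating thread instead of
     * starving the pool.
     */
    static <T> T takeManaged(final BlockingQueue<T> queue) throws InterruptedException {
        final Object[] result = new Object[1];
        ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
            @Override
            public boolean block() throws InterruptedException {
                if (result[0] == null) {
                    result[0] = queue.take(); // the actual blocking call
                }
                return true; // done blocking
            }

            @Override
            public boolean isReleasable() {
                // try a non-blocking poll first; if it succeeds we never block
                if (result[0] == null) {
                    result[0] = queue.poll();
                }
                return result[0] != null;
            }
        });
        @SuppressWarnings("unchecked")
        T t = (T) result[0];
        return t;
    }

    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(2); // deliberately small pool
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();

        // submit more blocking tasks than there are workers; managedBlock
        // lets the pool compensate rather than stall
        List<ForkJoinTask<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            tasks.add(pool.submit(() -> takeManaged(queue)));
        }
        Thread.sleep(100); // give all four tasks a chance to block
        for (int i = 0; i < 4; i++) {
            queue.put(i);
        }
        int sum = 0;
        for (ForkJoinTask<Integer> t : tasks) {
            sum += t.join();
        }
        System.out.println("sum=" + sum);
        pool.shutdown();
    }
}
```

The `isReleasable` fast path matters: if a result is already available, the pool never pays the cost of a compensating thread.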

I'll definitely check out the `ChainedExecutionQueryRunner` approach.

[ Full content available at: 
https://github.com/apache/incubator-druid/pull/5913 ]
This message was relayed via gitbox.apache.org for [email protected]