Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15840 )
Change subject: IMPALA-6984: coordinator cancels backends on EOS
......................................................................

Patch Set 4:

I tried to think through what queries might be vulnerable to profile counter values being non-deterministic because of early cancellation. Dividing queries into categories:

First, queries that return non-deterministic results, e.g. "select * from table limit 10" from a table with more than 10 rows. These would, in most cases, already be flaky, because they do not return consistent results.

Second, a special case: queries that are not guaranteed to be deterministic but happen to return deterministic results on the particular input data, e.g. my primitive_cancel_union query, or "select * from table where <predicate> limit 10" where the predicate returns exactly 10 rows. The case where the number of rows returned exactly equals the limit is particularly flaky, because cancellation will race with the scan reaching the limit. These could definitely get flakier. Hopefully we don't have many goofy test queries that do things like this and also have RUNTIME_PROFILE checks. I guess the various runtime filter tests *do* do weird things like this; test case 12 in runtime_filters.test looks like an example. I could tweak it to increase the limits and avoid the non-determinism. I looked at the limit queries in bloom_filter.test and I think they're benign, because the RUNTIME_PROFILE checks are not testing counters in the plan subtrees that would be short-circuited.

Third, deterministic queries, where the plan tree will generally process the whole input without cancelling any fragments early, e.g. queries without limits, or queries with a limit but an ORDER BY or an aggregation at the top of the query. We can only start cancelling the query once eos has been sent through the exchange.
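To make the second category concrete, here is a minimal toy model (not Impala code; the function and parameter names are made up for illustration) of why a counter like a scan node's rows-read value can vary when cancellation races a scan that stops exactly at its limit:

```python
# Hypothetical sketch: a scan that stops on its own once `limit` rows are
# returned, but can also be short-circuited by coordinator cancellation
# arriving after `cancel_after` rows. The final counter value depends on
# which event happens first.

def rows_read(total_rows, limit, cancel_after):
    """Return the simulated rows-read counter for one scan."""
    read = 0
    for _ in range(total_rows):
        if read >= limit:         # scan satisfied the limit itself
            break
        if read >= cancel_after:  # cancellation won the race
            break
        read += 1
    return read

# Cancellation arrives long after the limit is hit: counter is stable.
print(rows_read(100, 10, cancel_after=50))  # 10
# Cancellation arrives before the limit is hit: counter depends on timing.
print(rows_read(100, 10, cancel_after=7))   # 7
```

In a real run `cancel_after` is effectively random (it depends on RPC timing), which is what makes a RUNTIME_PROFILE check on such a counter flaky.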
The profiles for the ExecNode tree should (as far as I can see) be deterministic, because exec_tree_->Close() is called before KrpcDataStreamSender::FlushFinal(), so I don't think the ExecNode tree profiles can change after eos is sent through the exchange. I guess one weird thing that can happen is that the sender fragment gets cancelled before the EOS RPC gets acked, and FlushFinal() exits with a CANCELLED status. As far as I can tell the code flow doesn't really change in that case; just the state of the fragment is CANCELLED instead of OK.

I *think* that might cover all queries. I'm going to loop the runtime profile tests for a while to see if anything is flaky, but I'm thinking that maybe this is fairly safe.

--
To view, visit http://gerrit.cloudera.org:8080/15840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I966eceaafdc18a019708b780aee4ee9d70fd3a47
Gerrit-Change-Number: 15840
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 May 2020 01:46:42 +0000
Gerrit-HasComments: No