Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15840 )

Change subject: IMPALA-6984: coordinator cancels backends on EOS
......................................................................


Patch Set 4:

I tried to think through what queries might be vulnerable to profile counter 
values being non-deterministic because of early cancellation. Dividing queries 
into categories...

Frst, queries that return non-deterministic results, e.g. select * from table 
limit 10 from a table with more than 10 rows. These would, in most cases, 
already be flaky, because they do not return consistent results.

Second, a special case is queries that are not guaranteed to be deterministic, 
but happen to return deterministic results on the particular input data. E.g. 
my primitive_cancel_union query, or "select * from table where <predicate> 
limit 10" where the predicate returns exactly 10 rows. The case when the rows 
returned is exactly the limit is particularly flaky, because cancellation will 
race. These could definitely get flakier. Hopefully we don't have many goofy 
test queries that do things like this and also have RUNTIME_PROFILE checks. I 
guess the various runtime filter tests *do* weird things like this

Test case 12 in runtime_filters.test looks like an example of this. I could 
tweak this to increase the limits and avoid the non-determinism. I looked at 
the limit queries in bloom_filter.test and I think they're benign, because the 
RUNTIME_PROFILE checks are not testing counters in the plan subtrees that would 
be short-circuited.

Third, the plan tree of a deterministic query will generally process the whole 
input without cancelling any fragments early. E.g. queries without limits, a 
limit with an ORDER BY or an aggregation at the top of the query. We can only 
start cancelling the query once eos has been sent through the exchange. The 
profiles for the ExecNode tree should (as far as I can see) be deterministic, 
because exec_tree_->Close() is called before 
KrpcDataStreamSender::FlushFinal(), so I don't think the ExecNode tree profiles 
can change after eos is sent through the exchange.

I guess one weird thing that can happen is that the sender fragment gets 
cancelled before the EOS RPC gets acked, and FlushFinal() exits with CANCELLED 
status. As far as I can tell the code flow doesn't really change in that case, 
just the state of the fragment is CANCELLED instead of OK.

I *think* that might cover all queries. I'm going to loop the runtime profile 
tests for a while to see if anything is flaky, but I'm thinking that maybe this 
is fairly safe.


--
To view, visit http://gerrit.cloudera.org:8080/15840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I966eceaafdc18a019708b780aee4ee9d70fd3a47
Gerrit-Change-Number: 15840
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 May 2020 01:46:42 +0000
Gerrit-HasComments: No

Reply via email to