Hello Michael Ho, Sailesh Mukil, Joe McDonnell, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/8997 to look at the new patch set (#4). Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances ...................................................................... IMPALA-6338: Fix flaky test_profile_fragment_instances test_profile_fragment_instances checks that, once all the results have been returned, every fragment instance appears in the query profile for a query that internally cancels fragment instances that are still executing when the results have been fully returned. Every fis is guaranteed to send a profile to the coordinator in Finalize(), but previously fragment profiles were not applied by the coordinator if the backend was 'done', defined as either all instances having completed or one has entered an error state (including cancelled). So, the test could fail by the following sequence: - Some fragment for a particular backend sends an update to the coordinator. 'returned_all_results_' is true, so the coordinator responds indicating the the backend should cancel its remaining fragments. - Another fragment from that backend executes Finalize() and reports that it was cancelled. This causes the coordinator to consider the entire backend to be 'done'. - A third fragment, which had not previously sent a report from the reporting thread, from the same backend executes Finalize(). This report will not be applied by the coordinator as the backend is considered 'done', so this fragment will not appear in the final profile. The solution is to change the definition of 'done' to not include a backend that has been cancelled but still has fragments that haven't completed. This guarantees that for queries that complete successfully and are cancelled internally, all fis will send a report and have it applied by the coordinator before all results have been returned, since if eos is true Coordinator::GetNext() calls WaitForBackendCompletion(), which in this situation will now wait for all fis to Finalize(). Returning results for queries that are cancelled by the user is unaffected as the manual cancel path causes WaitForBackendCompletion(). Testing: - Ran test_profile_fragment_instances in a loop with no failures. I can reliably repro the original problem with a few carefully placed sleeps. Change-Id: I77773a1e3c62952003f37f88fe2b662bb11889ed --- M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M tests/query_test/test_observability.py 5 files changed, 27 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/8997/4 -- To view, visit http://gerrit.cloudera.org:8080/8997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I77773a1e3c62952003f37f88fe2b662bb11889ed Gerrit-Change-Number: 8997 Gerrit-PatchSet: 4 Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>