[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 4: This change did not cherrypick successfully into branch 2.x. To resolve this, please do the cherry-pick manually and submit it to Gerrit at refs/for/2.x or add an exception to the branch 2.x copy of bin/ignored_commits.json. Thanks, your friendly bot at https://jenkins.impala.io/job/cherrypick-2.x-and-test/520/ . -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 4 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 19 May 2018 23:10:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed - loop query_test & failure for 12 hours with no dchecks or crashes (This had previously reproduced IMPALA-7030 and IMPALA-7033 with the previous version of this change). Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Reviewed-on: http://gerrit.cloudera.org:8080/10440 Reviewed-by: Dan HechtTested-by: Impala Public Jenkins --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 396 insertions(+), 391 deletions(-) Approvals: Dan Hecht: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 4 Gerrit-Owner: Dan Hecht Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 3: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 3 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 18 May 2018 04:36:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 3 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 18 May 2018 01:22:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2502/ -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 3 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 18 May 2018 01:23:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 2 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 23:26:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 2: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@695 PS1, Line 695: is_fragment_failure ? > as mentioned in this thread, it will be no worse than before this change. I sounds good. Thanks for the explanation. -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 2 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 22:30:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10440 to look at the new patch set (#2). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed - loop query_test & failure for 12 hours with no dchecks or crashes (This had previously reproduced IMPALA-7030 and IMPALA-7033 with the previous version of this change). Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 396 insertions(+), 391 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/10440/2 -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 2 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h@351 PS1, Line 351: Cannot finalize execution until exec RPCs are no longer : /// being sent. > We added a DCHECK, but this comment suggests that this method will ensure t Yeah, I can see how the comment is confusing. Updated the comments. (Ultimately, in a future change I likely will move the wait() but want to hold off on that to "prove" that the wait is only needed for the UpdateBackendExecStatus() call path) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@530 PS1, Line 530: // TODO: IMPALA-7011 is this needed? This will also happen on the "backend" side of : // cancel rpc. And in the case of EOS, sink already knows this. > I think we can update this comment, since we figured out that we do need it Done. (Though I haven't given up on this quite yet -- I think it's needed given the current plan-root-sink code, but it might be possible to fix that code to make it not needed. But the JIRA can track that work). http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@695 PS1, Line 695: exec_rpcs_complete_barrier_->Wait(); > should we be concerned about adding a possible long wait() call on an RPC c as mentioned in this thread, it will be no worse than before this change. In the old code, we hold the Coordinator::lock_ while sending the exec RPCs. Additionally, we take the Coordinator::lock_ inside UpdateStatus() and also at line 688 -- so this RPC handler will always block until the exec rpcs are done sending in the old code. In the new code, the blocking only happens when the query hits an error. I agree it's not ideal, but we're better off. And I plan to improve things with IMPALA-6788. -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 22:00:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h@351 PS1, Line 351: Cannot finalize execution until exec RPCs are no longer : /// being sent. > this comment is new We added a DCHECK, but this comment suggests that this method will ensure that finalization fails(like returns a non-ok status) if exec RPCs are pending. we should either move the wait() call to this method or move this comment to UpdateBackendExecStatus() and update the one for this to say that "It should not be called if exec RPCs are pending" http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@530 PS1, Line 530: // TODO: IMPALA-7011 is this needed? This will also happen on the "backend" side of : // cancel rpc. And in the case of EOS, sink already knows this. I think we can update this comment, since we figured out that we do need it. http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@695 PS1, Line 695: exec_rpcs_complete_barrier_->Wait(); should we be concerned about adding a possible long wait() call on an RPC callback? -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 21:48:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@693 PS1, Line 693: // We may start receiving status reports before all exec rpcs are complete. : // Can't apply state transition until no more exec rpcs will be sent. : exec_rpcs_complete_barrier_->Wait(); > Just to make sure I understand the bug correctly, we only need to wait for For the bug we were hitting. But we also need to wait for all exec rpcs to avoid the case of the cancel rpc passing the exec rpc for the other backends (the new comment in HandleExecStateTransition is meant to explain that - lmk if it needs clarification). I think what we'll want to do is preserve this barrier, but be able to skip sending exec rpcs when there is an error to unblock this barrier sooner. -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 19:16:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: Code-Review+1 (1 comment) Bikram may want to take a look too. http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@693 PS1, Line 693: // We may start receiving status reports before all exec rpcs are complete. : // Can't apply state transition until no more exec rpcs will be sent. : exec_rpcs_complete_barrier_->Wait(); > Also note that we are still better off than before this patch because previ Just to make sure I understand the bug correctly, we only need to wait for exec RPCs for this backend, right? I.e. waiting for all exec RPCs is sufficient but not necessary. Not suggesting we need to change it. -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 17 May 2018 18:14:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@693 PS1, Line 693: // We may start receiving status reports before all exec rpcs are complete. : // Can't apply state transition until no more exec rpcs will be sent. : exec_rpcs_complete_barrier_->Wait(); > this is the fix for IMPALA-7030/IMPALA-7033 race. Also, I plan to add a reg Also note that we are still better off than before this patch because previously, the Coordinator::lock_ was held while sending exec rpcs and also taken by this callback, and so all status report RPCs would block until we finish exec rpcs. Now, we only block a failure status. Also, i plan to improve this further with IMPALA-6788 (which will stop sending exec rpcs once we get the first failure report or exec rpc fails). -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Dan Hecht Gerrit-Comment-Date: Thu, 17 May 2018 17:48:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10440 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 1: (4 comments) This is the same as the previous reviewed patch at https://gerrit.cloudera.org/#/c/10158/ except where noted in this comment. I backed out the change in order to add a fix for IMPALA-7030/7033. http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.h@351 PS1, Line 351: Cannot finalize execution until exec RPCs are no longer : /// being sent. this comment is new http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@523 PS1, Line 523: // Can't transition until the exec RPCs are no longer in progress. Otherwise, a : // cancel RPC could be missed, and resources freed before a backend has had a chance : // to take a resource reference. : DCHECK(exec_rpcs_complete_barrier_ != nullptr && : exec_rpcs_complete_barrier_->pending() <= 0) << "exec rpcs not completed"; this DCHECK is new http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@664 PS1, Line 664: PrintId(query_id()) I folded in the PrintId() into this version of the change. http://gerrit.cloudera.org:8080/#/c/10440/1/be/src/runtime/coordinator.cc@693 PS1, Line 693: // We may start receiving status reports before all exec rpcs are complete. : // Can't apply state transition until no more exec rpcs will be sent. : exec_rpcs_complete_barrier_->Wait(); this is the fix for IMPALA-7030/IMPALA-7033 race. Also, I plan to add a regression test specifically for this race with IMPALA-7046. -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan HechtGerrit-Reviewer: Dan Hecht Gerrit-Comment-Date: Thu, 17 May 2018 17:45:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10440 Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed - loop query_test & failure for 12 hours with no dchecks or crashes (This had previously reproduced IMPALA-7030 and IMPALA-7033 with the previous version of this change). Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 398 insertions(+), 391 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/10440/1 -- To view, visit http://gerrit.cloudera.org:8080/10440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I6dc08da1295f1df3c9dce6d35d65d887b2c00a1c Gerrit-Change-Number: 10440 Gerrit-PatchSet: 1 Gerrit-Owner: Dan Hecht
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 22: This change did not cherrypick successfully into branch 2.x. To resolve this, please do the cherry-pick manually and submit it to Gerrit at refs/for/2.x or add an exception to the branch 2.x copy of bin/ignored_commits.json. Thanks, your friendly bot at https://jenkins.impala.io/job/cherrypick-2.x-and-test/486/ . -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 22 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 12 May 2018 07:10:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Reviewed-on: http://gerrit.cloudera.org:8080/10158 Reviewed-by: Dan HechtTested-by: Impala Public Jenkins --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 393 insertions(+), 392 deletions(-) Approvals: Dan Hecht: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 22 Gerrit-Owner: Dan Hecht Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 21: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 21 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 12 May 2018 00:46:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 21: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@476 PS18, Line 476: DCHECK(exec_status_.ok()) << exec_status_; > I was thinking the ($3 -> $4) already gives that info, once you get used to Agreed. That makes sense. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 21 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 23:59:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 21: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2455/ -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 21 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 21:18:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 21: Code-Review+2 Rebase -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 21 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 21:18:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@476 PS18, Line 476: DCHECK(exec_status_.ok()) << exec_status_; > Maybe we can have a larger discussion later, and leave this as it is now. B I was thinking the ($3 -> $4) already gives that info, once you get used to it. For an error that results in query error, it will look like: (EXECUTING -> ERROR) for other errors (that are effectively dropped), it will look like e.g.: (CANCELLED -> CANCELLED) (RETURNED_RESULTS -> RETURNED_RESULTS) (ERROR -> ERROR) which says that we didn't enter the error state for this status. I'm definitely open to tweaking the condition for this log in the future, but just thinking given the amount of active development in the code that will call this path (i.e. IMPALA-2990), it might be helpful to have at least initially. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 18 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 21:16:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 20: Code-Review+2 (4 comments) Thanks! This LGTM, +2-ing it since it's already gone through multiple rounds of review. All comments are just responses for discussion's sake, and don't require any code changes. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h@325 PS18, Line 325: rySummary() and > ExecState is a scalar (enum). We pass scalar by value - no reason to have i I agree but on the other hand, keeping it const would help by avoiding accidental bugs such as overwriting the exec state in future changes, so it's better just to define the read only contract upfront. Also agree with no need for a ref. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@279 PS14, Line 279: > Cancel() is not called for (execution) errors. Thats part of the point of t I think I meant CancelBackends() as the NotifyAll() lives there, but the comment you updated it to is much clearer. Thanks. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@457 PS18, Line 457: ret_status > exec_status_ is protected by the lock, so the copy is necessary. Fair point, I missed that. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@476 PS18, Line 476: DCHECK(exec_status_.ok()) << exec_status_; > My thinking was that it would be good to at least log the error, which is w Maybe we can have a larger discussion later, and leave this as it is now. But just as someone who frequents the logs to debug issues, if I see an error for a query, I would assume by default that the query failed, which could lead to some confusion, if it in fact didn't fail. Maybe a solution would be to meet in the middle and tag log messages such as this with something like " failed but query succeeded", etc. But we don't need to have that convo in this patch. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 20 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 21:11:01 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@457 PS18, Line 457: ret_status > Just fyi, there's no copy because return-value-optimisation applies to ret_ RVO eliminates the copy of ret_value into the thing that is accepting it, but there's still a copy from exec_status_ while holding the lock. But, yeah, agree with your point that it ultimately we end up with one copy from exec_status_ in both cases. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 18 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 17:15:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@457 PS18, Line 457: ret_status > exec_status_ is protected by the lock, so the copy is necessary. Just fyi, there's no copy because return-value-optimisation applies to ret_status. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 18 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 16:57:10 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 20: Code-Review+1 Carry +1s -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 20 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 16:44:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 20: (12 comments) Thanks for taking a look Sailesh. http://gerrit.cloudera.org:8080/#/c/10158/19/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/19/be/src/runtime/coordinator.h@93 PS19, Line 93: /// 2. exec_state_lock_, backend_states_init_lock_, filter_lock_, > What about backend_states_init_lock_ ? Done http://gerrit.cloudera.org:8080/#/c/10158/19/be/src/runtime/coordinator.h@206 PS19, Line 206: > comment requires an update. referenced the class level comment. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h@325 PS18, Line 325: rySummary() and > const ref ExecState is a scalar (enum). We pass scalar by value - no reason to have indirection. I can make them const though, if you prefer, given these are leafs routines. Generally, I'm not a huge fan of const scalar args because it leads to const creep with little benefit (compiler has to perform analysis anyway, and usually easy enough to see when non-pointers are not modified/aliased). http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h@350 PS18, Line 350: in all cases releases resources and cal > These look like they're read only, so they should be passed as const refs. Scalar and in registers, so not going to pass by reference. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h@357 PS18, Line 357: > const ref made it const http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.h@360 PS18, Line 360: > const ref made it const http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@127 PS14, Line 127: (); > You mean Cancel() and UpdateBackendExecStatus()? Maybe this is not worth me This comment was pre-existing but I'm fine with removing it. It's covered in the class comment. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@279 PS14, Line 279: > To be more explicit, we should say "... or when Cancel() is called." Cancel() is not called for (execution) errors. Thats part of the point of the patch is to distinguish canceled state and error state. (It's true that the impala server layer can implement an error at that layer using Cancel(), but that's not an execution error, and I'd rather not muddle the terminology). Will update this comment to talk about backend cancellation, though. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@457 PS18, Line 457: ret_status > Preferable to get rid of the automatic variable 'ret_status' and just retur exec_status_ is protected by the lock, so the copy is necessary. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@462 PS18, Line 462: Status ret_status; > Same here. Same locking. http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@476 PS18, Line 476: DCHECK(exec_status_.ok()) << exec_status_; > If an error happens after exec_state_ is already set to ExecState::RETURNED My thinking was that it would be good to at least log the error, which is why I wrote the code that way. Alternatively, we could only log when a state transition happens, but it seems like it might be good to know when an backend error occurs even if not failing the query, but I'm also not a fan of log spew. Are you arguing for one way or the other? http://gerrit.cloudera.org:8080/#/c/10158/18/be/src/runtime/coordinator.cc@486 PS18, Line 486: !status.ok() && !status.IsCancelled() && e > (!status.ok() && !status.IsCancelled() && exec_status_.IsCancelled()) This path should be rare, but sure we can avoid the copy in that case. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 20 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 11 May 2018 16:37:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#20). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 393 insertions(+), 392 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/20 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 20 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/19/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/19/be/src/runtime/coordinator.cc@599 PS19, Line 599: WaitForBackends(); somewhere along the way I broke Kudu DML. Moving this back outside the if-stmt fixes that. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 19 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 23:36:06 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#19). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 387 insertions(+), 391 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/19 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 19 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 18: Code-Review+1 Thanks for the reviews! rebase and carry. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 18 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 17:10:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#17). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 389 insertions(+), 391 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/17 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 17 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: (3 comments) http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@522 PS15, Line 522: query_events_->MarkEvent(exec_state_to_event.at(new_state)); : // TODO: is this needed? This will also happen on the "backend" side of cancel rpc. : // And in the case of EOS, sink already knows this. > makes sense. I filed IMPALA-7011 for that. http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc@132 PS16, Line 132: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc@618 PS16, Line 618: > concurrently Done -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 14 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 17:08:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 16: Code-Review+1 (3 comments) http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@522 PS15, Line 522: // TODO: is this needed? This will also happen on the "backend" side of cancel rpc. : // And in the case of EOS, sink already knows this. : if (coord_sink_ != nullptr) coord_sink_->CloseConsumer(); > Yeah, I agree, but would prefer to do it in an independent commit in case t makes sense. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@943 PS15, Line 943: e : backend_states_) { > In principle, I agree that's how we'd want it to work. But given the way Cl sounds good http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc@132 PS16, Line 132: nit: extra space -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 16 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 02:16:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 16: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/16/be/src/runtime/coordinator.cc@618 PS16, Line 618: currently concurrently -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 16 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 00:47:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@291 PS14, Line 291: ExecState exec_state_ = ExecState::EXECUTING; > I agree it's a little confusing, but there is nothing you can do with this I think the changes to the header comment made it clearer. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 14 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 10 May 2018 00:31:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 16: (13 comments) http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@322 PS15, Line 322: cu > double space snuck back in Done http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@341 PS15, Line 341: ith 'instance_ho > nit: update to "failed_finstance" Done http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@414 PS15, Line 414: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@a115 PS14, Line 115: : > put that back Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@121 PS14, Line 121: may call any > should we call this a non-OK status, because for cancellation through Cance Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@131 PS14, Line 131: > do you mean cancels? I think I meant to delete that. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@147 PS15, Line 147: Status prepare_status = query_state_->WaitForPrepare(); : DCHECK(!prepare_status.ok()); : return UpdateExecState > should we call UpdateExecStateIfError() here? yes, done. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@486 PS15, Line 486: !status.ok() && exec_status_.IsCancelled() > can we have a case with= ? Ideally no. but I think the backend can sometimes do it due to complications in the error propagation path. I would like to get to a point where this can't be cancelled, but I don't think we're there yet. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@522 PS15, Line 522: // TODO: is this needed? This will also happen on the "backend" side of cancel rpc. : // And in the case of EOS, sink already knows this. : if (coord_sink_ != nullptr) coord_sink_->CloseConsumer(); > I dont think we need this, I looked at why we originally needed this and it Yeah, I agree, but would prefer to do it in an independent commit in case there is some unforeseen issue. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@530 PS15, Line 530: : } else { > in case of cancelBackends(), should we also wait for resources to be freed? I think the argument can be made in both directions, but we eventually settled on the current behavior. The reasoning is documented in the header comment for ReleaseAdmissionControlResources(). I'll update this todo as to not add confusion. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@605 PS15, Line 605: > maybe change the name to something like UpdateExecState, since this is also The "if error" was meant to say that the exec state is only updated if the status is an not-ok and only to move into an error state (as opposed to, say, cancell state). But I can see the confusion so i'll change it since you feel it'd be clearer. I guess the header comment should clarify that this shouldn't be used for cancellation. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@639 PS15, Line 639: DCHECK(exec_rpcs_complete_barrier_ != nullptr && : exec_rpcs_complete_barrier_->pending() <= 0) << "Exec() must be called first"; : discard_result(SetNonErrorTerminalSt > does this mean we can call cancel before Exec() completes? what if cancel i Yeah, this is bogus. Fixed. The comments in the header already say Exec() needs to precede this. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@943 PS15, Line 943: e : backend_states_) { > Do we expect these methods to be called before Exec() completes? If not, we In principle, I agree that's how we'd want it to work. But given the way ClientRequestState currently works and that things like the webserver just reach in and get the coord() object, I think they can currently call these webserver support rourtines before Exec() completes. How about I revisit this after your patch is merged? -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#16). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 389 insertions(+), 391 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/16 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 16 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 15: (11 comments) http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@341 PS15, Line 341: failed_fragment' nit: update to "failed_finstance" http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@414 PS15, Line 414: nit: extra space http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@121 PS14, Line 121: s not thread should we call this a non-OK status, because for cancellation through Cancel() we should not call it error http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@131 PS14, Line 131: tate_ do you mean cancels? http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@147 PS15, Line 147: Status prepare_status = query_state_->WaitForPrepare(); : DCHECK(!prepare_status.ok()); : return prepare_status; should we call UpdateExecStateIfError() here? http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@486 PS15, Line 486: !status.ok() && exec_status_.IsCancelled() can we have a case with= ? http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@522 PS15, Line 522: // TODO: is this needed? This will also happen on the "backend" side of cancel rpc. : // And in the case of EOS, sink already knows this. : if (coord_sink_ != nullptr) coord_sink_->CloseConsumer(); I dont think we need this, I looked at why we originally needed this and it was because the fragment lifecycle was a little different and we needed to to close the sink in case the Prepare phase of the fragments failed. After patch IMPALA-2550, we handle partially-prepared state failure differently and hence dont require this anymore. Moreover, since this was used to close partially-prepared state failure and now if we fail prepare, coord_sink_ is not even set, so this is essentially redundant. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@530 PS15, Line 530: ensure resources are : // really freed before admitting more queries? in case of cancelBackends(), should we also wait for resources to be freed? http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@605 PS15, Line 605: UpdateExecStateIfError maybe change the name to something like UpdateExecState, since this is also used to only get the overall status and it may or may not be an error status http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@639 PS15, Line 639: // Wait until backends have started executing, otherwise the cancel rpc might arrive : // before the exec rpc. : exec_rpcs_complete_barrier_->Wait(); does this mean we can call cancel before Exec() completes? what if cancel is called before exec_rpcs_complete_barrier_ is initialized and is still nullptr. We should probably just mark this method to only be called after Exec completes because currently that how it is used. http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.cc@943 PS15, Line 943: backend_states_init_lock_ Do we expect these methods to be called before Exec() completes? If not, we can get rid of backend_states_init_lock_ and mark these methods to only be called after Exec() -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 15 Gerrit-Owner: Dan Hecht Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 09 May 2018 20:55:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: (2 comments) http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@a115 PS14, Line 115: : put that back http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.cc@623 PS14, Line 623: TOOD > Inspection of the ClientRequestState code. I think it's already true, but i It's IMPALA-4580. I'm not gonna muck with that now -- will revert part of the header comment to keep this behavior documented. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 14 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 09 May 2018 03:41:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/15/be/src/runtime/coordinator.h@322 PS15, Line 322: double space snuck back in -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 15 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 09 May 2018 03:24:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: (33 comments) http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@23 PS14, Line 23: #include > Not your change but these accumulators don't look like they're used. moved to coordinator-backend-state.h, where they are used. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@33 PS14, Line 33: #include > Not used? Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@37 PS14, Line 37: #include "common/hdfs.h" > Move to .cc? Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@42 PS14, Line 42: #include "runtime/query-state.h" > I think you could move this to the .cc - QueryState is already forward-decl to .cc and coordinator-backend-state.h http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@43 PS14, Line 43: #include "scheduling/query-schedule.h" > Could forward-declare QuerySchedule and move this to the .cc. query_ctx() was relying on that so moved that to .cc (and moved query_id() back). http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@87 PS14, Line 87: /// 1. When an error is encountered while still executing, backend cancellation is > Can we define what "executing" means? In this context it just means hasn't encountered either of these three conditions. I've tried to clarify by rewording. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@99 PS14, Line 99: /// > Should we document the lock order? Coordinator::lock_ is mentioned in the I I deleted that impala-server.h comment, lmk if you disagree. Added short comment about ordering. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@114 PS14, Line 114: > extra spaces in this comment Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@130 PS14, Line 130: /// Cancel execution of query and sets the overall query status to CANCELLED if the > Long line Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@191 PS14, Line 191: accesses > accessing. Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@214 PS14, Line 214: boost::mutex wait_lock_; > Switch to SpinLock? The inconsistency might be confusing otherwise. Probabl Let me do that in a follow up commit. It should be fine but could have some weird subtle side effect. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@218 PS14, Line 218: /// Keeps track of number of completed ranges and total scan ranges. Initialized by Exec(). > Long line. Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@219 PS14, Line 219: ProgressUpdater progress_; > I don't feel that strongly but it seems like we could PIMPL this and avoid Given you don't feel strongly (nor do I) and it's not related to this change and the change is large enough (in terms of complexity) let's defer that for now. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@291 PS14, Line 291: ExecState exec_state_ = ExecState::EXECUTING; > It's a little confusing that this is EXECUTING before Exec() is called. The I agree it's a little confusing, but there is nothing you can do with this object between construction and calling Exec() so not sure it's worth making a distinct state (the implementation is also simplified given that all transformations result in a terminal state). Maybe I can clarify with comments -- Which part are you referring to about the distinction? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@325 PS14, Line 325: /// If 'exec_state_' != EXECUTING, does nothing. Otherwise sets 'exec_state_' to > Maybe a brief high-level description would be helpful. E.g. "Called when th Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@326 PS14, Line 326: /// 'state' (must be either CANCELLED or RETURNED_RESULTS), > Formatting of this comment seems weird. not sure what happened there. fixed. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@333 PS14, Line 333: /// Transitions 'exec_state_' given an execution status: > It's a little hard to decode why someone would call this with an OK status I moved the last line up to the beginning. Hopefully that helps clarify. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@351 PS14, Line 351: > extra space Done http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@360 PS14, Line 360: void > void argument list not needed in C++ - ()
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#15). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/service/impala-server.h M be/src/util/counting-barrier.h 6 files changed, 391 insertions(+), 390 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/15 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 15 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: (33 comments) I think this makes sense. I really need to do another pass to verify my understanding. I have a lot of comments but they're mainly cleanup opportunities and about comments. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@23 PS14, Line 23: #include Not your change but these accumulators don't look like they're used. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@33 PS14, Line 33: #include Not used? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@37 PS14, Line 37: #include "common/hdfs.h" Move to .cc? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@42 PS14, Line 42: #include "runtime/query-state.h" I think you could move this to the .cc - QueryState is already forward-declared. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@43 PS14, Line 43: #include "scheduling/query-schedule.h" Could forward-declare QuerySchedule and move this to the .cc. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@87 PS14, Line 87: /// 1. When an error is encountered while still executing, backend cancellation is Can we define what "executing" means? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@99 PS14, Line 99: /// Should we document the lock order? Coordinator::lock_ is mentioned in the ImpalaServer header but there are multiple locks in here. Or is it sufficient to document that individual locks are terminal? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@114 PS14, Line 114: extra spaces in this comment http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@130 PS14, Line 130: /// Cancel execution of query and sets the overall query status to CANCELLED if the Long line http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@191 PS14, Line 191: accesses accessing. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@214 PS14, Line 214: boost::mutex wait_lock_; Switch to SpinLock? The inconsistency might be confusing otherwise. Probably doesn't matter that much... http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@218 PS14, Line 218: /// Keeps track of number of completed ranges and total scan ranges. Initialized by Exec(). Long line. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@219 PS14, Line 219: ProgressUpdater progress_; I don't feel that strongly but it seems like we could PIMPL this and avoid including the header above. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@291 PS14, Line 291: ExecState exec_state_ = ExecState::EXECUTING; It's a little confusing that this is EXECUTING before Exec() is called. There's a distinction made above about being in an "Executing" state, which is distinct from this being EXECUTING. Maybe there should be a different initial state to make that part of the state machine explicit? http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@325 PS14, Line 325: /// If 'exec_state_' != EXECUTING, does nothing. Otherwise sets 'exec_state_' to Maybe a brief high-level description would be helpful. E.g. "Called when the query is done executing in a non-error state." http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@326 PS14, Line 326: /// 'state' (must be either CANCELLED or RETURNED_RESULTS), Formatting of this comment seems weird. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@333 PS14, Line 333: /// Transitions 'exec_state_' given an execution status: It's a little hard to decode why someone would call this with an OK status based on the comment - at least until you get all the ways to the last line of the comment. I'm not sure that I have a concrete suggestion since the comment is already quite long, but maybe you have an idea. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@351 PS14, Line 351: extra space http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@360 PS14, Line 360: void void argument list not needed in C++ - () is equivalent to (void) unlike C. http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.h@406 PS14, Line 406: extra space http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10158/14/be/src/runtime/coordinator.cc@54 PS14, Line 54: using namespace strings;
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10158 ) Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. Patch Set 14: I think this is ready for review now. -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 14 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 08 May 2018 18:49:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5384, part 2: Simplify Coordinator locking and clarify state
Hello Michael Ho, Sailesh Mukil, Tim Armstrong, Bikramjeet Vig, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10158 to look at the new patch set (#14). Change subject: IMPALA-5384, part 2: Simplify Coordinator locking and clarify state .. IMPALA-5384, part 2: Simplify Coordinator locking and clarify state The is the final change to clarify and break up the Coordinator's lock. The state machine for the coordinator is made explicit, distinguishing between executing state and multiple terminal states. Logic to transition into a terminal state is centralized in one location and executes exactly once for each coordinator object. Derived from a patch for IMPALA_5384 by Marcel Kornacker. Testing: - exhaustive functional tests - stress test on minicluster with memory overcommitment. Verified from the logs that this exercises all these paths: - successful queries - client requested cancellation - error from exec FInstances RPC - error reported asynchronously via report status RPC - eos before backend execution completed Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e --- M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/service/client-request-state.cc M be/src/util/counting-barrier.h 4 files changed, 368 insertions(+), 370 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/10158/14 -- To view, visit http://gerrit.cloudera.org:8080/10158 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1abdfd02163f9356c59d470fe1c64ebe012a9e8e Gerrit-Change-Number: 10158 Gerrit-PatchSet: 14 Gerrit-Owner: Dan HechtGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong