Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15154 )
Change subject: IMPALA-8712: Make ExecQueryFInstances async ...................................................................... Patch Set 3: (15 comments) http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@10 PS2, Line 10: Previously, Impala would issue all the Exec()s, wait for all of them : to complete, and then check if any of them resulted in an error. We : now stop issuing Exec()s and cancel any that are still in flight as : soon as an error occurs. > so this covers IMPALA-6788 as well, right? Not completely - there's still the case where we get an error report from a running fragment while still doing startup. With this patch, we'll still wait for all Exec()s to complete before processing the error. I added a TODO with this JIRA to call this out. http://gerrit.cloudera.org:8080/#/c/15154/2//COMMIT_MSG@23 PS2, Line 23: Removing this thread pool has potential performance implications, as : it means that the Exec() parameters are serialized in serialize rather : than in parallel (with the level of parallelism determined by the size : of the thread pool, which was configurable by an Advanced flag and : defaulted to 12). > do we have a runtime counter to measure this overhead? Yes, we record the amount of time query startup takes, the time between when we mark the event "Ready to start on X backends" to when we mark "All X execution backends started" http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h File be/src/runtime/coordinator-backend-state.h: http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@82 PS2, Line 82: Return true > nit: "Returns true" Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@96 PS2, Line 96: after WaitOnExecRpc( > nit: needs to be updated Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@102 PS2, Line 102: Exec > nit: should the docs of this method be updated to indicate that this method Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@108 PS2, Line 108: imes and f > might want to mention some more docs about thread safety - e.g. is a client Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@369 PS2, Line 369: : > same as below Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.h@375 PS2, Line 375: : int64_t last_report_time_ms_ = 0; > i believe this isn't thread safe. even if it is no longer set after Exec() Went ahead and protected this with 'lock_' always. It does mean I removed some DCHECKs on it in places where we're not taking 'lock_' otherwise, but I don't think that's a big deal. http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@a452 PS2, Line 452: : : : : : : : : : : : : > looks like dead code, but are any of the TODOs still relevant? Sort of. I basically replaced this with the large comment I added earlier in this function, which addresses the same basic issue but better reflects the actual current state of things. http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@173 PS2, Line 173: > i believe we generally try to notify any waiting threads after the lock has Yes, there can potentially be multiple calls to WaitOnExec(), eg. if there are multiple calls to Coordinator::WaitOnExecRpcs(), as discussed in another comment. http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@193 PS2, Line 193: > nit: won't this be called twice, once in SetExecError and again when method Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@228 PS2, Line 228: if (IsEmptyBackend > is this thread safe? this gets set without holding onto the lock, and its p So I went ahead and fixed this by holding 'lock_' for all of the Exec() function, rather than just taking it right before sending the rpc. Its nice, because since we also hold lock_ for all of Cancel(), it significantly simplifies reasoning about possible interleavings between those functions. The downside is that we can't stop ourselves from sending an exec rpc on to backend if we've already started executing BackendState::Exec(), eg. if we realize we hit an error from a different backend while we're serializing the exec rpc params (which was possible in my original patch). That's not something we ever do now anyways. You could imagine doing it when we implement IMPALA-6788, but I think the performance benefits would be minimal vs. the additional code complexity. http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@236 PS2, Line 236: Status get_p > same issue as above, this sets exec_done_ without holding the lock Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator-backend-state.cc@853 PS2, Line 853: value->AddMember(" > should be run after acquiring the lock, or made atomic Done http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/15154/2/be/src/runtime/coordinator.cc@427 PS2, Line 427: if (exec_rpcs_complete_.Load()) return; > is the idea here that this method should only ever be executed once? whats No, this is potentially called many times, eg. its called in UpdateFilter, which is called once per runtime filter per backend producing the filter. So, the point of this branch is to provide a fast path so that we don't have to acquire BackendState::lock_ and check BackendState::exec_done_ for every backend every time its called. This function is also thread safe, whether or not exec_rpcs_complete_ has been set to true already, since exec_rpcs_complete_ is atomic and the function is idempotent. I added a mention of this in the class comment. -- To view, visit http://gerrit.cloudera.org:8080/15154 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32 Gerrit-Change-Number: 15154 Gerrit-PatchSet: 3 Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Fri, 07 Feb 2020 23:02:36 +0000 Gerrit-HasComments: Yes