Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/10855 )
Change subject: IMPALA-7213: Port ReportExecStatus() RPC to use KRPC ...................................................................... Patch Set 8: (22 comments) http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h File be/src/exec/data-sink.h: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h@100 PS5, Line 100: static const char* ROOT_PARTITION_KEY; > can this be a const char* const ROOT_PARTITION_KEY instead? usually static Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h File be/src/exec/hdfs-table-writer.h: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h@91 PS5, Line 91: const InsertS > style: is this meant to be returning a mutable reference or should this be Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc@254 PS5, Line 254: // If this backend completed previously, don't apply the update. : if (IsDone()) return false; : for (const FragmentInstanceExecStatusPB& instance_ > you can use a C++11 style loop with protobuf repeated elements too Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h@276 PS5, Line 276: > "... to construct a status report ..." Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h@277 PS5, Line 277: > "If 'fis' is not 'nullptr', the runtime profile ..." Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@233 PS5, Line 233: return state == BackendExecState::FINISHED : || state == BackendExecState::CANCELLED : || state == BackendExecState::E > I thought I saw some utility code for this elsewhere Good point about adding this as utility function which we don't have one right now. The utility function may not be needed eventually once we convert other ExecFInstance() RPC to KRPC too. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@243 PS5, Line 243: DCHECK(!IsTerminalState(backend_exec_state_)) : << " Current State: " << BackendExecStateToString(backend_exec_state_) : << " | Current Status: " << query_status_.Get > and here Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@256 PS5, Line 256: > Shouldn't we print the fragment instance ID instead? Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@295 PS5, Line 295: tus serialize_status = serializer-> > The profile is a thrift structure serialized to a string ... Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@306 PS5, Line 306: Only send updates to insert status if > We could probably consider moving 'profile_str' to the 'sidecar_buf' to avo I guess it's a tradeoff because moving the 'profile_str' means having to redo the serialization of the profile on when retrying the RPC. I'm a bit worried about the overhead of doing so once we merge the profile of multiple fragment instances together. Timeout is more likely to occur once the coordinator is under stress. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@304 PS5, Line 304: } : : // Only send updates to insert status if fragment is finished, the coordinator waits : // until query execution is done to use them anyhow. : RuntimeState* state = fis->runtime_state(); : if (done) { : state->dml_exec_state()->ToProto(instance_status->mutable_insert_exec_status()); : } : : // Send new errors to coordinator : state->GetUnreportedErrors(instance_status->mutable_error_log()); : } : } : > Can't we just do this once before the loop starts? Not really. We need to re-arm the sidecar on every retry. The ownership of the buffer is transferred to the OutboundCall so we need to re-arm the sidecar. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc@200 PS5, Line 200: g_lock_); > this is weird here because there is no 'exec_status' in this function Typo. Meant 'new_errors'. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc File be/src/service/control-service.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@43 PS5, Line 43: control service' > shouldn't this be "control service's" since there is only one control servi Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@89 PS5, Line 89: } > hmm, why is this repeated instead of just optional, then? That's a direct translation from the existing structure definition. I believe it was done for multi-threading (or IMPALA-4063 ?) when there could be more than one instances of a given fragment running but the conversion for multi-threading seems to be half done at this point so yes, this could have been an optional instead. With IMPALA-4063 which consolidates all fragment instances' statuses into a single RPC, we may actually be having more than one entries in instance_exec_status so I will be keeping it as repeated for now. That said, I believe this DCHECK is actually invalid after some thought. I will remove it and handle the case in which request->instance_exec_status_size() == 0; properly. In fact, Sailesh added a test case for this in this patch here: https://gerrit.cloudera.org/#/c/10813/ http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@117 PS5, Line 117: : : : > since both of the return points of thsi function have this same code, I thi Done. We actually have something like that for DataStreamService. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc File be/src/util/error-util.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc@144 PS5, Line 144: auto > auto& Done http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc@172 PS5, Line 172: auto > auto& Done http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto File common/protobuf/common.proto: http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto@43 PS5, Line 43: > do these need to be kept in sync with common/thrift/Metrics.thrift? Worth l Actually, they aren't needed in this patch anymore as we decided to keep RuntimeProfile in Thrift. They were originally ported when converting TRuntimeProfile to protobuf http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto@153 PS5, Line 153: > can you document whether this should be sent incrementally as you go (poten Yes, this is not idempotent. In fact, I notice that there may lead to inaccurate dml stats when RPC is retried due to timeout and I believe this is already a bug today. The new patch moves 'insert_exec_status' and 'error_log' into FragmentInstanceExecStatusPB instead. As we do avoid applying any updates for a fragment instance once its final report has been received (see https://github.com/apache/impala/blob/master/be/src/runtime/coordinator-backend-state.cc#L258) so we should be safe from this problem in the new code. http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift@380 PS5, Line 380: 6: optional Types.TNetworkAddress coord_address > unrelated to this patch but it always bugs me to see changes to thrift fiel Yes, unfortunately, our compatibility story is really poor. At some point, we should commit to backward compatibility for Thrift / Protobuf structures. http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py File tests/failure/test_failpoints.py: http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py@200 PS6, Line 200: > flake8: E502 the backslash is redundant between brackets Done http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py@202 PS6, Line 202: > flake8: E231 missing whitespace after ':' Done -- To view, visit http://gerrit.cloudera.org:8080/10855 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7638583b433dcac066b87198e448743d90415ebe Gerrit-Change-Number: 10855 Gerrit-PatchSet: 8 Gerrit-Owner: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Wed, 08 Aug 2018 17:39:49 +0000 Gerrit-HasComments: Yes