[Impala-ASF-CR] IMPALA-7213: Port ReportExecStatus() RPC to use KRPC

Michael Ho (Code Review) Wed, 08 Aug 2018 10:40:01 -0700

Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10855 )


Change subject: IMPALA-7213: Port ReportExecStatus() RPC to use KRPC
......................................................................


Patch Set 8:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h@100
PS5, Line 100:   static const char* ROOT_PARTITION_KEY;
> can this be a const char* const ROOT_PARTITION_KEY instead? usually static
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h
File be/src/exec/hdfs-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h@91
PS5, Line 91: const InsertS
> style: is this meant to be returning a mutable reference or should this be
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc@254
PS5, Line 254:   // If this backend completed previously, don't apply the 
update.
             :   if (IsDone()) return false;
             :   for (const FragmentInstanceExecStatusPB& instance_
> you can use a C++11 style loop with protobuf repeated elements too
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h@276
PS5, Line 276:
> "... to construct a status report ..."
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.h@277
PS5, Line 277:
> "If 'fis' is not 'nullptr', the runtime profile ..."
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@233
PS5, Line 233:   return state == BackendExecState::FINISHED
             :       || state == BackendExecState::CANCELLED
             :       || state == BackendExecState::E
> I thought I saw some utility code for this elsewhere
Good point about adding this as utility function which we don't have one right 
now. The utility function may not be needed eventually once we convert other 
ExecFInstance() RPC to KRPC too.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@243
PS5, Line 243:   DCHECK(!IsTerminalState(backend_exec_state_))
             :       << " Current State: " << 
BackendExecStateToString(backend_exec_state_)
             :       << " | Current Status: " << query_status_.Get
> and here
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@256
PS5, Line 256:
> Shouldn't we print the fragment instance ID instead?
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@295
PS5, Line 295: tus serialize_status = serializer->
> The profile is a thrift structure serialized to a string ...
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@306
PS5, Line 306:  Only send updates to insert status if
> We could probably consider moving 'profile_str' to the 'sidecar_buf' to avo
I guess it's a tradeoff because moving the 'profile_str' means having to redo 
the serialization of the profile on when retrying the RPC. I'm a bit worried 
about the overhead of doing so once we merge the profile of multiple fragment 
instances together. Timeout is more likely to occur once the coordinator is 
under stress.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@304
PS5, Line 304:     }
             :
             :     // Only send updates to insert status if fragment is 
finished, the coordinator waits
             :     // until query execution is done to use them anyhow.
             :     RuntimeState* state = fis->runtime_state();
             :     if (done) {
             :       
state->dml_exec_state()->ToProto(instance_status->mutable_insert_exec_status());
             :     }
             :
             :     // Send new errors to coordinator
             :     
state->GetUnreportedErrors(instance_status->mutable_error_log());
             :   }
             : }
             :
> Can't we just do this once before the loop starts?
Not really. We need to re-arm the sidecar on every retry. The ownership of the 
buffer is transferred to the OutboundCall so we need to re-arm the sidecar.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc@200
PS5, Line 200: g_lock_);
> this is weird here because there is no 'exec_status' in this function
Typo. Meant 'new_errors'.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc
File be/src/service/control-service.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@43
PS5, Line 43: control service'
> shouldn't this be "control service's" since there is only one control servi
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@89
PS5, Line 89: }
> hmm, why is this repeated instead of just optional, then?
That's a direct translation from the existing structure definition.  I believe 
it was done for multi-threading (or IMPALA-4063 ?) when there could be more 
than one instances of a given fragment running but the conversion for 
multi-threading seems to be half done at this point so yes, this could have 
been an optional instead.

With IMPALA-4063 which consolidates all fragment instances' statuses into a 
single RPC, we may actually be having more than one entries in 
instance_exec_status so I will be keeping it as repeated for now.

That said, I believe this DCHECK is actually invalid after some thought. I will 
remove it and handle the case in which request->instance_exec_status_size() == 
0; properly. In fact, Sailesh added a test case for this in this patch here: 
https://gerrit.cloudera.org/#/c/10813/


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@117
PS5, Line 117:
             :
             :
             :
> since both of the return points of thsi function have this same code, I thi
Done. We actually have something like that for DataStreamService.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc
File be/src/util/error-util.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc@144
PS5, Line 144: auto
> auto&
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/util/error-util.cc@172
PS5, Line 172: auto
> auto&
Done


http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto
File common/protobuf/common.proto:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto@43
PS5, Line 43:
> do these need to be kept in sync with common/thrift/Metrics.thrift? Worth l
Actually, they aren't needed in this patch anymore as we decided to keep 
RuntimeProfile in Thrift. They were originally ported when converting 
TRuntimeProfile to protobuf


http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto@153
PS5, Line 153:
> can you document whether this should be sent incrementally as you go (poten
Yes, this is not idempotent. In fact, I notice that there may lead to 
inaccurate dml stats when RPC is retried due to timeout and I believe this is 
already a bug today.

The new patch moves 'insert_exec_status' and 'error_log' into 
FragmentInstanceExecStatusPB instead. As we do avoid applying any updates for a 
fragment instance once its final report has been received (see 
https://github.com/apache/impala/blob/master/be/src/runtime/coordinator-backend-state.cc#L258)
 so we should be safe from this problem in the new code.


http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift@380
PS5, Line 380:   6: optional Types.TNetworkAddress coord_address
> unrelated to this patch but it always bugs me to see changes to thrift fiel
Yes, unfortunately, our compatibility story is really poor. At some point,  we 
should commit to backward compatibility for Thrift / Protobuf structures.


http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py
File tests/failure/test_failpoints.py:

http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py@200
PS6, Line 200:
> flake8: E502 the backslash is redundant between brackets
Done


http://gerrit.cloudera.org:8080/#/c/10855/6/tests/failure/test_failpoints.py@202
PS6, Line 202:
> flake8: E231 missing whitespace after ':'
Done



--
To view, visit http://gerrit.cloudera.org:8080/10855
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7638583b433dcac066b87198e448743d90415ebe
Gerrit-Change-Number: 10855
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Wed, 08 Aug 2018 17:39:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7213: Port ReportExecStatus() RPC to use KRPC

Reply via email to