Juan Yu has posted comments on this change.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc 
timeout
......................................................................


Patch Set 20:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/3343/20//COMMIT_MSG
Commit Message:

Line 12: Impala doesn't set socket send/recv timeout for backend client.
> Prior to this change, ...
Done


PS20, Line 24: new RPC call RetryRpcRecv()
> This makes it sound like RetryRpcRecv() is a new RPC.  I haven't looked at 
Done


PS20, Line 28: call
> delete (C of RPC stands for call)
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

PS20, Line 102: call
> delete
Done


PS20, Line 102: which trigger caller hits RPC timeout.
> to trigger an RPC timeout on the caller side.
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.cc
File be/src/runtime/client-cache.cc:

Line 162:   client_map_.erase(client);
> is it legal to use the 'client' iterator even though the lock was dropped i
http://kera.name/articles/2011/06/iterator-invalidation-rules/
For insert, iterator is not affected. For Erasure, iterator is only affected to 
the erased elements.
client is not shared so there shouldn't be more than one thread try to destroy 
the same client. right?


Line 163:   *client_key = NULL;
> why doesn't this method need to remove from the per_host_caches_ as well? T
Connection cannot be shared. when a connection is being use, it was taken out 
from per_host_caches. so we just don't put it back.
The header comment means client_map_. I'll clarify that.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h
File be/src/runtime/client-cache.h:

PS20, Line 96: it
> what is "it"? the client? or the connection?  does this NULL out *client_ke
"it" means the client.  DestroyClient NULL out *client_key like Release Client()


PS20, Line 96: cache
> is this "the per-host cache" that is referenced everywhere else, or somethi
Done


PS20, Line 224: indicate
> indicates
if returned status is okay, means both RPC are completed. can or cannot retry 
really depends on the data being sent here.
If this is TransmiteData, retry is not safe
but if this is ReportStatusCb, retry is fine.
Only caller knows that info.


Line 225:   /// the whole RPC call.
> explain this more.  is it because if the send RPC call failed, then the han
that's right.


Line 260:       if (!status.ok()) {
> so this is the only path where we have *retry_is_safe == true and an error 
Done


Line 265:       } catch (apache::thrift::TException& e) {
> why does this path not need the timeout check also?
I think the original logic is to give it a second chance but if it fails again, 
just fail the whole RPC.


Line 271:     client_is_unrecoverable_ = false;
> why is this logic needed now, whereas it wasn't needed before? or is this a
yes, it's another bug fix, I need to fix it in this patch, otherwise, the 
broken connection will cause lots of problem.
IMPALA-2864


PS20, Line 301: the
> whether the
see my previous patch and Henry's comment
"last_rpc_succeeded_" matches the code logic better.
"client_is_unrecoverable_" to more clearly describe the state
not sure which one is better.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

Line 115:   ImpalaBackendClientCache* client_cache_;
> now that we store the RuntimeState, let's get rid of this and just get the 
Done


PS20, Line 153: wait
> retry waiting for the response?
Done


PS20, Line 153: call
> delete "call", and would be nice to write RPC
Done


PS20, Line 231: timeout_ms
> where is this called with a non-zero timeout_ms?
For now, we like to wait indefinitely to avoid logic change.
But in the future, even after replace RPC implementation, it's better to have a 
reasonable timeout for all RPCs.


Line 239:               timeout_ms));
> this will throw away the original status message.  Maybe use AddDetail() in
Done


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-exec-state.cc
File be/src/service/fragment-exec-state.cc:

Line 125:     if (!retry_is_safe) break;
> so i guess this was a bug in the old code?
yes, But for ReportExecStatus RPC, resend shouldn't cause any issue. handler 
just process the same status multiple times.


http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

PS20, Line 658: Rewrite
> Add details to
Done


PS20, Line 669: Rewrite
> same
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 20
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to