Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout ......................................................................
Patch Set 20: (23 comments) http://gerrit.cloudera.org:8080/#/c/3343/20//COMMIT_MSG Commit Message: Line 12: Impala doesn't set socket send/recv timeout for backend client. > Prior to this change, ... Done PS20, Line 24: new RPC call RetryRpcRecv() > This makes it sound like RetryRpcRecv() is a new RPC. I haven't looked at Done PS20, Line 28: call > delete (C of RPC stands for call) Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/common/global-flags.cc File be/src/common/global-flags.cc: PS20, Line 102: call > delete Done PS20, Line 102: which trigger caller hits RPC timeout. > to trigger an RPC timeout on the caller side. Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.cc File be/src/runtime/client-cache.cc: Line 162: client_map_.erase(client); > is it legal to use the 'client' iterator even though the lock was dropped i http://kera.name/articles/2011/06/iterator-invalidation-rules/ For insert, iterator is not affected. For Erasure, iterator is only affected to the erased elements. client is not shared so there shouldn't be more than one thread try to destroy the same client. right? Line 163: *client_key = NULL; > why doesn't this method need to remove from the per_host_caches_ as well? T Connection cannot be shared. when a connection is being use, it was taken out from per_host_caches. so we just don't put it back. The header comment means client_map_. I'll clarify that. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS20, Line 96: it > what is "it"? the client? or the connection? does this NULL out *client_ke "it" means the client. DestroyClient NULL out *client_key like Release Client() PS20, Line 96: cache > is this "the per-host cache" that is referenced everywhere else, or somethi Done PS20, Line 224: indicate > indicates if returned status is okay, means both RPC are completed. can or cannot retry really depends on the data being sent here. If this is TransmiteData, retry is not safe but if this is ReportStatusCb, retry is fine. Only caller knows that info. Line 225: /// the whole RPC call. > explain this more. is it because if the send RPC call failed, then the han that's right. Line 260: if (!status.ok()) { > so this is the only path where we have *retry_is_safe == true and an error Done Line 265: } catch (apache::thrift::TException& e) { > why does this path not need the timeout check also? I think the original logic is to give it a second chance but if it fails again, just fail the whole RPC. Line 271: client_is_unrecoverable_ = false; > why is this logic needed now, whereas it wasn't needed before? or is this a yes, it's another bug fix, I need to fix it in this patch, otherwise, the broken connection will cause lots of problem. IMPALA-2864 PS20, Line 301: the > whether the see my previous patch and Henry's comment "last_rpc_succeeded_" matches the code logic better. "client_is_unrecoverable_" to more clearly describe the state not sure which one is better. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: Line 115: ImpalaBackendClientCache* client_cache_; > now that we store the RuntimeState, let's get rid of this and just get the Done PS20, Line 153: wait > retry waiting for the response? Done PS20, Line 153: call > delete "call", and would be nice to write RPC Done PS20, Line 231: timeout_ms > where is this called with a non-zero timeout_ms? For now, we like to wait indefinitely to avoid logic change. But in the future, even after replace RPC implementation, it's better to have a reasonable timeout for all RPCs. Line 239: timeout_ms)); > this will throw away the original status message. Maybe use AddDetail() in Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: Line 125: if (!retry_is_safe) break; > so i guess this was a bug in the old code? yes, But for ReportExecStatus RPC, resend shouldn't cause any issue. handler just process the same status multiple times. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: PS20, Line 658: Rewrite > Add details to Done PS20, Line 669: Rewrite > same Done -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes