Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

Patch Set 12:

Responding to comments from @Thomas.

> Then there's blacklisting, which only takes a single failed rpc to
> get a node blacklisted

Right, but that RPC has already been retried multiple times, right? I should probably validate that the RPCs are actually retried for these specific error codes.

> So, I don't think that we need to be overly afraid about
> blacklisting things aggressively, since if the node actually isn't
> unhealthy blacklisting it once isn't going to have that big of an
> impact on the cluster.

One caveat here is with executor groups. IIUC there is a concept of "executor group health": if enough nodes in an executor group are either blacklisted or removed via the statestore, Impala will mark that executor group as unhealthy and stop scheduling any queries on it. Stolen from the commit message for IMPALA-8484:

  Executors can be configured with an executor group through the newly
  added '--executor_groups' flag. [...] Each executor group specification
  can optionally contain a minimum size, separated by a ':', e.g.
  --executor_groups default-pool-1:3. Only when the cluster membership
  contains at least that number of executors for the group will it be
  considered for admission.

So, depending on an external auto-scaling policy, it's possible that blacklisting a node (even for 12 seconds) makes its executor group unhealthy and causes the auto-scaler to create a new executor group (a potentially expensive operation).

> That's particularly the case since this work is aimed at
> facilitating transparent query retry.
> We definitely want to avoid scenarios where a TransmitData rpc fails,
> we don't blacklist the node because the failure doesn't match our
> whitelist of posix codes, but we go ahead and retry the query anyways
> and then it fails again because we schedule it on the same bad node
> again.

Agreed. I think the first milestone for transparent query retries is to just handle the scenario where nodes crash while running a query. Based on some experiments I did, these were the only error codes that came up. I'm not confident enough in my understanding of Linux error codes to include any more than the current three, but I am open to suggestions.

In general, I think we can be pretty test-driven here. The best way to figure out which additional error codes to add would be to run some network fault injection tests and see what error codes Impala fails with. I'll open a follow-up JIRA for that.

> Of course, for this patch and the followup patch, we can just only
> mark the status as retryable if we do in fact decide to blacklist
> the node by doing the check for posix codes in KrpcDataStreamSender

I generally wanted to avoid doing this. I think it would be better for only the Coordinator to make the ultimate decision about whether a node is blacklisted; KrpcDataStreamSender just provides enough information for the Coordinator to make that decision. Otherwise, the logic for determining whether a node should be blacklisted would be spread across multiple classes, making blacklisting decisions hard to reason about.

However, you do bring up a good point: we don't want a failed RPC to trigger a query retry unless we are actually blacklisting the target node. After thinking it through some more, maybe marking queries as "retryable" isn't necessary at all. Instead, whenever a query fails and blacklists a node, it is retried, so query retries are blacklisting-driven. This would mean abandoning the patch for IMPALA-9138 (Classify certain errors as retryable), which I'm fine with doing.
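To make the "blacklisting-driven retries" idea concrete, here is a minimal Python sketch (not Impala's actual C++ code; the class, names, and the whitelisted errno values are hypothetical placeholders, since this comment does not list the patch's actual three codes): the sender only reports the POSIX code, the Coordinator alone decides whether to blacklist, and a transparent retry happens if and only if a node was actually blacklisted.

```python
import errno

# Hypothetical whitelist of POSIX codes that suggest an unreachable node.
# These are illustrative placeholders, NOT the actual codes in the patch.
BLACKLISTABLE_POSIX_CODES = {errno.ECONNREFUSED, errno.ECONNRESET, errno.ESHUTDOWN}

MAX_QUERY_RETRIES = 1  # maximum transparent retries (hypothetical value)

class Coordinator:
    """Sketch: the sender reports a failed RPC's POSIX code; only the
    Coordinator decides whether to blacklist the target node."""

    def __init__(self):
        self.blacklist = set()

    def on_rpc_failure(self, node, posix_code, retries_so_far):
        # The RPC layer is assumed to have already retried before the
        # failure reaches the Coordinator.
        if posix_code not in BLACKLISTABLE_POSIX_CODES:
            return False  # no blacklisting, so no transparent retry
        self.blacklist.add(node)
        # Retrying only after blacklisting guarantees the retried query
        # cannot be scheduled on the same bad node again.
        return retries_so_far < MAX_QUERY_RETRIES

coord = Coordinator()
print(coord.on_rpc_failure("host-2", errno.ECONNREFUSED, retries_so_far=0))  # True
print(coord.on_rpc_failure("host-3", errno.EINVAL, retries_so_far=0))        # False
```

Under this sketch there is no separate "retryable" status at all: the retry decision falls out of the blacklist decision, which keeps the logic in one class.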
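Circling back to the executor-group caveat above, here is a rough Python sketch of the minimum-size check described in the IMPALA-8484 commit message (function names are made up for illustration; this is not the actual scheduler code). A group spec like "default-pool-1:3" carries an optional minimum size, and the group is only considered for admission while at least that many of its executors are healthy, i.e., present in the cluster membership and not blacklisted.

```python
def parse_group_spec(spec):
    """Parse an executor group spec of the form "name[:min_size]".
    The minimum size is assumed to default to 1 when omitted."""
    name, sep, min_size = spec.partition(":")
    return name, int(min_size) if sep else 1

def group_is_healthy(spec, members, blacklist):
    """A group is considered for admission only while the number of
    non-blacklisted members meets the spec's minimum size."""
    _, min_size = parse_group_spec(spec)
    healthy = [m for m in members if m not in blacklist]
    return len(healthy) >= min_size

members = ["exec-1", "exec-2", "exec-3"]
print(group_is_healthy("default-pool-1:3", members, blacklist=set()))       # True
# Blacklisting a single node (even briefly) drops the group below its
# minimum size, which an external auto-scaler might react to.
print(group_is_healthy("default-pool-1:3", members, blacklist={"exec-2"}))  # False
```

This is why blacklisting "even for 12 seconds" is not necessarily free: a group sized exactly at its minimum flips to unhealthy on the first blacklisted node.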
Let me know if you feel differently.

> but there's still the problem that this patch will only blacklist
> one node per query.

Yeah, maybe this is too conservative. Assuming transparent query retries support "x" number of retries, Impala should be able to tolerate "x" faults across the duration of the query (and its retries). Maybe that is okay, maybe not. Don't have strong feelings here.

> To be clear - I think for the first iteration of all of this its
> probably fine to be fairly conservative about what we blacklist,
> and if sometimes queries fail when we could have made them succeed
> by being more aggressive that's okay (so long as they would've
> failed even without the blacklisting work, such that its not a
> regression).

Agree. It feels safer to add functionality to blacklisting incrementally.

> Just wanted to make the points that 1) blacklisting
> was designed in a way where its not that big of a deal and we don't
> need to worry too much about screwing up a cluster by blacklisting
> a node unnecessarily and 2) we probably really want to avoid the
> situation where we retry a query and it fails again for the same
> reason

Agree with both points.

--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <michael...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Dec 2019 21:02:30 +0000
Gerrit-HasComments: No