It's true that there is no timely shutdown in the presence of full network
partitions, but most likely the statestore will detect that some hosts are
unreachable and eventually fail a bunch of queries.

However, even without any network partition, the shutdown of query fragments
on many hosts may not happen in a timely manner if RPC cancellation is stuck
waiting on a few nodes that are overloaded with network traffic. This is a
case in which we don't want to regress from our existing Thrift-based RPC
implementation.
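(For reference, the 130Mbps figure quoted below checks out: 8 MiB is about
67.1 Mbit, and 67.1 Mbit / 0.5 s is roughly 134 Mbit/s. The disagreement is
not over the arithmetic but over whether that bandwidth is actually available
once a node is saturated.)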
On Fri, Oct 13, 2017 at 2:12 PM, Dan Burkert <danburk...@apache.org> wrote:

> Given that Impala can't guarantee timely shutdown in the presence of full
> network partitions, and it only takes 130Mbps of bandwidth to transmit
> 8MiB in 500ms, why are changes to the protocol or lifecycle of RPCs
> necessary?
>
> - Dan
>
> On Fri, Oct 13, 2017 at 1:14 PM, Michael Ho <k...@cloudera.com> wrote:
>
>> And to provide more context for Impala's query cancellation: the
>> cancellation is asynchronous. In other words, the coordinator is only
>> responsible for notifying the remote Impalads to shut down their
>> fragment instances at a convenient time. The coordinator does not wait
>> for acknowledgement from all Impalads before proceeding. The assumption
>> is that the executor threads will periodically poll for a cancellation
>> request and free up their resources soon after they see one.
>>
>> On Fri, Oct 13, 2017 at 1:10 PM, Michael Ho <k...@cloudera.com> wrote:
>>
>>> The coordinator's RPC to cancel a fragment instance usually has a much
>>> smaller payload than the RPCs transmitting data between Impalad nodes,
>>> so it has a higher chance of getting through. I agree that, when the
>>> network is in bad shape, cancellation requests may not reach all
>>> Impalad nodes. However, keep in mind that it's not necessarily the
>>> case that the entire network is swamped. It can be that only one or a
>>> few nodes in the cluster are being pounded with a lot of network
>>> traffic. In those cases, all fragment instances on the other Impalad
>>> nodes will still block on these few overloaded nodes during
>>> cancellation and hold up memory resources on their hosts.
>>>
>>> Thanks,
>>> Michael
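For illustration, here is a minimal sketch of the asynchronous (fire-and-
forget) cancellation pattern described above: the coordinator-side handler
only sets a flag, and the executor thread polls it between bounded units of
work, so cancellation latency is bounded by one unit. All names here are
hypothetical, not the actual Impala classes.

    // Hypothetical sketch, not actual Impala code.
    #include <atomic>
    #include <chrono>
    #include <thread>

    class FragmentInstance {
     public:
      // Called by the cancellation RPC handler; returns immediately
      // without waiting for the fragment to actually stop.
      void Cancel() { cancelled_.store(true, std::memory_order_relaxed); }

      // Executor thread: does a bounded unit of work, then polls the
      // flag, so cancellation is noticed within one unit of work.
      void Run() {
        while (!cancelled_.load(std::memory_order_relaxed)) {
          ProcessNextRowBatch();
        }
        ReleaseResources();  // free memory soon after seeing the flag
      }

     private:
      void ProcessNextRowBatch() {
        // Stand-in for real work on one row batch.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
      void ReleaseResources() {}
      std::atomic<bool> cancelled_{false};
    };

    int main() {
      FragmentInstance fi;
      std::thread exec([&] { fi.Run(); });
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
      fi.Cancel();  // coordinator side: returns immediately
      exec.join();  // executor stops within roughly one row batch
    }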
>>> On Fri, Oct 13, 2017 at 11:30 AM, Dan Burkert <d...@cloudera.com> wrote:
>>>
>>>> I'm still hung up on the premise that if the network is down,
>>>> cancellation should still be expected to complete in bounded time. If
>>>> the network is partitioned or congested to the point of non-delivery,
>>>> how is the coordinator going to deliver the cancellation messages to
>>>> the fragments in the first place? If we're allowing for arbitrary
>>>> network delays, there is no way to make this guarantee. We're
>>>> laser-focused on this issue of cancelling mid-transmission RPCs, but
>>>> transmission is a very minor slice of the whole RPC lifecycle, and I
>>>> don't believe it's worth complicating the RPC protocol or lifecycle
>>>> over.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, Oct 13, 2017 at 11:01 AM, Sailesh Mukil <sail...@cloudera.com>
>>>> wrote:
>>>>
>>>>> On Fri, Oct 13, 2017 at 9:37 AM, Michael Ho <k...@cloudera.com> wrote:
>>>>>
>>>>>> There were some recent discussions about the proposed mechanism to
>>>>>> cancel an RPC in mid-transmission. I am writing this email to
>>>>>> summarize the problem cancellation is trying to solve and some
>>>>>> proposed solutions.
>>>>>>
>>>>>> As part of IMPALA-2567
>>>>>> <https://issues.apache.org/jira/browse/IMPALA-2567>, Impala is
>>>>>> porting its RPC facility to the Kudu RPC library. One major feature
>>>>>> of KRPC is asynchronous RPC. This removes the need for Impala to
>>>>>> create a separate thread for each communication channel between
>>>>>> fragment instances running on different Impala backends. While an
>>>>>> asynchronous RPC is in progress, the KRPC code needs to hold on to
>>>>>> the RPC payload, which is provided and owned by a fragment instance.
>>>>>> We have a singleton messenger instance per Impalad node, and the
>>>>>> connection between two hosts is shared by different fragment
>>>>>> instances.
>>>>>>
>>>>>> At any point while an RPC is in progress, the coordinator may cancel
>>>>>> all fragment instances of a query for various reasons. In that case,
>>>>>> the fragment execution threads need to cancel their in-flight RPCs
>>>>>> and free up the RPC payloads (which are passed, e.g., as sidecars of
>>>>>> the RPC request) within a reasonable amount of time (e.g. 500ms).
>>>>>> Currently in the KRPC code, when cancellation is requested while an
>>>>>> outbound call is in the SENDING state (i.e. it has potentially sent
>>>>>> part or all of the header to the remote server), it waits for the
>>>>>> entire payload to be sent before invoking the completion callback
>>>>>> specified in the async RPC request. At that point, the KRPC client
>>>>>> can assume that the RPC payload is no longer referenced by the KRPC
>>>>>> code and can be freed.
>>>>>>
>>>>>> When the network connections between Impalad backends are
>>>>>> functioning well, it shouldn't take long for the payload to finish
>>>>>> sending, as Impalad internally tries to bound the payload size to
>>>>>> 8MB (though this is not guaranteed in some degenerate cases, e.g. a
>>>>>> row with a 1GB string). However, when the network gets congested or
>>>>>> very lossy, there is in theory no bound on how long it takes to
>>>>>> finish sending the payload. In other words, cancelling the RPC can
>>>>>> take an unbounded amount of time, and thus cancelling a fragment
>>>>>> instance can also take infinitely long. This is a chicken-and-egg
>>>>>> problem: usually, the user cancels the query because poor network
>>>>>> performance has caused it to run for too long, yet with the
>>>>>> unbounded wait, the user cannot even cancel the query within a
>>>>>> bounded period of time.
>>>>>>
>>>>>> A similar problem arises when an RPC times out. As the KRPC code
>>>>>> stands today, there is no guarantee that the payload is no longer
>>>>>> referenced by the KRPC code when the completion callback is invoked
>>>>>> due to a timeout. The client doesn't get any notification of when
>>>>>> the payload can be freed. In theory, this can be solved by deferring
>>>>>> the callback until the payload has been completely sent. However,
>>>>>> that is not always desirable, due to the unbounded wait mentioned
>>>>>> above. For example, if a client requests a timeout of 60s, it's not
>>>>>> desirable to invoke the callback say 300s after the timeout.
>>>>>>
>>>>>> If you agree that the above are indeed problems, we need an RPC
>>>>>> cancellation mechanism which meets the following requirements:
>>>>>>
>>>>>> - the RPC payload is no longer referenced by the KRPC code when the
>>>>>> completion callback is invoked.
>>>>>> - the time between the cancellation request and the invocation of
>>>>>> the completion callback is reasonably bounded.
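To make these two requirements concrete, here is a minimal, self-contained
sketch of a sender that satisfies both: the payload is released before the
completion callback runs, and Cancel() bounds the remaining wait to roughly
one chunk of send time. The class and its API are hypothetical, not the
KRPC interface.

    // Hypothetical sketch of the desired contract, not actual KRPC code.
    #include <atomic>
    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <thread>
    #include <vector>

    using Payload = std::vector<char>;

    class AsyncRpcCall {
     public:
      // Takes ownership of 'payload'. 'done(ok)' is invoked exactly once;
      // by the time it runs, the payload is no longer referenced here.
      void Send(std::unique_ptr<Payload> payload,
                std::function<void(bool ok)> done) {
        worker_ = std::thread(
            [this, p = std::move(payload), done = std::move(done)]() mutable {
          const size_t chunk = 64 * 1024;
          for (size_t sent = 0; sent < p->size(); sent += chunk) {
            if (cancelled_.load()) {
              p.reset();    // requirement 1: payload released...
              done(false);  // ...before the completion callback runs
              return;
            }
            // Stand-in for writing one chunk to the socket.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
          }
          p.reset();
          done(true);
        });
      }
      // Requirement 2: the callback fires within ~one chunk of this call.
      void Cancel() { cancelled_.store(true); }
      ~AsyncRpcCall() { if (worker_.joinable()) worker_.join(); }

     private:
      std::atomic<bool> cancelled_{false};
      std::thread worker_;
    };

    int main() {
      AsyncRpcCall call;
      auto payload = std::make_unique<Payload>(8 << 20);  // the 8MB bound
      call.Send(std::move(payload),
                [](bool ok) { std::cout << (ok ? "sent\n" : "cancelled\n"); });
      std::this_thread::sleep_for(std::chrono::milliseconds(5));
      call.Cancel();
    }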
>>>>>> The following are the two proposals that have been floated:
>>>>>>
>>>>>> 1. Upgrade the KRPC protocol to insert a footer at the very end of
>>>>>> each outbound call transfer (i.e. after the header, the serialized
>>>>>> protobuf request, sidecars, etc). The footer contains a flag which
>>>>>> is parsed by the server to determine whether the inbound call should
>>>>>> be discarded. Normally, this flag is cleared. When an outbound call
>>>>>> is cancelled while still in the SENDING state, the footer is
>>>>>> modified in place to have the flag set. In addition, the sidecars of
>>>>>> the outbound call transfer are modified to point to a dummy buffer.
>>>>>> In this way, the slices in an outbound transfer no longer reference
>>>>>> the original payload of the RPC, and the footer guarantees that
>>>>>> whatever is received by the server will be discarded, so it's okay
>>>>>> to replace the original payload with dummy bytes. This allows the
>>>>>> completion callback to be invoked without waiting for the entire
>>>>>> payload to be sent.
>>>>>>
>>>>>> Pros:
>>>>>> - relatively low overhead in network consumption (1 byte per
>>>>>> outbound call transfer)
>>>>>> - cancellation completes once the reactor thread handles the
>>>>>> cancellation event, and is not reliant on the network condition.
>>>>>>
>>>>>> Cons:
>>>>>> - the remaining bytes of the payload still need to be sent across a
>>>>>> congested network after cancellation.
>>>>>> - added complexity to the protocol.
>>>>>>
>>>>>> The details of the proposal are in KUDU-2065
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2065>. The patch which
>>>>>> implemented this idea is at https://gerrit.cloudera.org/#/c/7599/
>>>>>> It was later reverted due to KUDU-2110
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2110>.
>>>>>>
>>>>>> 2. Keep the KRPC protocol unchanged. When an outbound call is
>>>>>> cancelled while in the SENDING state, register a timer with a
>>>>>> timeout (e.g. 500ms). When the timer expires, invoke a callback that
>>>>>> calls shutdown(fd, SHUT_WR) on the socket. This shuts down one half
>>>>>> of the connection, preventing more bytes from being sent while still
>>>>>> allowing bytes to be read on the client side. recv() will eventually
>>>>>> return 0 on the server side. After all other in-flight RPCs on that
>>>>>> connection have been replied to, the server should close the
>>>>>> connection. The client, after reading all the replies, should also
>>>>>> destroy the Connection object. All RPCs behind the cancelled RPC in
>>>>>> the outbound queue on the client side are aborted and need to be
>>>>>> retried.
>>>>>>
>>>>>> Pros:
>>>>>> - stops piling onto a congested network after cancelling an RPC.
>>>>>> - cancellation is still bounded in time.
>>>>>>
>>>>>> Cons:
>>>>>> - For Impalad, each host may be connected to all other hosts in the
>>>>>> cluster. In the worst case, this means closing all n connections;
>>>>>> in total, that could be n^2 connections being re-established
>>>>>> concurrently. Based on our testing on the bolt1k cluster, this can
>>>>>> easily trigger negotiation timeouts in a secure cluster, leading to
>>>>>> unpredictable failures.
>>>>>> - The RPCs behind the cancelled RPC in the outbound queue cannot be
>>>>>> retried until all previous in-flight RPCs have been replied to. This
>>>>>> can take unboundedly long if the remote server is overloaded.
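For reference, a minimal sketch of the half-close that proposal 2 relies
on, using plain POSIX sockets rather than the actual KRPC Connection code;
the timer is simplified to a detached sleeping thread, and the function
name and socketpair() demo are illustrative.

    // Sketch of proposal 2's half-close on a POSIX socket; not KRPC code.
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <thread>

    // Called when an in-SENDING outbound call is cancelled: after a grace
    // period, stop the write side of the connection. The peer's recv()
    // eventually returns 0 (EOF), while our read side stays open so
    // replies to other in-flight RPCs can still be drained.
    void ScheduleWriteShutdown(int fd, std::chrono::milliseconds grace) {
      std::thread([fd, grace] {
        std::this_thread::sleep_for(grace);  // e.g. 500ms, per the proposal
        shutdown(fd, SHUT_WR);               // half-close: no more sends
        // close(fd) happens elsewhere, only after all expected replies
        // have been read off the socket.
      }).detach();
    }

    int main() {
      int fds[2];
      socketpair(AF_UNIX, SOCK_STREAM, 0, fds);  // stand-in for a TCP conn
      ScheduleWriteShutdown(fds[0], std::chrono::milliseconds(500));
      char buf[16];
      ssize_t n = read(fds[1], buf, sizeof(buf));  // 0 (EOF) after shutdown
      close(fds[0]);
      close(fds[1]);
      return n == 0 ? 0 : 1;
    }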
>>>>> Regarding option 2, although this may work well with certain
>>>>> workloads, I'm not sure that it would be the right way to go about
>>>>> it. Also, in some scenarios, it could be very slow.
>>>>>
>>>>> From the viewpoint of Impala, cancellation can happen in many
>>>>> scenarios. Other than a query being cancelled by a user,
>>>>> cancellations also happen when there are runtime errors (OOM, a bad
>>>>> file found in the storage layer, etc.).
>>>>>
>>>>> Even aside from the negotiation storm mentioned above by Michael, if
>>>>> we were to tear down the connection for every cancel request we get,
>>>>> it would slow down all other queries running on those nodes as well,
>>>>> since they would have to wait until the connection is set up again.
>>>>>
>>>>> We can imagine a scenario where the cluster is under high stress,
>>>>> leading quite a few queries to OOM, which would send out
>>>>> cancellations to all in-flight RPCs for those respective queries,
>>>>> causing a chain of cancellations. We can also imagine multiple CANCEL
>>>>> requests from different queries being queued for the same connection,
>>>>> so the first CANCEL request tears down the connection and sets it up
>>>>> again, only for it to be torn down again by another queued CANCEL
>>>>> request. Sure, we could try to detect multiple CANCEL requests in a
>>>>> queue and act on them only once, but that goes back to adding
>>>>> complexity to the protocol.
>>>>>
>>>>> To my knowledge, I don't know of an RPC library that takes option 2,
>>>>> but I know of at least one that does something similar to option 1
>>>>> (grpc). That's my two cents on the issue, but I may have missed
>>>>> thinking about it from different perspectives, so feel free to
>>>>> correct me if I got something wrong.
>>>>>
>>>>>> Please feel free to comment on the proposals above or suggest new
>>>>>> ideas.
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Michael

--
Thanks,
Michael