Given that Impala can't guarantee timely shutdown in the presence of full
network partitions, and it takes only ~130 Mbps of bandwidth to transmit
8 MiB in 500 ms, why are changes to the protocol or lifecycle of RPCs
necessary?
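(For scale: 8 MiB is 67,108,864 bits, and 67,108,864 bits / 0.5 s ≈ 134
Mbps, so roughly that ~130 Mbps of sustained throughput drains an 8 MiB
payload in half a second.)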

- Dan

On Fri, Oct 13, 2017 at 1:14 PM, Michael Ho <k...@cloudera.com> wrote:

> And to provide more context on Impala's query cancellation: the
> cancellation is asynchronous. In other words, the coordinator is only
> responsible for notifying the remote Impalad to shut down the fragment
> instances at a convenient time. The coordinator will not wait for
> acknowledgement from all Impalad nodes before proceeding. The assumption
> is that the executor threads will periodically poll for cancellation
> requests and free up their resources soon after they see one.
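>
> A minimal sketch of that polling pattern (hypothetical names; the real
> fragment-instance code differs):
>
>   // Executor thread checks an atomic cancellation flag between bounded
>   // units of work and releases resources promptly once the flag is set.
>   #include <atomic>
>
>   class FragmentInstance {
>    public:
>     void Cancel() { cancelled_.store(true, std::memory_order_release); }
>
>     void Run() {
>       while (!cancelled_.load(std::memory_order_acquire)) {
>         ProcessNextBatch();  // bounded unit of work between polls
>       }
>       ReleaseResources();    // free buffers, tear down exchange channels
>     }
>
>    private:
>     std::atomic<bool> cancelled_{false};
>     void ProcessNextBatch() { /* one batch of work */ }
>     void ReleaseResources() { /* free memory promptly */ }
>   };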
>
> On Fri, Oct 13, 2017 at 1:10 PM, Michael Ho <k...@cloudera.com> wrote:
>
>> The coordinator RPC to cancel a fragment instance usually has a much
>> smaller payload than the RPCs transmitting data between Impalad nodes,
>> so it has a higher chance of getting through. I agree that, when the
>> network is in bad shape, cancellation requests may not reach all Impalad
>> nodes. However, keep in mind that it's not necessarily the case that the
>> entire network is swamped. It may be that only one or a few nodes in the
>> cluster are being pounded with a lot of network traffic. In that case,
>> it seems fragment instances on all the other Impalad nodes will still
>> block during cancellation waiting for these few overloaded nodes, and
>> hold up memory resources on those Impalad nodes.
>>
>> Thanks,
>> Michael
>>
>> On Fri, Oct 13, 2017 at 11:30 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>> I'm still hung up on the premise that if the network is down,
>>> cancellation should still be expected to complete in bounded time. If
>>> the network is partitioned or congested to the point of non-delivery,
>>> how is the coordinator going to deliver the cancellation messages to
>>> the fragments in the first place? If we're allowing for arbitrary
>>> network delays, there is no way to make this guarantee. We're
>>> laser-focused on this issue of cancelling mid-transmission RPCs, but
>>> transmission is a very minor slice of the whole RPC lifecycle, and I
>>> don't believe it's worth complicating the RPC protocol or lifecycle
>>> over.
>>>
>>> - Dan
>>>
>>> On Fri, Oct 13, 2017 at 11:01 AM, Sailesh Mukil <sail...@cloudera.com>
>>> wrote:
>>>
>>>> On Fri, Oct 13, 2017 at 9:37 AM, Michael Ho <k...@cloudera.com> wrote:
>>>>
>>>> > There were some recent discussions about the proposed mechanism to
>>>> > cancel an RPC in mid-transmission. I am writing this email to
>>>> > summarize the problem cancellation is trying to solve and some
>>>> > proposed solutions.
>>>> >
>>>> > As part of IMPALA-2567
>>>> > <https://issues.apache.org/jira/browse/IMPALA-2567>, Impala is
>>>> > porting its RPC facility to the Kudu RPC (KRPC) library. One major
>>>> > feature of KRPC is asynchronous RPC, which removes the need for
>>>> > Impala to create a separate thread for each communication channel
>>>> > between fragment instances running on different Impala backends.
>>>> > While an asynchronous RPC is in progress, the KRPC code needs to
>>>> > hold on to the RPC payload, which is provided and owned by a
>>>> > fragment instance. There is a singleton messenger instance per
>>>> > Impalad node, and the connection between two hosts is shared by
>>>> > different fragment instances.
>>>> >
>>>> > At any point while an RPC is in progress, the coordinator can
>>>> > cancel all fragment instances of a query for various reasons, in
>>>> > which case the fragment execution threads need to cancel their
>>>> > in-flight RPCs and free up the RPC payloads (which are passed as
>>>> > sidecars of the RPC request) within a reasonable amount of time
>>>> > (e.g. 500ms). Currently in the KRPC code, when cancellation is
>>>> > requested while an outbound call is in the SENDING state (i.e. it
>>>> > has potentially sent some or all of the header to the remote
>>>> > server), KRPC waits for the entire payload to be sent before
>>>> > invoking the completion callback specified in the async RPC
>>>> > request. Only at that point can the KRPC client assume that the RPC
>>>> > payload is no longer referenced by the KRPC code and can be freed.
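>>>> >
>>>> > To make the ownership contract concrete, here is a minimal,
>>>> > hypothetical sketch (AsyncRpc and its methods are illustrative, not
>>>> > the real KRPC API):
>>>> >
>>>> >   #include <functional>
>>>> >   #include <memory>
>>>> >   #include <string>
>>>> >
>>>> >   // Stand-in for an async call: stores the completion callback and
>>>> >   // runs it once the reactor finishes the transfer.
>>>> >   struct AsyncRpc {
>>>> >     std::function<void()> on_complete;
>>>> >     void CallAsync(std::function<void()> cb) { on_complete = std::move(cb); }
>>>> >     void FinishSending() { if (on_complete) on_complete(); }
>>>> >   };
>>>> >
>>>> >   int main() {
>>>> >     // Payload owned by the fragment instance; shared ownership
>>>> >     // models "KRPC references it until the callback runs".
>>>> >     auto payload = std::make_shared<std::string>(8 << 20, 'x');
>>>> >
>>>> >     AsyncRpc rpc;
>>>> >     rpc.CallAsync([payload] {
>>>> >       // The last reference may be dropped here, and only here. If
>>>> >       // the call is in the SENDING state, this runs only after the
>>>> >       // whole payload has been written to the socket.
>>>> >     });
>>>> >     payload.reset();  // caller's ref gone; buffer lives via the lambda
>>>> >
>>>> >     rpc.FinishSending();  // reactor completes the transfer
>>>> >   }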
>>>> >
>>>> > When the network connections between Impalad backends are
>>>> > functioning well, it shouldn't take long for the payload to finish
>>>> > sending, as Impalad internally tries to bound the payload size to
>>>> > 8MB (though this is not guaranteed in some degenerate cases, e.g. a
>>>> > row with a 1GB string). However, when the network gets congested or
>>>> > is very lossy, there is in theory no bound on how long it takes to
>>>> > finish sending the payload. In other words, cancellation of the RPC
>>>> > can take an unbounded amount of time to complete, and thus
>>>> > cancellation of a fragment instance can also take arbitrarily long.
>>>> > This is a chicken-and-egg problem: usually, the user cancels the
>>>> > query precisely because poor network performance caused it to run
>>>> > for too long. With the unbounded wait, the user cannot even cancel
>>>> > the query within a bounded period of time.
>>>> >
>>>> > A similar problem arises when an RPC times out. As the KRPC code
>>>> > stands today, there is no guarantee that the payload is no longer
>>>> > referenced by the KRPC code when the completion callback is invoked
>>>> > due to a timeout. The client doesn't get any notification of when
>>>> > the payload can be freed. In theory, this can be solved by deferring
>>>> > the callback until the payload has been completely sent. However,
>>>> > that may not always be desirable due to the unbounded wait mentioned
>>>> > above. For example, if a client requests a timeout of 60s, it's not
>>>> > desirable to invoke the callback, say, 300s after the timeout.
>>>> >
>>>> > If you agree that the above are indeed problems, we need an RPC
>>>> > cancellation mechanism which meets the following requirements:
>>>> >
>>>> > - the RPC payload is no longer referenced by the KRPC code when the
>>>> >   completion callback is invoked.
>>>> > - the time between the cancellation request and the invocation of
>>>> >   the completion callback should be reasonably bounded.
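>>>> >
>>>> > Stated as an interface contract, roughly (illustrative C++ only,
>>>> > not a real KRPC class):
>>>> >
>>>> >   // Illustrative contract for a cancellable outbound call.
>>>> >   class CancellableRpc {
>>>> >    public:
>>>> >     // Requests cancellation. Guarantees: (1) once the completion
>>>> >     // callback has run, KRPC holds no references to the
>>>> >     // caller-owned payload; (2) the callback runs within a bounded
>>>> >     // delay of this call (e.g. <= 500ms), regardless of network
>>>> >     // condition.
>>>> >     virtual void Cancel() = 0;
>>>> >     virtual ~CancellableRpc() = default;
>>>> >   };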
>>>> >
>>>> > The following are two different proposals that have been floated:
>>>> >
>>>> > 1. Upgrade the KRPC protocol to insert a footer at the very end of
>>>> > each outbound call transfer (i.e. after the header, serialized
>>>> > protobuf request, sidecars, etc.). The footer contains a flag which
>>>> > is parsed by the server to determine whether the inbound call should
>>>> > be discarded. Normally, this flag is cleared. When an outbound call
>>>> > is cancelled while it's still in the SENDING state, the footer is
>>>> > modified in place to have the flag set. In addition, the sidecars of
>>>> > the outbound call transfer are modified to point to a dummy buffer.
>>>> > In this way, the slices in an outbound transfer no longer reference
>>>> > the original RPC payload, and the footer guarantees that whatever is
>>>> > received by the server will be discarded, so it's okay to replace
>>>> > the original payload with dummy bytes. This allows the completion
>>>> > callback to be invoked without waiting for the entire payload to be
>>>> > sent.
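>>>> >
>>>> > A rough sketch of the idea (hypothetical types; the real patch is
>>>> > linked below):
>>>> >
>>>> >   #include <cstdint>
>>>> >   #include <vector>
>>>> >
>>>> >   struct Slice { const uint8_t* data; size_t len; };
>>>> >
>>>> >   struct OutboundTransfer {
>>>> >     std::vector<Slice> sidecars;  // slices into caller-owned payload
>>>> >     uint8_t footer = 0;           // last byte on wire; 1 = discard call
>>>> >
>>>> >     // Run on the reactor thread when the call is cancelled mid-send.
>>>> >     void CancelInPlace(const Slice& dummy) {
>>>> >       footer = 1;                           // server discards inbound call
>>>> >       for (Slice& s : sidecars) s = dummy;  // drop refs to the payload
>>>> >       // The caller-owned payload is now unreferenced, so the
>>>> >       // completion callback can run immediately while the dummy
>>>> >       // remainder drains to the socket.
>>>> >     }
>>>> >   };
>>>> >
>>>> >   int main() {
>>>> >     uint8_t payload[8] = {}, dummy_byte = 0;
>>>> >     OutboundTransfer t{{{payload, sizeof(payload)}}};
>>>> >     t.CancelInPlace({&dummy_byte, 1});  // payload now unreferenced
>>>> >   }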
>>>> >
>>>> > Pros:
>>>> > - relatively low overhead in network consumption (1 byte per
>>>> >   outbound call transfer).
>>>> > - cancellation is complete once the reactor thread handles the
>>>> >   cancellation event, and is not reliant on the network condition.
>>>> >
>>>> > Cons:
>>>> > - the remaining bytes of the payload still need to be sent across a
>>>> >   congested network after cancellation.
>>>> > - added complexity to the protocol.
>>>> >
>>>> > The details of the proposal are in KUDU-2065
>>>> > <https://issues.apache.org/jira/browse/KUDU-2065>. The patch which
>>>> > implemented this idea is at https://gerrit.cloudera.org/#/c/7599/.
>>>> > It was later reverted due to KUDU-2110
>>>> > <https://issues.apache.org/jira/browse/KUDU-2110>.
>>>> >
>>>> > 2. Keep the KRPC protocol unchanged. When an outbound call is
>>>> > cancelled while it's in the SENDING state, register a timer with a
>>>> > timeout (e.g. 500ms). When the timer expires, invoke a callback
>>>> > which calls shutdown(fd, SHUT_WR) on the socket. This shuts down one
>>>> > half of the connection, preventing more bytes from being sent while
>>>> > still allowing bytes to be read on the client side. recv() will
>>>> > eventually return 0 on the server side. After all other in-flight
>>>> > RPCs on that connection have been replied to, the server should
>>>> > close the connection. The client, after reading all the replies,
>>>> > should also destroy the Connection object. All RPCs behind the
>>>> > cancelled RPC in the outbound queue on the client side are aborted
>>>> > and need to be retried.
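>>>> >
>>>> > The half-close itself is just the following (minimal sketch; timer
>>>> > wiring and error handling omitted):
>>>> >
>>>> >   #include <sys/socket.h>
>>>> >
>>>> >   // Invoked when the post-cancellation timer (e.g. 500ms) fires
>>>> >   // while the call is still stuck in the SENDING state.
>>>> >   void HalfCloseForCancellation(int fd) {
>>>> >     // Stop writes: no more payload bytes pile onto the congested
>>>> >     // link. Reads stay open so in-flight replies can be drained;
>>>> >     // the server's recv() eventually returns 0 and it closes the
>>>> >     // connection.
>>>> >     shutdown(fd, SHUT_WR);
>>>> >   }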
>>>> >
>>>> > Pros:
>>>> > - stops piling onto a congested network after cancelling an RPC.
>>>> > - cancellation is still bounded in time.
>>>> >
>>>> > Cons:
>>>> > - For Impalad, each host may be connected to all other hosts in the
>>>> >   cluster. In the worst case, this means closing all n connections.
>>>> >   In total, that could be n^2 connections being re-established
>>>> >   concurrently. Based on our testing on the bolt1k cluster, this can
>>>> >   easily trigger negotiation timeouts in a secure cluster, leading
>>>> >   to unpredictable failures.
>>>> >
>>>> > - The RPCs behind the cancelled RPC in the outbound queue cannot be
>>>> >   retried until all previous in-flight RPCs have been replied to.
>>>> >   This can take unboundedly long if the remote server is overloaded.
>>>> >
>>>> >
>>>> Regarding option 2, although this may work well with certain
>>>> workloads, I'm not sure that it would be the right way to go about it.
>>>> Also, in some scenarios, it could be very slow.
>>>>
>>>> From the viewpoint of Impala, cancellation can happen in many
>>>> scenarios. Other than a query being cancelled by a user, cancellations
>>>> also happen when there are runtime errors (OOM, a bad file found in
>>>> the storage layer, etc.).
>>>>
>>>> Even aside from the negotiation storm mentioned above by Michael, if
>>>> we were to tear down the connection for every cancel request we get,
>>>> it would slow down all other queries running on those nodes as well,
>>>> since they now have to wait longer for the connection to be
>>>> re-established.
>>>>
>>>> We can imagine a scenario where the cluster is under high stress,
>>>> leading quite a few queries to OOM, which would send out cancellations
>>>> to all in-flight RPCs for those respective queries, causing a chain of
>>>> cancellations. We can also imagine multiple CANCEL requests from
>>>> different queries being queued for the same connection, so the first
>>>> CANCEL request tears down the connection and sets it up again, only
>>>> for it to be torn down again by another queued CANCEL request. Sure,
>>>> we can try to detect multiple CANCEL requests in a queue and act only
>>>> once, but that goes back to adding complexity to the protocol.
>>>>
>>>> To my knowledge, I don't know of an RPC library that takes option 2,
>>>> but I know of at least one that does something similar to option 1
>>>> (gRPC). That's my two cents on the issue, but I may have missed
>>>> thinking about it from different perspectives, so feel free to correct
>>>> me if I got something wrong.
>>>>
>>>> > Please feel free to comment on the proposals above or suggest new
>>>> > ideas.
>>>> >
>>>> > --
>>>> > Thanks,
>>>> > Michael
>>>> >
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Michael
>>
>
>
>
> --
> Thanks,
> Michael
>
