It's true that there is no timely shutdown in the presence of full network
partitions, but most likely the statestore will detect that some hosts are
unreachable and eventually fail a bunch of queries.

However, even without any network partition, the shutdown of query
fragments on many hosts may not happen in a timely manner if RPC
cancellation is stuck waiting on a few nodes that are overloaded with
network traffic.
This is a case in which we don't want to regress from our existing
implementation based on Thrift RPC.

On Fri, Oct 13, 2017 at 2:12 PM, Dan Burkert <danburk...@apache.org> wrote:

> Given that Impala can't guarantee timely shutdown in the presence of full
> network partitions, and it only takes 130Mbps of bandwidth to transmit 8MiB
> in 500ms, why are changes to the protocol or lifecycle of RPCs necessary?
>
> - Dan
>
> On Fri, Oct 13, 2017 at 1:14 PM, Michael Ho <k...@cloudera.com> wrote:
>
>> And to provide more context on Impala's query cancellation: the
>> cancellation is asynchronous. In other words, the coordinator is only
>> responsible for notifying the remote Impalads to shut down their fragment
>> instances at a convenient time. The coordinator will not wait for
>> acknowledgements from all Impalads before proceeding. The assumption is
>> that the executor threads will periodically poll for cancellation requests
>> and free up their resources soon after they see one.
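>>
>> To make the polling model concrete, here is a minimal sketch of how an
>> executor thread might check a cancellation flag between row batches. The
>> names below are hypothetical stand-ins, not the actual Impala code:
>>
>>   #include <atomic>
>>
>>   // Hypothetical stand-in for Impala's per-fragment runtime state.
>>   struct FragmentState {
>>     std::atomic<bool> cancelled{false};
>>   };
>>
>>   // The coordinator's cancel RPC just flips the flag and returns; it
>>   // does not wait for the fragment instance to actually stop.
>>   void Cancel(FragmentState* state) {
>>     state->cancelled.store(true, std::memory_order_relaxed);
>>   }
>>
>>   // Executor loop: poll the flag at convenient points (e.g. once per
>>   // row batch) and release resources soon after cancellation is seen.
>>   void ExecLoop(FragmentState* state) {
>>     while (!state->cancelled.load(std::memory_order_relaxed)) {
>>       // ... produce or consume one row batch ...
>>     }
>>     // ... free buffers, tear down exchange channels, etc. ...
>>   }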
>>
>> On Fri, Oct 13, 2017 at 1:10 PM, Michael Ho <k...@cloudera.com> wrote:
>>
>>> The coordinator RPC to cancel a fragment instance usually has a much
>>> smaller payload than the RPCs that transmit data between Impalad nodes,
>>> so it has a higher chance of getting through. I agree that, when the
>>> network is in bad shape, cancellation requests may not reach all Impalad
>>> nodes. However, keep in mind that it's not necessarily the case that the
>>> entire network is swamped. It can be that only one or a few nodes in the
>>> cluster are being pounded with a lot of network traffic. In that case, it
>>> seems the fragment instances on all other Impalad nodes will still block
>>> waiting on these few overloaded nodes during cancellation and hold up
>>> memory resources on those Impalad nodes.
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Fri, Oct 13, 2017 at 11:30 AM, Dan Burkert <d...@cloudera.com> wrote:
>>>
>>>> I'm still hung up on the premise that if the network is down,
>>>> cancellation should still be expected to complete in bounded time.  If the
>>>> network is partitioned or congested to the point of non-delivery, how is
>>>> the coordinator going to deliver the cancellation messages to the fragments
>>>> in the first place?  If we're allowing for arbitrary network delays, there
>>>> is no way to make this guarantee.  We're laser focussed on this issue of
>>>> cancelling mid-transmission RPCs, but transmission is a very minor slice of
>>>> the whole RPC lifecycle, and I don't believe it's worth complicating the
>>>> RPC protocol or lifecycle over.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, Oct 13, 2017 at 11:01 AM, Sailesh Mukil <sail...@cloudera.com>
>>>> wrote:
>>>>
>>>>> On Fri, Oct 13, 2017 at 9:37 AM, Michael Ho <k...@cloudera.com> wrote:
>>>>>
>>>>> > There were some recent discussions about the proposed mechanism to
>>>>> > cancel an RPC in mid-transmission. I am writing this email to
>>>>> > summarize the problem cancellation is trying to solve and some
>>>>> > proposed solutions.
>>>>> >
>>>>> > As part of IMPALA-2567
>>>>> > <https://issues.apache.org/jira/browse/IMPALA-2567>, Impala is porting
>>>>> > its RPC facility to utilize the Kudu RPC (KRPC) library. One major
>>>>> > feature of KRPC is asynchronous RPC. This removes the need for Impala
>>>>> > to create a separate thread for each communication channel between
>>>>> > fragment instances running on different Impala backends. While an
>>>>> > asynchronous RPC is in progress, the KRPC code needs to hold on to the
>>>>> > RPC payload provided and owned by a fragment instance. We have a
>>>>> > singleton messenger instance per Impalad node, and the connection
>>>>> > between two hosts is shared by different fragment instances.
>>>>> >
>>>>> > At any point while an RPC is in progress, the coordinator can cancel
>>>>> > all fragment instances of a query for various reasons. In that case,
>>>>> > the fragment execution threads need to cancel their in-flight RPCs and
>>>>> > free up the RPC payloads (e.g. buffers passed as sidecars of the RPC
>>>>> > request) within a reasonable amount of time (e.g. 500ms). Currently in
>>>>> > the KRPC code, when cancellation is requested while an outbound call
>>>>> > is in the SENDING state (i.e. it has potentially sent some or all of
>>>>> > the header to the remote server), it will wait for the entire payload
>>>>> > to be sent before invoking the completion callback specified in the
>>>>> > async RPC request. Only at that point can the KRPC client assume that
>>>>> > the RPC payload is no longer referenced by the KRPC code and can be
>>>>> > freed.
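>>>>> >
>>>>> > To illustrate the lifetime issue, here is a rough sketch of the
>>>>> > pattern a fragment instance relies on (hypothetical API names, not the
>>>>> > actual KRPC interface): the payload attached to an async call must
>>>>> > stay alive until the completion callback fires, so a late callback
>>>>> > pins that memory for just as long.
>>>>> >
>>>>> >   #include <cstddef>
>>>>> >   #include <functional>
>>>>> >   #include <memory>
>>>>> >
>>>>> >   // Hypothetical async client interface; the real KRPC API differs.
>>>>> >   struct Slice { const char* data; size_t len; };
>>>>> >
>>>>> >   class AsyncRpcClient {
>>>>> >    public:
>>>>> >     // Sends 'payload' asynchronously; 'done' runs once the RPC layer
>>>>> >     // no longer references the payload (success, failure or cancel).
>>>>> >     void TransmitDataAsync(Slice payload, std::function<void()> done);
>>>>> >   };
>>>>> >
>>>>> >   // The row-batch buffer must outlive the RPC, so the completion
>>>>> >   // callback holds the last reference and drops it when it is safe.
>>>>> >   void SendRowBatch(AsyncRpcClient* client,
>>>>> >                     std::shared_ptr<char[]> buf, size_t len) {
>>>>> >     client->TransmitDataAsync(Slice{buf.get(), len},
>>>>> >                               [buf] { /* payload freeable here */ });
>>>>> >   }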
>>>>> >
>>>>> > When the network connections between Impalad backends are functioning
>>>>> > well, it shouldn't take long for the payload to finish sending, as
>>>>> > Impalad internally tries to bound the payload size to 8MB (though this
>>>>> > is not guaranteed in some degenerate cases, e.g. a row with a 1GB
>>>>> > string). However, when the network gets congested or is very lossy,
>>>>> > there is in theory no bound on how long it takes to finish sending the
>>>>> > payload. In other words, the cancellation of the RPC can take an
>>>>> > unbounded amount of time to complete, and thus the cancellation of a
>>>>> > fragment instance can also take arbitrarily long. This is a
>>>>> > chicken-and-egg problem: usually the user cancels the query precisely
>>>>> > because poor network performance caused it to run for too long, yet
>>>>> > with the unbounded wait the user cannot even cancel the query within a
>>>>> > bounded period of time.
>>>>> >
>>>>> > A similar problem arises when an RPC times out. As the KRPC code
>>>>> > stands today, there is no guarantee that the payload is no longer
>>>>> > referenced by the KRPC code when the completion callback is invoked
>>>>> > due to a timeout. The client doesn't get any notification of when the
>>>>> > payload can be freed. In theory, this could be solved by deferring the
>>>>> > callback until the payload has been completely sent. However, that may
>>>>> > not always be desirable due to the unbounded wait mentioned above. For
>>>>> > example, if a client requests a timeout of 60s, it's not desirable to
>>>>> > invoke the callback, say, 300s after the timeout.
>>>>> >
>>>>> > If you agree that the above are indeed problems, we need an RPC
>>>>> > cancellation mechanism that meets the following requirements:
>>>>> >
>>>>> > - the RPC payload is no longer referenced by the KRPC code when the
>>>>> > completion callback is invoked.
>>>>> > - the time between the cancellation request and the invocation of the
>>>>> > completion callback should be reasonably bounded.
>>>>> >
>>>>> > The following are the two proposals that have been floated:
>>>>> >
>>>>> > 1. Upgrade the KRPC protocol to insert a footer at the very end of
>>>>> > each outbound call transfer (i.e. after the header, the serialized
>>>>> > protobuf request, sidecars, etc). The footer contains a flag which the
>>>>> > server parses to determine whether the inbound call should be
>>>>> > discarded. Normally, this flag is cleared. When an outbound call is
>>>>> > cancelled while it's still in the SENDING state, the footer is
>>>>> > modified in place to set the flag. In addition, the sidecars of the
>>>>> > outbound transfer are modified to point to a dummy buffer. This way,
>>>>> > the slices in an outbound transfer no longer reference the original
>>>>> > payload of the RPC, and the footer guarantees that whatever the server
>>>>> > receives will be discarded, so it's okay to replace the original
>>>>> > payload with dummy bytes. This allows the completion callback to be
>>>>> > invoked without waiting for the entire payload to be sent.
>>>>> >
>>>>> > Pros:
>>>>> > - relatively low overhead in network consumption (1 byte per outbound
>>>>> > call transfer)
>>>>> > - cancellation is complete once the reactor thread handles the
>>>>> > cancellation event and is not reliant on the network condition.
>>>>> >
>>>>> > Cons:
>>>>> > - the remaining bytes of the payload still need to be sent across a
>>>>> > congested network after cancellation.
>>>>> > - added complexity to the protocol.
>>>>> >
>>>>> > The details of the proposal are in KUDU-2065
>>>>> > <https://issues.apache.org/jira/browse/KUDU-2065>. The patch which
>>>>> > implemented this idea is at https://gerrit.cloudera.org/#/c/7599/
>>>>> > It was later reverted due to KUDU-2110
>>>>> > <https://issues.apache.org/jira/browse/KUDU-2110>.
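>>>>> >
>>>>> > For concreteness, here is a rough sketch of what proposal 1 amounts to
>>>>> > on the client side (hypothetical types, not the actual KRPC code): the
>>>>> > transfer carries a footer flag that can be flipped in place, and
>>>>> > cancellation repoints the remaining payload slices at a dummy buffer.
>>>>> >
>>>>> >   #include <cstddef>
>>>>> >   #include <cstdint>
>>>>> >   #include <vector>
>>>>> >
>>>>> >   struct Slice { const uint8_t* data; size_t len; };
>>>>> >
>>>>> >   // Dummy bytes standing in for not-yet-sent payload slices (assumes
>>>>> >   // each slice stays within the ~8MB bound Impalad aims for).
>>>>> >   static const std::vector<uint8_t> kDummy(8 << 20, 0);
>>>>> >
>>>>> >   struct OutboundTransfer {
>>>>> >     std::vector<Slice> slices;  // header, request protobuf, sidecars
>>>>> >     uint8_t footer_flag = 0;    // sent last; 0 = deliver, 1 = discard
>>>>> >     size_t cur_slice = 0;       // first slice not yet fully sent
>>>>> >
>>>>> >     // Runs on the reactor thread when the owning RPC is cancelled
>>>>> >     // while the transfer is still SENDING (partial-slice offsets and
>>>>> >     // the footer's wire encoding are glossed over here).
>>>>> >     void Cancel() {
>>>>> >       footer_flag = 1;  // tells the server to drop the inbound call
>>>>> >       for (size_t i = cur_slice; i < slices.size(); ++i) {
>>>>> >         // The caller's payload is no longer referenced and can be
>>>>> >         // freed right away; the wire carries the same byte count.
>>>>> >         slices[i] = Slice{kDummy.data(), slices[i].len};
>>>>> >       }
>>>>> >     }
>>>>> >   };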
>>>>>
>>>>> >
>>>>> > 2. Keep the KRPC protocol unchanged. When an outbound call is
>>>>> > cancelled while it's in the SENDING state, register a timer with a
>>>>> > timeout (e.g. 500ms). When the timer expires, invoke a callback that
>>>>> > calls shutdown(fd, SHUT_WR) on the socket. This shuts down one half of
>>>>> > the connection, preventing more bytes from being sent while still
>>>>> > allowing replies to be read on the client side. recv() will eventually
>>>>> > return 0 on the server side. After all other in-flight RPCs on that
>>>>> > connection have been replied to, the server should close the
>>>>> > connection. The client, after reading all the replies, should also
>>>>> > destroy the Connection object. All RPCs behind the cancelled RPC in
>>>>> > the outbound queue on the client side are aborted and need to be
>>>>> > retried.
>>>>> >
>>>>> > Pros:
>>>>> > - stop piling onto a congested network after cancelling an RPC.
>>>>> > - cancellation is still bounded in time.
>>>>> >
>>>>> > Cons:
>>>>> > - For Impalad, each host may be connected to all other hosts in the
>>>>> > cluster. In the worst case, this means closing all n connections; in
>>>>> > total, that could be n^2 connections being re-established concurrently.
>>>>> > Based on our testing on the bolt1k cluster, this can easily trigger
>>>>> > negotiation timeouts in a secure cluster, leading to unpredictable
>>>>> > failures.
>>>>> >
>>>>> > - The RPCs behind the cancelled RPC in the outbound queue cannot be
>>>>> > retried until all previous in-flight RPCs have been replied to. This
>>>>> > can take unboundedly long if the remote server is overloaded.
>>>>> >
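>>>>> > As a rough illustration of the mechanism in proposal 2 (again only a
>>>>> > sketch: a detached thread stands in for a reactor timer, and a real
>>>>> > implementation would skip the shutdown if the transfer finishes within
>>>>> > the grace period):
>>>>> >
>>>>> >   #include <sys/socket.h>
>>>>> >
>>>>> >   #include <chrono>
>>>>> >   #include <thread>
>>>>> >
>>>>> >   // After cancelling an RPC stuck in SENDING, wait a short grace
>>>>> >   // period and then half-close the write side of the socket.
>>>>> >   void ShutdownWriteAfterGracePeriod(
>>>>> >       int fd, std::chrono::milliseconds grace =
>>>>> >                   std::chrono::milliseconds(500)) {
>>>>> >     std::thread([fd, grace] {
>>>>> >       std::this_thread::sleep_for(grace);
>>>>> >       // SHUT_WR: no more payload bytes go out; the peer's recv()
>>>>> >       // eventually returns 0, but replies already in flight can
>>>>> >       // still be read on this end.
>>>>> >       ::shutdown(fd, SHUT_WR);
>>>>> >     }).detach();
>>>>> >   }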
>>>>> >
>>>>> Regarding option 2, although this may work well with certain workloads,
>>>>> I'm not sure that this would be the right way to go about it. Also, in
>>>>> some scenarios, it could be very slow.
>>>>>
>>>>> From Impala's viewpoint, cancellation can happen in many scenarios.
>>>>> Other than a query being cancelled by a user, cancellations also happen
>>>>> when there are runtime errors (OOM, a bad file found in the storage
>>>>> layer, etc.).
>>>>>
>>>>> Even aside from the negotiation storm Michael mentioned above, if we
>>>>> were to tear down the connection for every cancel request we get, it
>>>>> would slow down all other queries running on those nodes as well, since
>>>>> they now have to wait longer for the connection to be re-established.
>>>>>
>>>>> We can imagine a scenario where the cluster is under high stress,
>>>>> leading quite a few queries to hit OOM, which would send out
>>>>> cancellations for all in-flight RPCs of those queries, causing a chain
>>>>> of cancellations. We can also imagine multiple CANCEL requests from
>>>>> different queries being queued for the same connection, so the first
>>>>> CANCEL request tears down the connection and sets it up again, only for
>>>>> it to be torn down again by another queued CANCEL request. Sure, we can
>>>>> try to detect multiple CANCEL requests in the queue and do it only once,
>>>>> but that goes back to adding complexity to the protocol.
>>>>>
>>>>> To my knowledge, I don't know of an RPC library that takes option 2,
>>>>> but I know of at least one that does something similar to option 1
>>>>> (gRPC). That's my two cents on the issue, but I may have missed some
>>>>> perspectives, so feel free to correct me if I got something wrong.
>>>>>
>>>>> > Please feel free to comment on the proposals above or suggest new
>>>>> > ideas.
>>>>> >
>>>>> > --
>>>>> > Thanks,
>>>>> > Michael
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Michael
>>>
>>
>>
>>
>> --
>> Thanks,
>> Michael
>>
>
>


-- 
Thanks,
Michael
