[jira] [Updated] (HDFS-11028) libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel pending connections

James Clampffer (JIRA) Thu, 22 Dec 2016 11:05:14 -0800

     [ https://issues.apache.org/jira/browse/HDFS-11028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


James Clampffer updated HDFS-11028:
-----------------------------------
    Attachment: HDFS-11028.HDFS-8707.000.patch

Posting the cancel mega-patch.  There is still plenty of work to do, but I 
wanted to put it up in case anyone is curious (I will be away for 1-2 weeks).  
The bulk of the changed code just forwards handles down to the async 
operations so they can be bound.

Includes:
- connection-level cancels (tested with the included cancelable connect tool)
- individual RPC-level cancels (not well tested)
- all cancels exposed in the C API

I'm going to try to split connect cancels from RPC cancels, since they are 
similar but use mostly separate code paths.  RPC cancels will get a new jira.  
The C bindings will be ripped out for now, since I haven't tested them at all 
and they aren't a priority for my work yet.

> libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel 
> pending connections
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11028
>                 URL: https://issues.apache.org/jira/browse/HDFS-11028
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-11028.HDFS-8707.000.patch
>
>
> Cancel support is now reasonably robust except in the case where a 
> FileHandle operation ends up causing the RpcEngine to try to create a new 
> RpcConnection.  In HA configs it's common to have something like 10-20 
> failovers and a 20 second failover delay (no exponential backoff just yet).  
> This means that all of the functions with synchronous interfaces can still 
> block for many minutes after an operation has been canceled (10-20 failovers 
> at 20 seconds each is roughly 3-7 minutes), and often the cause is something 
> trivial like a bad config file.
> The current design makes canceling a pending connection tricky because the 
> FileHandles need to be individually cancelable via CancelOperations, but they 
> share the RpcEngine that does the async magic.
> Updated design:
> The original design would end up forcing lots of reconnects.  That's not a 
> huge issue on an unauthenticated cluster, but on a kerberized cluster it is a 
> recipe for Kerberos thinking we're attempting a replay attack.
> User-visible cancellation and internal resource cleanup are separable 
> issues.  The former can be implemented by atomically swapping the callback of 
> the operation to be canceled with a no-op callback.  The original callback is 
> then posted to the IoService with an OperationCanceled status and the user is 
> no longer blocked.  For RPC cancels this is sufficient: it's not expensive to 
> keep a request around a little bit longer, and when it's eventually invoked or 
> timed out it invokes the no-op callback and is ignored (other than a 
> trace-level log notification).  Connect cancels push a flag down into the RPC 
> engine to kill the connection and make sure it doesn't attempt to reconnect.
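
The callback swap in the updated design can be illustrated with a small, 
self-contained sketch.  This is not the code in the attached patch: 
TaskQueue, CancelableOperation, Status and Callback are hypothetical names 
made up for the example, TaskQueue merely stands in for the IoService, and 
a mutex is used for the "atomic" swap because std::function cannot be 
swapped lock-free.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// All names below are hypothetical stand-ins, not the libhdfs++ types.
enum class Status { kOk, kOperationCanceled };
using Callback = std::function<void(Status)>;

// Tiny task queue standing in for the IoService.
class TaskQueue {
 public:
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }

  // Runs queued tasks until the queue is empty.
  void RunAll() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // run outside the lock so a task may Post() more work
    }
  }

 private:
  std::mutex mutex_;
  std::deque<std::function<void()>> tasks_;
};

class CancelableOperation {
 public:
  CancelableOperation(TaskQueue* queue, Callback on_done)
      : queue_(queue), callback_(std::move(on_done)) {
    assert(queue_ != nullptr);
    assert(callback_ != nullptr);
  }

  // User-visible cancel: swap in a no-op, then post the original callback
  // with a canceled status so the waiting caller is released promptly.
  void Cancel() {
    Callback original = TakeCallback();
    queue_->Post([original]() { original(Status::kOperationCanceled); });
  }

  // Called when the underlying request finally completes or times out.
  // After a cancel this only reaches the no-op and is effectively ignored.
  void Complete(Status status) { TakeCallback()(status); }

 private:
  // Swaps the stored callback for a no-op under the lock and returns the
  // previous one, so the real callback can be taken at most once.
  Callback TakeCallback() {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::exchange(callback_, Callback([](Status) {}));
  }

  TaskQueue* queue_;
  std::mutex mutex_;
  Callback callback_;
};
```

In this sketch a late Complete() after Cancel() never reaches the user's 
callback, which mirrors the "invokes the no-op callback and is ignored" 
behavior in the description; resource cleanup is deliberately left out.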

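The connect-cancel flag from the description can be sketched in the same 
spirit.  The example below is a blocking simplification, whereas the real 
RpcEngine is asynchronous; CancelableConnector, ConnectResult and the 
try_connect parameter are hypothetical names, not libhdfs++ APIs.  It only 
shows the idea that a cancel flag checked before each attempt, plus an 
interruptible failover delay, keeps a canceled connect from waiting out 
every remaining retry.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical names for illustration only.
enum class ConnectResult { kConnected, kCanceled, kOutOfRetries };

class CancelableConnector {
 public:
  CancelableConnector(int max_attempts, std::chrono::milliseconds delay)
      : max_attempts_(max_attempts), delay_(delay) {
    assert(max_attempts_ > 0);
  }

  // Sets the cancel flag and wakes a thread waiting out a failover delay.
  void Cancel() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      canceled_ = true;
    }
    wake_.notify_all();
  }

  // Tries to connect up to max_attempts times, waiting `delay` between
  // failed attempts.  Stops without reconnecting once Cancel() is called.
  ConnectResult Connect(const std::function<bool()>& try_connect) {
    for (int attempt = 0; attempt < max_attempts_; ++attempt) {
      if (IsCanceled()) return ConnectResult::kCanceled;
      if (try_connect()) return ConnectResult::kConnected;

      // Interruptible delay: returns true as soon as canceled_ is set.
      std::unique_lock<std::mutex> lock(mutex_);
      if (wake_.wait_for(lock, delay_, [this] { return canceled_; })) {
        return ConnectResult::kCanceled;
      }
    }
    return ConnectResult::kOutOfRetries;
  }

 private:
  bool IsCanceled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return canceled_;
  }

  const int max_attempts_;
  const std::chrono::milliseconds delay_;
  std::mutex mutex_;
  std::condition_variable wake_;
  bool canceled_ = false;
};
```

With 20 attempts and a 20 second delay, an uncancelable loop of this shape 
could block for minutes; here a Cancel() from another thread (or from 
within an attempt) ends the wait at the next check.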


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
