[ https://issues.apache.org/jira/browse/HDFS-11028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Clampffer updated HDFS-11028:
-----------------------------------
    Attachment: HDFS-11028.HDFS-8707.000.patch

Posting cancel mega-patch.  Plenty of work to do, but wanted to put it up in case anyone was curious (will be away for 1-2 weeks).  The bulk of the code being changed is just forwarding handles down to the async operations so they can be bound.

Includes:
-connection level cancels (tested with included cancelable connect tool)
-individual RPC level cancels (not well tested)
-exposing all cancels in the C API

I'm going to try to split connect cancels from RPC cancels since they're similar but mostly separate code paths.  RPC cancels will get a new jira.  The C bindings will just be ripped out for now since I haven't tested them at all and they aren't a priority for my work yet.

> libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel
> pending connections
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11028
>                 URL: https://issues.apache.org/jira/browse/HDFS-11028
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-11028.HDFS-8707.000.patch
>
>
> Cancel support is now reasonably robust except the case where a FileHandle
> operation ends up causing the RpcEngine to try to create a new RpcConnection.
> In HA configs it's common to have something like 10-20 failovers and a 20
> second failover delay (no exponential backoff just yet).  This means that all
> of the functions with synchronous interfaces can still block for many minutes
> after an operation has been canceled, and often the cause of this is
> something trivial like a bad config file.
> The current design makes this sort of thing tricky to do because the
> FileHandles need to be individually cancelable via CancelOperations, but they
> share the RpcEngine that does the async magic.
> Updated design:
> The original design would end up forcing lots of reconnects.  Not a huge
> issue on an unauthenticated cluster, but on a kerberized cluster this is a
> recipe for Kerberos thinking we're attempting a replay attack.
> User-visible cancellation and internal resource cleanup are separable
> issues.  The former can be implemented by atomically swapping the callback of
> the operation to be canceled with a no-op callback.  The original callback is
> then posted to the IoService with an OperationCanceled status and the user is
> no longer blocked.  For RPC cancels this is sufficient: it's not expensive to
> keep a request around a little bit longer, and when it's eventually invoked or
> timed out it invokes the no-op callback and is ignored (other than a trace
> level log notification).  Connect cancels push a flag down into the RPC
> engine to kill the connection and make sure it doesn't attempt to reconnect.
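
The callback swap described in the updated design could be sketched roughly as follows.  This is not code from the attached patch: FakeIoService, CancelableOp, Status and every method name are hypothetical stand-ins, and a mutex is assumed as the way to make the swap atomic.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical status codes; the real library has its own Status type.
enum class Status { kOk, kOperationCanceled };

using Callback = std::function<void(Status)>;

// Hypothetical stand-in for the IoService: tasks are queued and only run
// when RunAll() is called, which mimics posting work to an event loop.
class FakeIoService {
 public:
  void Post(std::function<void()> task) { tasks_.push(std::move(task)); }
  void RunAll() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// One in-flight operation whose user callback can be detached.
class CancelableOp {
 public:
  CancelableOp(FakeIoService* io, Callback cb)
      : io_(io), callback_(std::move(cb)) {
    assert(io_ != nullptr);
  }

  // User-visible cancel: swap in a no-op, then post the original callback
  // with a canceled status so the caller is unblocked right away.
  // Returns false if the callback was already detached.
  bool Cancel() {
    Callback original;
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (detached_) return false;
      detached_ = true;
      original = std::exchange(callback_, NoOp());
    }
    io_->Post([original]() { original(Status::kOperationCanceled); });
    return true;
  }

  // Called by the engine when the request finishes or times out.  After a
  // cancel this only reaches the no-op, so the late result is ignored
  // (a real implementation might emit a trace level log here).
  void Complete(Status status) {
    Callback current;
    {
      std::lock_guard<std::mutex> lock(mu_);
      detached_ = true;
      current = std::exchange(callback_, NoOp());
    }
    current(status);
  }

 private:
  static Callback NoOp() { return [](Status) {}; }

  FakeIoService* io_;
  std::mutex mu_;
  bool detached_ = false;
  Callback callback_;
};
```

In this sketch a canceled request stays alive until the engine eventually calls Complete(), which matches the point that keeping it around a little longer is cheap.  Connect cancels would need more than this, since the design also pushes a flag into the RPC engine to stop reconnect attempts.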