[ https://issues.apache.org/jira/browse/HDFS-11028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Clampffer updated HDFS-11028:
-----------------------------------
    Attachment: HDFS-11028.HDFS-8707.002.patch

New patch, should be ready to review.
-Added C bindings that let the C style hdfsFS FileSystem be created and initialized without connecting so that the new hdfsCancelPendingConnection can be used to cancel it from another thread.
-Added a C example, runs under valgrind without leaks.

Testing has been done with the C++ and C examples as described in my previous comment: load a bad HA config so things hang, hit control-C, should exit immediately with operation canceled. Should be able to run it under valgrind as well without errors.

> libhdfs++: FileHandleImpl::CancelOperations needs to be able to cancel
> pending connections
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11028
>                 URL: https://issues.apache.org/jira/browse/HDFS-11028
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-11028.HDFS-8707.000.patch,
> HDFS-11028.HDFS-8707.001.patch, HDFS-11028.HDFS-8707.002.patch
>
>
> Cancel support is now reasonably robust except the case where a FileHandle
> operation ends up causing the RpcEngine to try to create a new RpcConnection.
> In HA configs it's common to have something like 10-20 failovers and a 20
> second failover delay (no exponential backoff just yet). This means that all
> of the functions with synchronous interfaces can still block for many minutes
> after an operation has been canceled, and often the cause of this is
> something trivial like a bad config file.
> The current design makes this sort of thing tricky to do because the
> FileHandles need to be individually cancelable via CancelOperations, but they
> share the RpcEngine that does the async magic.
> Updated design:
> Original design would end up forcing lots of reconnects.
> Not a huge issue on an unauthenticated cluster, but on a kerberized cluster
> this is a recipe for Kerberos thinking we're attempting a replay attack.
> User-visible cancellation and internal resource cleanup are separable
> issues. The former can be implemented by atomically swapping the callback of
> the operation to be canceled with a no-op callback. The original callback is
> then posted to the IoService with an OperationCanceled status and the user is
> no longer blocked. For RPC cancels this is sufficient: it's not expensive to
> keep a request around a little bit longer, and when it's eventually invoked or
> timed out it invokes the no-op callback and is ignored (other than a trace
> level log notification). Connect cancels push a flag down into the RPC
> engine to kill the connection and make sure it doesn't attempt to reconnect.
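
The updated design quoted above can be illustrated with a minimal, self-contained C++ sketch. This is not the code in the attached patch: Status, TaskQueue, PendingOperation, ConnectWithFailover and RunCancelDemo are hypothetical names invented for illustration, TaskQueue is only a stand-in for the IoService, and a mutex is assumed as the way to make the callback swap atomic.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's status type.
enum class Status { kOk, kOperationCanceled };
using Callback = std::function<void(Status)>;

// Hypothetical stand-in for the IoService: tasks are queued, then drained.
class TaskQueue {
 public:
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.push(std::move(task));
  }
  void RunAll() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock
    }
  }
 private:
  std::mutex mu_;
  std::queue<std::function<void()>> tasks_;
};

// One in-flight operation whose user callback can be swapped out.
class PendingOperation {
 public:
  PendingOperation(TaskQueue* queue, Callback cb)
      : queue_(queue), callback_(std::move(cb)) {
    assert(queue_ != nullptr);
  }
  // User-visible cancel: swap in a no-op, then post the original callback
  // with a canceled status so the caller is unblocked right away.
  void Cancel() {
    Callback original = TakeCallback();
    queue_->Post([original] { original(Status::kOperationCanceled); });
  }
  // Called when the request finally completes or times out. After a cancel
  // this reaches only the no-op, so the late result is ignored.
  void Complete(Status s) { TakeCallback()(s); }
 private:
  // Swap the stored callback with a no-op under the lock; return the old one.
  Callback TakeCallback() {
    std::lock_guard<std::mutex> lock(mu_);
    Callback original = std::move(callback_);
    callback_ = [](Status) {};
    return original;
  }
  TaskQueue* queue_;
  std::mutex mu_;
  Callback callback_;
};

// Connect-cancel half: check a flag before every attempt so a canceled
// connect never retries. Returns the number of attempts actually made.
int ConnectWithFailover(const std::atomic<bool>& cancel_requested,
                        int max_failovers,
                        const std::function<bool()>& try_connect) {
  int attempts = 0;
  while (attempts < max_failovers && !cancel_requested.load()) {
    ++attempts;
    if (try_connect()) break;  // connected
    // A real engine would wait out the failover delay here.
  }
  return attempts;
}

// Cancel first, then let the "late" completion arrive; returns every status
// the user callback observed.
std::vector<Status> RunCancelDemo() {
  TaskQueue queue;
  std::vector<Status> seen;
  PendingOperation op(&queue, [&seen](Status s) { seen.push_back(s); });
  op.Cancel();
  queue.RunAll();
  op.Complete(Status::kOk);
  return seen;
}
```

In this sketch the user callback fires exactly once, with the canceled status, and the later completion is swallowed by the no-op. The request itself is left in place, which matches the point above that RPC cancels need no immediate cleanup, while the connect path needs the explicit flag to stop further reconnect attempts.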