[ 
https://issues.apache.org/jira/browse/HDFS-13112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358820#comment-16358820
 ] 

Daryn Sharp commented on HDFS-13112:
------------------------------------

Ok, state transitions hold the write lock when stopping the secret manager:
 # I need to acquire the lock interruptibly to avoid the deadlock.
 # The write lock's interruptible acquire was exposed but the read lock's was 
not, so I added that.
 # The noInterruptsLock technically isn't necessary anymore if the caller 
stopping the secret manager holds the write lock, but per comments I left it 
there for safety.
 # The methods no longer throw InterruptedIOException, but leave or set the 
interrupt flag if interrupted.  Why?
 ** The abstract secret manager's master key roll currently catches 
IOExceptions and plows ahead, expecting the while (!done) to exit cleanly.  It 
survives the interrupt, but can cause expiry to crash.
 ** Expiring a token does not catch exceptions, so an interrupt is currently 
fatal.  Not good.
 ** The run loop's sleep catches interrupted exceptions, allowing it to reach 
the while (!done).
 ** So why do I leave the interrupt set instead of throwing?  Less risky to 
avoid changing the abstract secret manager.  Rolling a key will catch the 
interrupt.  If it also decides to expire tokens in the same cycle, it will try 
to acquire the read lock again, and deadlock again.  Leaving the thread 
interrupted prevents that and allows the run loop to hit the exit condition.
 ** Went ahead and locked individually per token, on the off chance there's a 
glut of tokens to expire and edit logging is being slow (think QJM).
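
A minimal sketch of the pattern described in the list above, not the actual patch: take the read lock interruptibly once per token, and on interrupt restore the thread's interrupt flag instead of throwing, so the caller's loop can reach its exit condition. The class, field, and method names are hypothetical, and a plain ReentrantReadWriteLock stands in for the namesystem lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; none of these names come from Hadoop.
class InterruptibleExpirySketch {
    // Stand-in for the namesystem lock (assumption).
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
    // Stand-in for the edit log: records which token expirations were logged.
    private final List<String> logged = new ArrayList<>();

    /**
     * Logs one token expiration under the read lock.
     * Returns false, with the interrupt flag set, if the thread was interrupted
     * while (or before) waiting for the lock. Nothing is thrown.
     */
    boolean logExpireToken(String tokenId) {
        try {
            // Interruptible acquire: a thread holding the write lock can
            // interrupt us instead of deadlocking against us.
            fsnLock.readLock().lockInterruptibly();
        } catch (InterruptedException e) {
            // lockInterruptibly() cleared the flag; set it again so later
            // lock attempts and sleeps in the same cycle also bail out.
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            logged.add(tokenId);
        } finally {
            fsnLock.readLock().unlock();
        }
        return true;
    }

    /** Expires tokens one at a time, locking per token rather than per batch. */
    int expireAll(List<String> tokenIds) {
        int count = 0;
        for (String id : tokenIds) {
            if (!logExpireToken(id)) {
                break; // interrupted: stop and let the run loop exit
            }
            count++;
        }
        return count;
    }

    List<String> logged() {
        return logged;
    }
}
```

Locking per token keeps each hold of the read lock short, so a writer waiting on the lock is not blocked behind a long batch when logging is slow.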

> Token expiration edits may cause log corruption or deadlock
> -----------------------------------------------------------
>
>                 Key: HDFS-13112
>                 URL: https://issues.apache.org/jira/browse/HDFS-13112
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.1.0-beta, 0.23.8
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-13112.1.patch, HDFS-13112.patch
>
>
> HDFS-4477 specifically did not acquire the fsn lock during token cancellation 
> based on the belief that edit logs are thread-safe.  However, log rolling is 
> not thread-safe.  Failure to externally synchronize on the fsn lock during a 
> roll will cause problems.
> For sync edit logging, it may cause corruption by interspersing edits with 
> the end/start segment edits.  Async edit logging may encounter a deadlock if 
> the log queue overflows.  Luckily, losing the race is extremely rare.  In ~5 
> years, we've never encountered it.  However, HDFS-13051 lost the race with 
> async edits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
