[ https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343386#comment-14343386 ]
ASF GitHub Bot commented on STORM-682:
--------------------------------------

Github user kishorvpatil commented on the pull request:

    https://github.com/apache/storm/pull/437#issuecomment-76751433

    Looks good. +1

> Supervisor local worker state corrupted and failing to start.
> -------------------------------------------------------------
>
>                 Key: STORM-682
>                 URL: https://issues.apache.org/jira/browse/STORM-682
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Parth Brahmbhatt
>            Assignee: Parth Brahmbhatt
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files,
> the supervisor's local state gets corrupted. The only way to recover the
> supervisor from this state is to delete the local state folder where the
> supervisor stores all worker information. This fix can get very cumbersome
> if it happens on multiple worker nodes.
>
> The root cause of the issue is the order in which worker heartbeat versioned
> store files are created vs. the order in which they are deleted.
> LocalState.put first creates a data file X and then marks success by
> creating a file X.version. LocalState.get first lists all *.version files,
> finds the largest value of X, and then issues a read against X. See the
> pseudocode below:
> {code:java}
> start_supervisor() {
>   workerIds = `ls local-state/workers`
>   for each workerId in workerIds {
>     versions = `ls local-state/workers/workerId/heartbeats/*.version`
>     latest_version = max(versions)
>     // note: the data file has no .version extension
>     read local-state/workers/workerId/heartbeats/latest_version
>   }
> }
> {code}
> Cleanup first tries to delete file X and then X.version. If X gets deleted
> but X.version fails to delete, the supervisor fails to start with a
> FileNotFoundException in the code above.
>
> We propose to change the deletion order so that the .version files get
> deleted before the data files, and to catch any IOException when reading
> worker heartbeats so that a bad heartbeat no longer fails the supervisor.
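
For readers following the thread, the proposed ordering can be sketched roughly as below. This is only an illustration of the idea described in the issue, not the code in the pull request: the class `HeartbeatStoreSketch` and its methods `put`, `delete`, and `readLatest` are hypothetical names and are not Storm's LocalState API. The sketch assumes version names are plain integers, and catching NumberFormatException alongside IOException is an extra assumption beyond what the issue proposes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch; names are illustrative and not taken from Storm.
class HeartbeatStoreSketch {
    static final String VERSION_SUFFIX = ".version";

    // Write order described in the issue: data file X first, then the X.version marker.
    static void put(Path dir, long version, byte[] data) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(Long.toString(version)), data);
        Files.createFile(dir.resolve(version + VERSION_SUFFIX));
    }

    // Proposed delete order: marker first, then data. If the second delete
    // fails, an orphaned data file remains, but no marker points at a missing file.
    static void delete(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + VERSION_SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    // Finds the largest X among the X.version markers and reads data file X.
    // Returns empty instead of propagating the error when the read fails.
    static Optional<byte[]> readLatest(Path dir) {
        try (DirectoryStream<Path> markers =
                 Files.newDirectoryStream(dir, "*" + VERSION_SUFFIX)) {
            long latest = -1;
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                String number = name.substring(0, name.length() - VERSION_SUFFIX.length());
                latest = Math.max(latest, Long.parseLong(number));
            }
            if (latest < 0) {
                return Optional.empty(); // no committed version yet
            }
            return Optional.of(Files.readAllBytes(dir.resolve(Long.toString(latest))));
        } catch (IOException | NumberFormatException e) {
            // Assumption: a missing or unreadable heartbeat is treated as "no heartbeat".
            return Optional.empty();
        }
    }
}
```

With the old delete order, a leftover marker whose data file is gone makes the read throw. In this sketch the same situation yields an empty result, and the caller can skip that worker and continue starting up.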
-- This message was sent by Atlassian JIRA (v6.3.4#6332)