[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329449#comment-14329449
 ] 

ASF GitHub Bot commented on STORM-682:
--

GitHub user Parth-Brahmbhatt opened a pull request:

https://github.com/apache/storm/pull/437

STORM-682: supervisor should handle worker state corruption gracefully.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Parth-Brahmbhatt/incubator-storm STORM-682

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/437.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #437


commit afd8f81ba2650423184be3fcf6e00dd7c558acbe
Author: Parth Brahmbhatt 
Date:   2015-02-20T19:56:22Z

STORM-682: supervisor should handle worker state corruption gracefully.




> Supervisor local worker state corrupted and failing to start.
> -
>
> Key: STORM-682
> URL: https://issues.apache.org/jira/browse/STORM-682
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Parth Brahmbhatt
> Assignee: Parth Brahmbhatt
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files, the 
> supervisor's local state becomes corrupted. The only way to recover the 
> supervisor from this state is to delete the local state folder where the 
> supervisor stores all worker information. This workaround becomes very 
> cumbersome if the problem occurs on multiple worker nodes.
> The root cause is a mismatch between the order in which the worker heartbeat 
> versioned store files are created and the order in which they are deleted. 
> LocalState.put first creates a data file X and then marks success by creating 
> a file X.version. A get first lists all *.version files, finds the largest 
> value of X, and then issues a read against X. See the pseudocode below:
> {code:java}
> start_supervisor() {
>   workerIds = `ls local-state/workers`
>   for each workerId in workerIds {
>     versions = `ls local-state/workers/workerId/heartbeats/*.version`
>     latest_version = max(versions)
>     // Note: the data file name has no .version extension
>     read local-state/workers/workerId/heartbeats/latest_version
>   }
> }
> {code}
> During cleanup, the supervisor first tries to delete file X and then 
> X.version. If X is deleted but X.version fails to delete, the supervisor 
> fails to start with a FileNotFoundException in the code above. 
> We propose to change the deletion order so that the .version files are 
> deleted before the data files, and to catch any IOException when reading 
> worker heartbeats so that the supervisor does not fail.
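
The proposal above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the code in the pull request: the class HeartbeatStoreSketch and its methods put, delete and readLatest are hypothetical names, and the flat directory of X and X.version files is a simplification of the layout described in the issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the idea in STORM-682; not Storm's actual LocalState code.
final class HeartbeatStoreSketch {

    // Write the data file X first, then mark success by creating X.version.
    static void put(Path dir, long version, String data) throws IOException {
        Files.write(dir.resolve(Long.toString(version)), data.getBytes(StandardCharsets.UTF_8));
        Files.createFile(dir.resolve(version + ".version"));
    }

    // Proposed order: remove the marker first, so a partial failure can only
    // leave an unreferenced data file, never a marker that points at nothing.
    static void delete(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + ".version"));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }

    // Returns the newest heartbeat, or null if there is none or it cannot be read.
    static String readLatest(Path dir) {
        try {
            long latest = -1;
            try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*.version")) {
                for (Path marker : markers) {
                    String name = marker.getFileName().toString();
                    String number = name.substring(0, name.length() - ".version".length());
                    latest = Math.max(latest, Long.parseLong(number));
                }
            }
            if (latest < 0) {
                return null; // no committed version at all
            }
            byte[] bytes = Files.readAllBytes(dir.resolve(Long.toString(latest)));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Log and explicitly return null instead of letting startup fail.
            System.err.println("Could not read heartbeat in " + dir + ": " + e);
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heartbeats");
        put(dir, 1, "hb-1");
        put(dir, 2, "hb-2");
        System.out.println(readLatest(dir)); // hb-2

        // Simulate the old failure mode: data file gone, marker left behind.
        Files.delete(dir.resolve("2"));
        System.out.println(readLatest(dir)); // null, with a warning on stderr

        delete(dir, 2); // marker-first cleanup removes the stale marker
        System.out.println(readLatest(dir)); // hb-1
    }
}
```

With the marker deleted first, an interrupted cleanup leaves at worst an orphaned data file that no reader looks for. The catch block covers stale markers left behind by earlier runs.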



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329997#comment-14329997
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user nathanmarz commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-75356387
  
-1

Please change the catch block to explicitly return nil. Don't depend on 
log-warn to do that for you.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-02-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333603#comment-14333603
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user Parth-Brahmbhatt commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-75597555
  
@nathanmarz done.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343386#comment-14343386
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user kishorvpatil commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-76751433
  
Looks good. +1




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344522#comment-14344522
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user harshach commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-76887513
  
+1. @nathanmarz can you review the latest patch? Thanks.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349012#comment-14349012
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-77394741
  
+1




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349014#comment-14349014
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user ptgoetz commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-77394892
  
We may want to consider applying this to the 0.9.x branch as well.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352173#comment-14352173
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user harshach commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-77764874
  
@nathanmarz we need this PR for the 0.9.4 release. Please take a look at the 
latest patch.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353369#comment-14353369
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user Parth-Brahmbhatt commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-77916212
  
@ptgoetz I will create a new JIRA for the 0.9.x branch and apply the fix.
@nathanmarz given that this is blocking the release and you have a -1 vote, I 
would really appreciate it if you could take some time to review this again.




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353401#comment-14353401
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user nathanmarz commented on the pull request:

https://github.com/apache/storm/pull/437#issuecomment-77919851
  
+1




[jira] [Commented] (STORM-682) Supervisor local worker state corrupted and failing to start.

2015-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359579#comment-14359579
 ] 

ASF GitHub Bot commented on STORM-682:
--

Github user asfgit closed the pull request at:

https://github.com/apache/storm/pull/437

