GitHub user danny0405 opened a pull request:
https://github.com/apache/storm/pull/2493
[STORM-2879] Supervisor collapses continuously when there is an expired
assignment for an overdue storm
The supervisor does not clean up the local files of an overdue storm
transactionally. If an exception occurs while deleting storm-code/ser/jar, an
overdue local assignment is left on disk.
Then, when the supervisor restarts after that exception, the slots are
initialized and the containers are recovered from LocalAssignments. The blob
store tries to fetch the files from Nimbus/Master but gets a
KeyNotFoundException, and the supervisor collapses again.
So let's set the current assignment to null when we first initialize a Slot
and recover the container from the local assignment. That lets the supervisor
clean up the local assignment, and Nimbus/Master will eventually reassign it.
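
The idea can be illustrated with a small, self-contained Java model. This is
only a sketch under stated assumptions, not the code in this patch: the
classes Slot, BlobStore and KeyNotFoundException below are simplified
stand-ins written for this example, and the trustLocalAssignment flag is a
hypothetical switch used to contrast the old and the proposed behaviour.

```java
import java.util.Collections;
import java.util.Set;

// Simplified model only; none of these classes are Storm's real ones.
class SlotRecoverySketch {

    /** Stand-in for the error raised when a topology's blobs no longer exist. */
    static class KeyNotFoundException extends RuntimeException {
        KeyNotFoundException(String key) {
            super("blob key not found: " + key);
        }
    }

    /** Hypothetical blob store that only knows topologies still alive on Nimbus. */
    static class BlobStore {
        private final Set<String> liveTopologies;

        BlobStore(Set<String> liveTopologies) {
            this.liveTopologies = liveTopologies;
        }

        void fetch(String topologyId) {
            if (!liveTopologies.contains(topologyId)) {
                throw new KeyNotFoundException(topologyId);
            }
        }
    }

    /** Hypothetical slot restored from an assignment left on local disk. */
    static class Slot {
        String recoveredContainer;   // container recovered from the local assignment
        String currentAssignment;    // what the slot believes it is running
        boolean localStateCleaned = false;

        Slot(String localAssignmentOnDisk, boolean trustLocalAssignment) {
            this.recoveredContainer = localAssignmentOnDisk;
            // Old behaviour: trust the on-disk assignment.
            // Proposed behaviour: start with null so the stale state gets cleaned up.
            this.currentAssignment = trustLocalAssignment ? localAssignmentOnDisk : null;
        }

        /** One pass of the slot loop, given the assignment Nimbus currently reports. */
        void step(BlobStore blobs, String assignmentFromNimbus) {
            if (currentAssignment != null) {
                // Throws for an overdue topology, which models the supervisor crash.
                blobs.fetch(currentAssignment);
                return;
            }
            if (recoveredContainer != null) {
                // No current assignment: discard the recovered container and its files.
                recoveredContainer = null;
                localStateCleaned = true;
            }
            if (assignmentFromNimbus != null) {
                blobs.fetch(assignmentFromNimbus);
                currentAssignment = assignmentFromNimbus;
            }
        }
    }

    public static void main(String[] args) {
        BlobStore blobs = new BlobStore(Collections.singleton("topo-live"));

        Slot oldBehaviour = new Slot("topo-overdue", true);
        try {
            oldBehaviour.step(blobs, null);
        } catch (KeyNotFoundException e) {
            System.out.println("old behaviour fails: " + e.getMessage());
        }

        Slot proposed = new Slot("topo-overdue", false);
        proposed.step(blobs, null);          // cleans up the stale local state
        proposed.step(blobs, "topo-live");   // later accepts a fresh assignment
        System.out.println("proposed assignment: " + proposed.currentAssignment);
    }
}
```

In this model the slot that trusts the on-disk assignment fails on every
restart, while the slot that starts with a null assignment drops the stale
container first and only fetches blobs for an assignment that Nimbus still
reports.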
This is the JIRA issue
[STORM-2879](https://issues.apache.org/jira/browse/STORM-2879).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/danny0405/storm slot-bug-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/2493.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2493
----
commit 9655d0dc8c4f01e17edc3ff823cf7446dbc9930e
Author: chenyuzhao <chenyuzhao@...>
Date: 2018-01-03T07:31:38Z
fix STORM-2879
----
---