Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/414

@danny0405 like @HeartSaVioR said, it depends on the versions you are upgrading between. Most of the time we have maintained wire and binary compatibility, so you can do the upgrade piecemeal. This should work between versions of Storm that have the same major version number: 1.0.0 to 1.1.0, or 1.1.0 to 1.1.2, but not 0.10.x to 1.0.0.

The procedure that we follow when doing an upgrade is:

1) Shut down and upgrade nimbus (we are not currently running HA, but if we were, step 1.b would be to upgrade the other nimbus instances one at a time).
2) Pick a single node that is not upgraded yet.
2.b) Install the new version of Storm on the node.
2.c) Shoot all the Storm processes: supervisor, logviewer, and workers.
2.d) Clear out all of the state on the node (NOT needed every time, but we are cautious because of bugs in the past).
2.e) Relaunch the supervisor and logviewer.
3) Repeat until all of the nodes are done.

For our large clusters we actually do a few nodes at a time, not one.

This procedure does have a few issues. The biggest one is churn in the worker processes. We try to avoid doing the upgrade often because it is not truly transparent to all topologies. They recover, but every one of their worker processes has been shot at least once, and possibly multiple times. This can cause data issues in non-trident topologies, and can slow down processing in trident.

I would recommend that you do it a little differently, and this is what we want to move to. For each node, in parallel as much as possible, install the new version of Storm, then shoot the supervisor and the logviewer. Wait for them all to come back up, or at least enough of them that you feel good about it. Then, again as much in parallel as possible, shoot all of the worker processes on all of the nodes.
This still has the disadvantage that all of the worker processes get shot, which slows things down, but each one is guaranteed to be shot only once, and the recovery time should be much faster. The supervisor relaunches them quickly, instead of nimbus possibly timing them out and rescheduling them on a node that has not been upgraded yet.
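
To make the ordering of the recommended approach concrete, here is a minimal dry-run sketch in Python. It only builds a list of steps and runs nothing against a cluster. The function names, the action strings, and the node names are all hypothetical and are not part of Storm or its tooling. The version check is just the same-major-version rule of thumb described above, not a guarantee of compatibility.

```python
from typing import List, Tuple

# (phase, node, action). All labels here are illustrative, not Storm commands.
Step = Tuple[str, str, str]


def can_upgrade_piecemeal(old: str, new: str) -> bool:
    """Rule of thumb from the comment: same major version number."""
    return old.split(".")[0] == new.split(".")[0]


def plan_recommended_upgrade(nodes: List[str]) -> List[Step]:
    """Build the two-phase plan: daemons on every node first, workers last."""
    plan: List[Step] = []

    # Phase 1: install the new version and restart only the daemons.
    # Worker processes are left running during this phase.
    for node in nodes:
        plan.append(("daemons", node, "install new storm version"))
        plan.append(("daemons", node, "shoot supervisor and logviewer"))

    # Barrier: do not touch any worker until the daemons are back.
    plan.append(("barrier", "*", "wait for supervisors and logviewers to come back up"))

    # Phase 2: each node's workers are shot exactly once, and the already
    # upgraded supervisor on that node relaunches them.
    for node in nodes:
        plan.append(("workers", node, "shoot all worker processes"))

    return plan


if __name__ == "__main__":
    if can_upgrade_piecemeal("1.0.0", "1.1.0"):
        for phase, node, action in plan_recommended_upgrade(["node-a", "node-b"]):
            print(f"[{phase}] {node}: {action}")
```

The steps inside each phase are independent across nodes, so they can be run in parallel. The only ordering that matters is that no worker is shot before the barrier.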