[GitHub] storm issue #414: STORM-634: Storm serialization changed to thrift to suppor...

revans2 Mon, 18 Jul 2016 06:43:41 -0700

Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/414
  
    @danny0405 as @HeartSaVioR said, it depends on the versions you are 
upgrading between.  Most of the time we have maintained wire and binary 
compatibility, so you can do the upgrade piecemeal. This should work between 
versions of storm that have the same major version number: 1.0.0 to 1.1.0, or 
1.1.0 to 1.1.2, but not 0.10.x to 1.0.0.
    
    The procedure that we follow when doing an upgrade is:
    
    1) shut down and upgrade nimbus (we are not currently running HA, but if we 
were, step 1.b would be to upgrade the other nimbus instances one at a time)
    2) pick a single node that is not upgraded yet.
    2.b) install the new version of storm on the node.
    2.c) shoot all the storm processes: supervisor, logviewer, and workers
    2.d) clear out all of the state on the node (NOT needed every time, but we 
are cautious because of bugs in the past)
    2.e) relaunch the supervisor and logviewer.
    3) repeat until all of the nodes are done.
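
    A minimal sketch of the rolling procedure above, written as an 
orchestration loop. Everything named here is an assumption for illustration: 
the step names are labels rather than Storm commands, and `run` is a 
hypothetical callback standing in for whatever your tooling uses to act on a 
host (ssh, a config management system, and so on). `batch_size` is likewise an 
added knob for handling more than one node per round.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical step names mirroring the list above. They are labels for your
# own tooling, not Storm CLI commands.
NIMBUS_STEP = "shutdown_and_upgrade_nimbus"        # step 1
NODE_STEPS = (
    "install_new_storm",                           # step 2.b
    "kill_supervisor_logviewer_and_workers",       # step 2.c
    "clear_node_state",                            # step 2.d (optional)
    "relaunch_supervisor_and_logviewer",           # step 2.e
)

Runner = Callable[[str, str], None]  # (host, step) -> None


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def rolling_upgrade(nimbus: str, nodes: Sequence[str], run: Runner,
                    batch_size: int = 1, clear_state: bool = True) -> None:
    """Upgrade nimbus first, then walk the nodes one batch at a time."""
    run(nimbus, NIMBUS_STEP)
    for batch in batches(nodes, batch_size):
        # Apply each step to the whole batch before moving to the next step.
        for step in NODE_STEPS:
            if step == "clear_node_state" and not clear_state:
                continue
            for host in batch:
                run(host, step)
```

    Passing a `run` that only records its arguments gives a dry run of the 
plan before anything is touched on a real cluster.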
    
    For our large clusters we actually do a few nodes at a time, not one. This 
procedure does have a few issues.  The biggest issue is churn in the worker 
processes.  We try to avoid doing the upgrade often because it is not truly 
transparent to all topologies.  They recover, but every one of their worker 
processes has been shot at least once, and possibly multiple times.  This can 
cause data issues in non-trident topologies, and can slow down processing in 
trident.
    
    I would recommend that you do it a little differently, and this is what we 
want to move to.
    
    For each node, with as much parallelism as possible, install the new 
version of storm, then shoot the supervisor and the logviewer.  Wait for them 
all to come back up, or at least enough of them that you feel good about it.
    
    Then, again with as much parallelism as possible, shoot all of the worker 
processes on all of the nodes.
    
    This still has the disadvantage that all of the worker processes are shot, 
which slows things down, but they are guaranteed to be shot only once, and the 
recovery time should be much faster, because the supervisor relaunches them 
quickly instead of possibly having nimbus time them out and reschedule them on 
a node that has not been upgraded yet.
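
    The recommended two-phase approach could be sketched as follows. This is 
an illustration under stated assumptions, not a tested upgrade tool: the step 
names are hypothetical labels, `run` is a hypothetical callback that performs 
a step on a host and returns when it is done, and `max_parallel` is an added 
parameter that caps how many hosts are handled at once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

Runner = Callable[[str, str], None]  # (host, step) -> None, hypothetical


def in_parallel(hosts, action, max_parallel):
    """Run action(host) for every host and return once all have finished."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # list() drains the iterator so any exception from a host is re-raised.
        list(pool.map(action, hosts))


def two_phase_upgrade(nodes: Iterable[str], run: Runner,
                      max_parallel: int = 8) -> None:
    hosts = list(nodes)

    def install_and_restart_daemons(host: str) -> None:
        run(host, "install_new_storm")
        run(host, "kill_supervisor_and_logviewer")

    # Phase 1: new version plus daemon restart everywhere, workers untouched.
    in_parallel(hosts, install_and_restart_daemons, max_parallel)
    # Wait until the daemons are back before touching any worker.
    in_parallel(hosts,
                lambda h: run(h, "wait_for_supervisor_and_logviewer"),
                max_parallel)
    # Phase 2: kill every worker exactly once; the already upgraded
    # supervisors relaunch them on the new version.
    in_parallel(hosts, lambda h: run(h, "kill_workers"), max_parallel)
```

    Each phase finishes on every host before the next one starts, so no 
worker is killed until all supervisors are running the new version.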



