Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/1574
  
    @revans2 
    I've been thinking about the responsibilities of Nimbus.
    
    - Nimbus was a "soft" SPOF: we claimed that Nimbus is designed to fail fast and be stateless, so simply supervising the Nimbus process works like a charm. But that doesn't help with machine failure, and moving Nimbus to another machine requires at least a configuration change across the whole cluster. (This assumes the Supervisor is also watched by an auxiliary process; if not, restarting the Supervisor must be done manually.)
    - Nimbus H/A came in. It was relatively easier here than in other projects, since Nimbus is designed to be stateless, so there is almost nothing to sync. The only thing Nimbuses need to sync is topology code, and Nimbus H/A addressed this with full replication plus a restriction on which Nimbus may become leader. Replicating topology code to every Nimbus added some overhead, but it was the best effort toward higher availability. If every Nimbus except the single "leader" crashed, that was completely fine at that moment. There was still a chance that none of the live Nimbuses held the complete set of topology code, leaving no eligible leader and hanging the cluster, but that risk was relatively small compared to tracking a replication count, since replication was always full.
    - BlobStore came in. I don't know the details of the BlobStore, so it's hard for me to tell. I'd be happy if you could fill out this part: after BlobStore.
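    To make the first point concrete: relocating Nimbus means updating the Nimbus address in `storm.yaml` on every node. A minimal sketch, assuming a Storm 1.x-style `nimbus.seeds` setting and hypothetical hostnames:

    ```yaml
    # storm.yaml on every node in the cluster (hostnames are hypothetical).
    # With Nimbus H/A, nimbus.seeds lists all candidate Nimbus hosts;
    # moving or adding a Nimbus means editing this list cluster-wide.
    nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]
    ```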
    
    One thing I'm concerned about: there's now a new requirement that Nimbus not crash easily, since every Nimbus is also a replica of the BlobStore, much like a DataNode. But Nimbus itself has a lot of work to do (certainly the leader; I'm not sure about followers) and is still built on fail-fast. Do those two requirements play well together?

