Github user HeartSaVioR commented on the issue:
https://github.com/apache/storm/pull/1574
@revans2
Also, the local BlobStore should be designed to achieve high availability just the
same as the HDFS BlobStore. But the process the BlobStore sits behind is Nimbus,
which is designed to fail fast, and I think those two designs don't fit together.
For example, take the scenario I addressed in STORM-1977. I tested with the
steps I described there:
1. Comment out cleanup-corrupt-topologies! in nimbus.clj (a quick
workaround for STORM-1976), and patch the Storm cluster
2. Launch Nimbus 1 (leader)
3. Run topology1
4. Kill Nimbus 1
5. Launch Nimbus 2 on a different node
Without a condition for granting leadership, Nimbus 2 can gain
leadership and act as leader. From the BlobStore's point of view this is not a
blocker: the replication count for topology1 is 0, but that doesn't crash
Nimbus 2, and reviving Nimbus 1 should eventually replicate topology1 to Nimbus 2.
The thing is, the leader Nimbus still has to do its normal work as Nimbus. In
this case just requesting getClusterInfo can crash Nimbus 2; Nimbus 1 then
comes back and gains leadership, but the replication count for topology1 stays
at 1 until Nimbus 2 rejoins.
With a condition for granting leadership, Nimbus 2 gives up
leadership and keeps waiting for a new leader (there is no leader at that
time). Then Nimbus 1 comes back, and topology1 is eventually replicated to
Nimbus 2, which restores the replication count.
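To make the condition concrete, here is a minimal sketch (not actual Storm code; the class and method names are hypothetical) of the check I have in mind: a Nimbus only accepts leadership when its local BlobStore already holds the blobs for every active topology, so an out-of-date Nimbus declines instead of becoming a crash-prone leader.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: gate leadership acceptance on local blob availability.
public class LeadershipGuard {
    private final Set<String> localBlobKeys;      // blob keys present in the local BlobStore
    private final Set<String> activeTopologyKeys; // blob keys required by active topologies

    public LeadershipGuard(Set<String> localBlobKeys, Set<String> activeTopologyKeys) {
        this.localBlobKeys = localBlobKeys;
        this.activeTopologyKeys = activeTopologyKeys;
    }

    // Accept leadership only if every required blob is available locally;
    // otherwise give up leadership and wait for replication to catch up.
    public boolean canAcceptLeadership() {
        Set<String> missing = new HashSet<>(activeTopologyKeys);
        missing.removeAll(localBlobKeys);
        return missing.isEmpty();
    }
}
```

In the scenario above, Nimbus 2 would fail this check (topology1's blobs are not yet local), stay a non-leader, and only become eligible once Nimbus 1 has replicated the blobs to it.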
Due to this behavior, the crash-and-recovery outcome depends heavily on the
order in which the Nimbuses are launched. I don't think that is a good UX.