HA for the blobstore was set up so that ZK would hold the source of truth and
the other nimbus nodes would be eventually consistent with each other. I'm
not totally sure of the issue, because I can't tell whether this is happening
in the context of a follower trying to keep up to date, or of a leader
writing the data. If it is the latter, we need better fencing to prevent
multiple leaders from trying to write to the DB at the same time. If it is
the former, we need better code so the follower can read/update the replica
it has without the possibility of trying to auto-vivify a node that is being
deleted. Generally in these cases we would declare the race safe and just
start the sync process over again.
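Roughly, the retry shape would look something like this (just a sketch
against the Curator API, with a made-up helper name and path, not the real
sync code):

import java.util.Collections;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

class SafeRaceSketch {
    // "Declare the race safe and retry": if the node vanishes between the
    // existence check and the read, treat it as a lost race and start the
    // read over instead of letting the exception propagate.
    static List<String> readKeyState(CuratorFramework zk, String path) throws Exception {
        while (true) {
            try {
                if (zk.checkExists().forPath(path) == null) {
                    return Collections.emptyList(); // key was deleted; nothing to sync
                }
                // The node can still be deleted between checkExists and getChildren.
                return zk.getChildren().forPath(path);
            } catch (KeeperException.NoNodeException e) {
                // Lost the race with a concurrent delete; loop and re-check.
            }
        }
    }
}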


- Bobby

On Monday, January 23, 2017, 2:12:18 AM CST, Jungtaek Lim <[email protected]>
wrote:

Hi devs,

I've been struggling to resolve a specific scenario, and found that the local
BlobStore takes care of Nimbus failure scenarios, but not of removing keys.

For example, I heard that Nimbus crashed in a specific scenario, and the
error stack trace pointed to the code below:
https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L138-L149

checkExists (L138
<https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L138>)
succeeds, but getChildren (L149
<https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L149>)
throws NoNodeException, and as a result sequenceNumbers.last() throws
NoSuchElementException.
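In other words, this is a plain check-then-act race between two Nimbus
nodes. A minimal illustration (hypothetical key path and plain Curator
calls, not the actual KeySequenceNumber code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BlobKeyRace {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        String path = "/storm/blobstore/some-key"; // hypothetical path

        // Nimbus A: the existence check passes...
        if (zk.checkExists().forPath(path) != null) {
            // ...but another Nimbus can delete the key right here, between
            // the two calls, e.g.:
            //   zk.delete().deletingChildrenIfNeeded().forPath(path);
            // ...so this call can throw KeeperException.NoNodeException,
            // matching the stack trace above.
            zk.getChildren().forPath(path);
        }
        zk.close();
    }
}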

We could dig in deeply and make some workarounds, but given that ZK is
accessible from every Nimbus, we can't ensure every path is safe.

I guess that BlobStore needs a global lock or a single controller to handle
all of this correctly. I'm also open to workarounds or other ideas.
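If we go the global-lock route, Curator's InterProcessMutex looks like the
natural building block. A rough sketch, with a hypothetical lock path shared
by all Nimbus nodes:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

class BlobStoreLockSketch {
    // Serialize all blobstore key mutations across Nimbus nodes with a
    // single ZK-backed mutex (the lock path is made up, not an existing
    // Storm znode).
    static void updateKeyUnderLock(CuratorFramework zk, Runnable mutation) throws Exception {
        InterProcessMutex lock = new InterProcessMutex(zk, "/storm/blobstore-lock");
        lock.acquire();
        try {
            // Create/read/delete the key's sequence-number znodes here,
            // with no other Nimbus able to race the check-then-act.
            mutation.run();
        } finally {
            lock.release();
        }
    }
}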

What do you think?

Thanks,
Jungtaek Lim (HeartSaVioR)
