More details: there were three Nimbuses, N1, N2, and N3, and N1 was the leader.
We submitted topology T1 and restarted N1 after the submission. N2 became
leader and we killed T1. While N1 was initializing and syncing up its
topology blobs, N2 concurrently removed the ZK path and also the max sequence
number path for the topology blob as part of killing the topology. This race
condition only occurs with Local BlobStore, since the ZK path is removed
only if Nimbus is using Local BlobStore.

So it's the former case, and stopping the current sync phase and restarting
sync is an ideal approach since we only guarantee eventual consistency.
I'll take a look at the codebase to see how we can apply this, but it would be
a great help if someone familiar with the BlobStore codebase were
willing to handle it.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Jan 23, 2017 at 11:33 PM, Bobby Evans <[email protected]> wrote:

HA for the blobstore was set up so that ZK would hold the source of truth
and then the other nimbus nodes would be eventually consistent with each
other.  I'm not totally sure of the issue, because I don't understand if
this is happening in the context of a follower trying to keep up to date,
or with a leader entering the data.  If it is the latter we need some
better fencing to prevent multiple leaders from trying to write to the DB at
the same time.  If it is the former we need some better code so the follower
can read/update the replica it has without the possibility of trying to
auto-vivify a node that is being deleted.  Generally in these cases we
would declare the race safe and then just start the sync process over again.
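
A minimal sketch of the "declare the race safe and start over" idea, under
stated assumptions: SyncPass, KeyDeletedDuringSyncException and
syncWithRestart are hypothetical names for illustration, not part of the
Storm BlobStore code. A sync pass that notices a key disappearing underneath
it aborts, and the whole pass is simply run again, up to a bounded number of
attempts.

```java
public class RestartableSyncSketch {
    /** Hypothetical signal that a key vanished mid-sync (e.g. its ZK path was removed). */
    public static class KeyDeletedDuringSyncException extends Exception {
        public KeyDeletedDuringSyncException(String key) {
            super("key deleted during sync: " + key);
        }
    }

    /** Hypothetical abstraction of one full sync pass over all blob keys. */
    public interface SyncPass {
        void run() throws KeyDeletedDuringSyncException;
    }

    /**
     * Runs the sync pass, restarting it from scratch whenever a concurrent
     * deletion is detected. Returns the number of attempts that were needed;
     * rethrows the last failure once maxAttempts passes have all been aborted.
     */
    public static int syncWithRestart(SyncPass pass, int maxAttempts)
            throws KeyDeletedDuringSyncException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        KeyDeletedDuringSyncException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                pass.run();
                return attempt; // pass completed without hitting the race
            } catch (KeyDeletedDuringSyncException e) {
                last = e; // treat the race as safe and restart the pass
            }
        }
        throw last;
    }
}
```

The bound on attempts is a design choice in this sketch so that a follower
cannot spin forever; since the replicas are only eventually consistent, a
later scheduled sync could pick up where this one gave up.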


- Bobby


On Monday, January 23, 2017, 2:12:18 AM CST, Jungtaek Lim <[email protected]>
wrote:
Hi devs,

I've been struggling to resolve a specific scenario, and found that Local
BlobStore accounts for Nimbus failure scenarios, but not for the removal of
keys.

For example, I heard that Nimbus crashed in a specific scenario, and the error
stack trace pointed to the code below:
https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L138-L149

checkExists (L138
<https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L138>)
succeeds but getChildren (L149
<https://github.com/apache/storm/blob/1.x-branch/storm-core/src/jvm/org/apache/storm/blobstore/KeySequenceNumber.java#L149>)
throws NoNodeException; as a result, sequenceNumbers.last() throws
NoSuchElementException.
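
A minimal sketch of one way to narrow this check-then-act gap, under stated
assumptions: ChildLister and NoNodeException below are hypothetical stand-ins
for the ZK client and ZooKeeper's exception, not the real Curator API, and the
child-name format ("<host>:<port>-<sequence number>") is assumed for
illustration. Instead of calling checkExists first, the code lists the
children directly, treats a missing node as "no sequence numbers", and never
calls last() on an empty set.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

public class SequenceNumberRaceSketch {
    /** Hypothetical stand-in for ZooKeeper's NoNodeException. */
    public static class NoNodeException extends Exception {
        public NoNodeException(String path) {
            super("no node: " + path);
        }
    }

    /** Hypothetical stand-in for the ZK client call that lists child nodes. */
    public interface ChildLister {
        List<String> getChildren(String path) throws NoNodeException;
    }

    /**
     * Returns the highest sequence number found under path, or empty if the
     * node was deleted concurrently or has no children.
     */
    public static Optional<Integer> maxSequenceNumber(ChildLister zk, String path) {
        List<String> children;
        try {
            children = zk.getChildren(path);
        } catch (NoNodeException e) {
            // The key was removed between any earlier check and this call.
            return Optional.empty();
        }
        TreeSet<Integer> sequenceNumbers = new TreeSet<>();
        for (String child : children) {
            // Assumed format: everything after the last '-' is the sequence number.
            sequenceNumbers.add(Integer.parseInt(child.substring(child.lastIndexOf('-') + 1)));
        }
        return sequenceNumbers.isEmpty()
                ? Optional.empty()
                : Optional.of(sequenceNumbers.last());
    }
}
```

A caller would then decide what an empty result means in its context, for
example skipping the key or restarting the sync.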

We could take a deeper look and add some workarounds, but given that ZK is
accessible from every Nimbus, we can't ensure that every path is safe.

I guess that BlobStore needs a global lock or a single controller to handle
everything correctly. I'm also open to any workarounds or other ideas.

What do you think?

Thanks,
Jungtaek Lim (HeartSaVioR)
