Diogo Pereira created STORM-4154:
------------------------------------
Summary: Nimbus down following topology deployment
Key: STORM-4154
URL: https://issues.apache.org/jira/browse/STORM-4154
Project: Apache Storm
Issue Type: Bug
Components: storm-server
Affects Versions: 2.7.0
Reporter: Diogo Pereira
When deploying or terminating a topology in a distributed cluster, we
occasionally encounter downtime on the Nimbus machines. Below is an example
stack trace:
{code:java}
2025-01-07T08:56:48.088Z Utils [ERROR] Received error in thread
BLOB-STORE-TIMER.. terminating server...
java.lang.Error: java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException:
Internal error processing createStateInZookeeper
at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:666)
~[storm-client-2.7.0.jar:2.7.0]
at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:670)
~[storm-client-2.7.0.jar:2.7.0]
at
org.apache.storm.utils.Utils.lambda$createDefaultUncaughtExceptionHandler$2(Utils.java:1053)
~[storm-client-2.7.0.jar:2.7.0]
at
java.base/java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1055) [?:?]
at
java.base/java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1050) [?:?]
at
java.base/java.lang.Thread.dispatchUncaughtException(Thread.java:1997) [?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException:
Internal error processing createStateInZookeeper
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:199)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.storm.thrift.TApplicationException: Internal error processing
createStateInZookeeper
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.syncBlobs(LocalFsBlobStoreSynchronizer.java:106)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore.blobSync(LocalFsBlobStore.java:174)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:197)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException:
Internal error processing createStateInZookeeper
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.updateKeySetForBlobStore(LocalFsBlobStoreSynchronizer.java:128)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.syncBlobs(LocalFsBlobStoreSynchronizer.java:84)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore.blobSync(LocalFsBlobStore.java:174)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:197)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.storm.thrift.TApplicationException: Internal error processing
createStateInZookeeper
at
org.apache.storm.blobstore.BlobStoreUtils.updateKeyForBlobStore(BlobStoreUtils.java:285)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.updateKeySetForBlobStore(LocalFsBlobStoreSynchronizer.java:125)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.syncBlobs(LocalFsBlobStoreSynchronizer.java:84)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore.blobSync(LocalFsBlobStore.java:174)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:197)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?]
Caused by: java.lang.RuntimeException:
org.apache.storm.thrift.TApplicationException: Internal error processing
createStateInZookeeper
at
org.apache.storm.blobstore.NimbusBlobStore.createStateInZookeeper(NimbusBlobStore.java:139)
~[storm-client-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.BlobStoreUtils.createStateInZookeeper(BlobStoreUtils.java:242)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.BlobStoreUtils.updateKeyForBlobStore(BlobStoreUtils.java:279)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.updateKeySetForBlobStore(LocalFsBlobStoreSynchronizer.java:125)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.syncBlobs(LocalFsBlobStoreSynchronizer.java:84)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore.blobSync(LocalFsBlobStore.java:174)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:197)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?]
Caused by: org.apache.storm.thrift.TApplicationException: Internal error
processing createStateInZookeeper
at
org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:81)
~[storm-shaded-deps-2.7.0.jar:2.7.0]
at
org.apache.storm.generated.Nimbus$Client.recv_createStateInZookeeper(Nimbus.java:1065)
~[storm-client-2.7.0.jar:2.7.0]
at
org.apache.storm.generated.Nimbus$Client.createStateInZookeeper(Nimbus.java:1052)
~[storm-client-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.NimbusBlobStore.createStateInZookeeper(NimbusBlobStore.java:136)
~[storm-client-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.BlobStoreUtils.createStateInZookeeper(BlobStoreUtils.java:242)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.BlobStoreUtils.updateKeyForBlobStore(BlobStoreUtils.java:279)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.updateKeySetForBlobStore(LocalFsBlobStoreSynchronizer.java:125)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStoreSynchronizer.syncBlobs(LocalFsBlobStoreSynchronizer.java:84)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore.blobSync(LocalFsBlobStore.java:174)
~[storm-server-2.7.0.jar:2.7.0]
at
org.apache.storm.blobstore.LocalFsBlobStore$1.run(LocalFsBlobStore.java:197)
~[storm-server-2.7.0.jar:2.7.0]
at java.base/java.util.TimerThread.mainLoop(Timer.java:556) ~[?:?]
at java.base/java.util.TimerThread.run(Timer.java:506) ~[?:?] {code}
h3. Root Cause
This issue is caused by a race condition during blob synchronization. While
Nimbus is creating the ZooKeeper state for a recently downloaded blob, the key
it is looking up can disappear on some machines. The lookup then fails with a
Thrift exception that is not handled properly, which crashes the Nimbus
process.
The issue lies more specifically in this function:
{code:java}
public void createStateInZookeeper(String key) throws TException {
    try {
        IStormClusterState state = stormClusterState;
        BlobStore store = blobStore;
        NimbusInfo ni = nimbusHostPortInfo;
        if (store instanceof LocalFsBlobStore) {
            state.setupBlob(key, ni, getVersionForKey(key, ni, zkClient));
        }
        LOG.debug("Created state in zookeeper {} {} {}", state, store, ni);
    } catch (Exception e) {
        LOG.warn("Exception while creating state in zookeeper - key: " + key, e);
        if (e instanceof TException) {
            throw (TException) e;
        }
        throw new RuntimeException(e);
    }
}
{code}
Here the {{getVersionForKey}} method can throw a {{KeyNotFoundException}},
which is not handled properly. Instead, it is simply wrapped in a
{{RuntimeException}}. This exception then propagates up to the {{syncBlobs}}
function, which doesn't handle the error, ultimately causing the Nimbus server
to terminate.
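One possible direction for a fix is to treat a vanished key as a skippable condition instead of a fatal one. The sketch below is not Storm code and not a proposed patch: {{BlobStateSketch}}, the in-memory {{versions}} and {{zkState}} maps, and the simplified {{KeyNotFoundException}} are hypothetical stand-ins for the blob store, the ZooKeeper state, and Storm's exception, used only to illustrate the idea under those assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BlobStateSketch {
    /** Simplified, hypothetical stand-in for Storm's KeyNotFoundException. */
    static class KeyNotFoundException extends Exception {
        KeyNotFoundException(String key) {
            super("Key not found: " + key);
        }
    }

    /** Hypothetical stand-in for the blob versions known to the blob store. */
    static final Map<String, Integer> versions = new ConcurrentHashMap<>();
    /** Hypothetical stand-in for the per-key state written to ZooKeeper. */
    static final Map<String, Integer> zkState = new ConcurrentHashMap<>();

    /** Looks up the version of a key; fails if the key has disappeared. */
    static int getVersionForKey(String key) throws KeyNotFoundException {
        Integer version = versions.get(key);
        if (version == null) {
            throw new KeyNotFoundException(key);
        }
        return version;
    }

    /**
     * Creates the state for one key. Returns true on success and false when
     * the key vanished in the meantime, instead of rethrowing the failure.
     */
    static boolean createStateInZookeeper(String key) {
        try {
            zkState.put(key, getVersionForKey(key));
            return true;
        } catch (KeyNotFoundException e) {
            // Assumption: a key deleted concurrently can simply be skipped,
            // because there is no longer a blob to record state for.
            System.err.println("Skipping missing key: " + e.getMessage());
            return false;
        }
    }

    /** Syncs all keys and keeps going past missing ones; returns the number created. */
    static int syncBlobs(List<String> keys) {
        int created = 0;
        for (String key : keys) {
            if (createStateInZookeeper(key)) {
                created++;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        versions.put("topo-1-stormjar.jar", 3);
        // The second key simulates a blob removed while the sync was running.
        int created = syncBlobs(List.of("topo-1-stormjar.jar", "deleted-key"));
        System.out.println("States created: " + created); // prints "States created: 1"
    }
}
```

Whether skipping is the right behavior in Storm itself depends on how the real {{getVersionForKey}} and {{setupBlob}} interact with ZooKeeper; the point of the sketch is only that the missing-key case is handled where it occurs, so it never reaches the timer thread as an uncaught exception.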