Vineeth created KARAF-7969:
------------------------------
Summary: Apache Karaf bundles entered a hang state due to a thread
wait
Key: KARAF-7969
URL: https://issues.apache.org/jira/browse/KARAF-7969
Project: Karaf
Issue Type: Bug
Components: karaf
Affects Versions: 4.2.16
Reporter: Vineeth
Hello
I have an AIX 7.3 system where I’m able to replicate the issue using IBM J9
Java 11.
From the thread dump, I was able to locate one issue in the
*org.apache.karaf.main.Main* class, in this method:
public void awaitShutdown() throws Exception {
    if (framework == null) {
        return;
    }
    while (true) {
        FrameworkEvent event = framework.waitForStop(0);
        if (event.getType() == FrameworkEvent.STOPPED_UPDATE) {
            if (lock != null) {
                lock.release();
            }
            while (framework.getState() != Bundle.STARTING
                    && framework.getState() != Bundle.ACTIVE) {
                Thread.sleep(10);
            }
            monitorThread = monitor();
        } else {
            return;
        }
    }
}
The main thread calls {{framework.waitForStop(0)}} with a timeout of 0, meaning
it will wait indefinitely.
This calls into ThreadGate.await() in the Felix framework, which uses
Object.wait() to block until notified.
The parameter 0 means "wait forever", so there is no timeout safety mechanism,
correct? For the main thread to unblock, the ThreadGate object needs to receive
a notification via notify() or notifyAll().
This notification should come from the Felix framework when it completes
shutdown.
However, the Felix framework threads (FelixDispatchQueue, FelixFrameworkWiring,
FelixStartLevel) are all waiting themselves.
_*+Shouldn't a timeout be set here, something like 30 seconds (30000 ms)?
What do you think? Please feel free to correct me.+*_
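To illustrate the difference between waiting forever and waiting with a timeout, here is a minimal, self-contained sketch. SimpleGate is a hypothetical stand-in written for this example; it is not the Felix ThreadGate source, and the 200 ms value is arbitrary. It only shows the general wait()/notifyAll() pattern, where a timeout of 0 blocks until notified and a positive timeout lets the caller give up.

```java
import java.util.concurrent.TimeUnit;

class GateTimeoutSketch {

    // Hypothetical gate for illustration only; NOT the Felix ThreadGate.
    static final class SimpleGate {
        private boolean open = false;

        // Opens the gate and wakes every thread blocked in await().
        public synchronized void open() {
            open = true;
            notifyAll();
        }

        // Waits until the gate opens. A timeout of 0 means "wait forever",
        // mirroring Object.wait(0). Returns true if the gate opened,
        // false if the timeout elapsed first.
        public synchronized boolean await(long timeoutMillis) throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (!open) { // loop guards against spurious wakeups
                if (timeoutMillis == 0) {
                    wait(); // blocks until notifyAll() is called
                } else {
                    long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                    if (remaining <= 0) {
                        return false; // timed out, caller can react
                    }
                    wait(remaining);
                }
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Nobody ever opens this gate: with a timeout the caller gets control back.
        SimpleGate neverOpened = new SimpleGate();
        System.out.println("opened=" + neverOpened.await(200));

        // Another thread opens this gate, so even an unbounded wait returns.
        SimpleGate gate = new SimpleGate();
        Thread opener = new Thread(gate::open);
        opener.start();
        System.out.println("opened=" + gate.await(0));
        opener.join();
    }
}
```

If neverOpened.await(0) were called instead, the calling thread would block indefinitely, which matches the state of the "main" thread in the dump. Whether a timeout is the right fix in Karaf is an open question, since the caller would still need to decide what to do when the wait times out.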
In the thread dump:
3XMTHREADINFO "main" J9VMThread:0x0000000030010700,
omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0, state:CW,
prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1, isDaemon:false)
3XMJAVALTHRCCL
jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
3XMTHREADINFO1 (native thread ID:0x1320327, native priority:0x5,
native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2 (native stack address range from:0x0000010010017D90,
to:0x000001001009D788, size:0x859F8)
3XMCPUTIME CPU usage total: 0.449041000 secs, user: 0.373210000
secs, system: 0.075831000 secs, current category="Application"
3XMTHREADBLOCK Waiting on: org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8
Owned by: <unowned>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at java/lang/Object.waitImpl(Native Method)
4XESTACKTRACE at java/lang/Object.wait(Object.java:251)
4XESTACKTRACE at java/lang/Object.wait(Object.java:219)
4XESTACKTRACE at
org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
4XESTACKTRACE at
org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
4XESTACKTRACE at
org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
4XESTACKTRACE at org/apache/karaf/main/Main.main(Main.java:190)
3XMTHREADINFO3 Native callstack:
4XENATIVESTACK _event_wait+0x2c (0x09000000006AF470
[libpthreads.a+0x19470])
4XENATIVESTACK _cond_wait_local+0x2e4 (0x09000000006B75E8
[libpthreads.a+0x215e8])
4XENATIVESTACK _cond_wait+0x34 (0x09000000006B7D38
[libpthreads.a+0x21d38])
4XENATIVESTACK pthread_cond_wait+0x1a8 (0x09000000006B880C
[libpthreads.a+0x2280c])
4XENATIVESTACK IPRA.$monitor_wait_original+0xa10
(0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
4XENATIVESTACK omrthread_monitor_wait_interruptable+0x50
(0x0900000000D89474 [libj9thr29.so+0xa474])
4XENATIVESTACK monitorWaitImpl+0x488 (0x090000000D84668C
[libj9vm29.so+0x8168c])
4XENATIVESTACK (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
4XENATIVESTACK (0x090000000D8A9168 [libj9vm29.so+0xe4168])
4XENATIVESTACK runCallInMethod+0x2d0 (0x090000000D8073F4
[libj9vm29.so+0x423f4])
4XENATIVESTACK _ZL26gpProtectedRunCallInMethodPv+0x4c
(0x090000000D7C66F0 [libj9vm29.so+0x16f0])
4XENATIVESTACK signalProtectAndRunGlue+0x28 (0x090000000D822A8C
[libj9vm29.so+0x5da8c])
4XENATIVESTACK omrsig_protect+0x4fc (0x090000000D555760
[libj9prt29.so+0x5f760])
4XENATIVESTACK gpProtectAndRun+0xf0 (0x090000000D8227F4
[libj9vm29.so+0x5d7f4])
4XENATIVESTACK gpCheckCallin+0x118 (0x090000000D7C665C
[libj9vm29.so+0x165c])
4XENATIVESTACK callStaticVoidMethod+0x44 (0x090000000D8E9728
[libj9vm29.so+0x124728])
4XENATIVESTACK JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
4XENATIVESTACK ThreadJavaMain+0xc (0x000001000000DCB0
[java+0xdcb0])
4XENATIVESTACK (0x0900000000089214 [libpthreads.a+0x214])
I'm not entirely sure how everything is linked together, but this looks like
part of a larger race condition that eventually leaves the Karaf bundles in a
hang state. The issue is easy to reproduce in server mode.
Use the {{/bin/start}} script, monitor the status with {{/bin/status}}, and once
the server has started completely, run the {{/bin/stop}} script in a loop.
After some iterations the system enters a hang state. I captured thread and
heap dumps while it was hung, which gave me the insights I’ve shared here.
Thanks