[
https://issues.apache.org/jira/browse/KARAF-7948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986867#comment-17986867
]
Vineeth commented on KARAF-7948:
--------------------------------
Hi [~jbonofre],
Thanks for checking the issue.
A few more points on this case: I have an AIX 7.3 system where I’m able to
replicate the issue using IBM J9 Java 11.
From the thread dump, I was able to locate one issue in the
*org.apache.karaf.main.Main* class, in {{awaitShutdown()}}:
{code:java}
public void awaitShutdown() throws Exception {
    if (framework == null) {
        return;
    }
    while (true) {
        FrameworkEvent event = framework.waitForStop(0);
        if (event.getType() == FrameworkEvent.STOPPED_UPDATE) {
            if (lock != null) {
                lock.release();
            }
            while (framework.getState() != Bundle.STARTING &&
                    framework.getState() != Bundle.ACTIVE) {
                Thread.sleep(10);
            }
            monitorThread = monitor();
        } else {
            return;
        }
    }
}
{code}
The main thread calls {{framework.waitForStop(0)}}. A timeout of 0 means it
waits indefinitely.
This calls into {{ThreadGate.await()}} in the Felix framework, which uses
{{Object.wait()}} to block until notified.
The parameter 0 means "wait forever", so there is no timeout safety mechanism,
correct?
For the main thread to unblock, the ThreadGate object needs to receive a
notification via {{notify()}} or {{notifyAll()}}.
This notification should come from the Felix framework when it completes
shutdown.
However, the Felix framework threads (FelixDispatchQueue, FelixFrameworkWiring,
FelixStartLevel) are all waiting themselves.
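To illustrate the blocking behaviour described above, here is a minimal sketch of a one-shot gate built on {{Object.wait()}} / {{notifyAll()}}. {{SimpleGate}} is a hypothetical stand-in written for this comment, not the Felix {{ThreadGate}} source; it only shows why a timeout of 0 never returns unless some other thread opens the gate.

```java
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for a one-shot gate; NOT the Felix ThreadGate source. */
final class SimpleGate {
    private boolean open;

    /** Opens the gate and wakes every thread blocked in await(). */
    synchronized void open() {
        open = true;
        notifyAll();
    }

    /**
     * Blocks until the gate opens or the timeout elapses.
     * A timeout of 0 follows Object.wait(0) semantics: wait with no deadline.
     *
     * @return true if the gate is open on return, false if the wait timed out
     */
    synchronized boolean await(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!open) {
            if (timeoutMillis == 0) {
                // No deadline: only open() (via notifyAll) can release this thread.
                wait();
            } else {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    break; // deadline passed without the gate opening
                }
                wait(remainingMillis);
            }
        }
        return open;
    }
}
```

With this shape, {{await(0)}} on a gate that nobody opens blocks forever, which matches the state:CW main thread in the dump, while {{await(30000)}} would return {{false}} after about 30 seconds.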
_*+Shouldn't a timeout be set here, something like 30 seconds (30000 ms)?
What do you think? Please feel free to correct me.+*_
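One caveat with simply replacing 0 by 30000: in the OSGi API, a {{waitForStop}} call that times out returns a {{FrameworkEvent}} of type {{WAIT_TIMEDOUT}}, and in the loop quoted above that would fall into the {{else \{ return; \}}} branch, so {{awaitShutdown()}} would return after 30 seconds even on a healthy, running framework. A rough sketch of a sliced wait that handles the timeout explicitly could look like the following. {{StopEvent}}, {{StopWaiter}} and {{BoundedShutdownWait}} are hypothetical stand-ins so the example compiles without the OSGi jars; the {{giveUp}} hook is also an assumption (for example "a stop was requested and has been pending too long").

```java
import java.util.function.BooleanSupplier;

/** Stand-in for the event types waitForStop can report (names mirror FrameworkEvent constants). */
enum StopEvent { STOPPED, STOPPED_UPDATE, WAIT_TIMEDOUT }

/** Hypothetical stand-in for the single framework call the loop needs. */
interface StopWaiter {
    StopEvent waitForStop(long timeoutMillis) throws InterruptedException;
}

final class BoundedShutdownWait {
    /**
     * Waits for the framework to stop in slices of sliceMillis.
     * After every timed-out slice the giveUp hook is consulted; this is where a
     * real implementation could log, take a thread dump, or decide to exit.
     *
     * @return the stop event, or WAIT_TIMEDOUT if giveUp returned true
     */
    static StopEvent awaitStop(StopWaiter framework, long sliceMillis, BooleanSupplier giveUp)
            throws InterruptedException {
        while (true) {
            StopEvent event = framework.waitForStop(sliceMillis);
            if (event != StopEvent.WAIT_TIMEDOUT) {
                return event; // framework really stopped (or stopped for update)
            }
            if (giveUp.getAsBoolean()) {
                return StopEvent.WAIT_TIMEDOUT; // caller decides what to do next
            }
            // Otherwise keep waiting: a timeout alone does not mean the framework stopped.
        }
    }
}
```

This only gives the main thread a chance to wake up and react; it would not by itself fix whatever keeps the Felix threads from completing the shutdown.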
In the thread dump:
{code:java}
3XMTHREADINFO "main" J9VMThread:0x0000000030010700,
omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0, state:CW,
prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1, isDaemon:false)
3XMJAVALTHRCCL
jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
3XMTHREADINFO1 (native thread ID:0x1320327, native priority:0x5,
native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2 (native stack address range from:0x0000010010017D90,
to:0x000001001009D788, size:0x859F8)
3XMCPUTIME CPU usage total: 0.449041000 secs, user: 0.373210000
secs, system: 0.075831000 secs, current category="Application"
3XMTHREADBLOCK Waiting on:
org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8 Owned by:
<unowned>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at java/lang/Object.waitImpl(Native Method)
4XESTACKTRACE at java/lang/Object.wait(Object.java:251)
4XESTACKTRACE at java/lang/Object.wait(Object.java:219)
4XESTACKTRACE at
org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
4XESTACKTRACE at
org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
4XESTACKTRACE at
org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
4XESTACKTRACE at org/apache/karaf/main/Main.main(Main.java:190)
3XMTHREADINFO3 Native callstack:
4XENATIVESTACK _event_wait+0x2c (0x09000000006AF470
[libpthreads.a+0x19470])
4XENATIVESTACK _cond_wait_local+0x2e4 (0x09000000006B75E8
[libpthreads.a+0x215e8])
4XENATIVESTACK _cond_wait+0x34 (0x09000000006B7D38
[libpthreads.a+0x21d38])
4XENATIVESTACK pthread_cond_wait+0x1a8 (0x09000000006B880C
[libpthreads.a+0x2280c])
4XENATIVESTACK IPRA.$monitor_wait_original+0xa10
(0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
4XENATIVESTACK omrthread_monitor_wait_interruptable+0x50
(0x0900000000D89474 [libj9thr29.so+0xa474])
4XENATIVESTACK monitorWaitImpl+0x488 (0x090000000D84668C
[libj9vm29.so+0x8168c])
4XENATIVESTACK (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
4XENATIVESTACK (0x090000000D8A9168 [libj9vm29.so+0xe4168])
4XENATIVESTACK runCallInMethod+0x2d0 (0x090000000D8073F4
[libj9vm29.so+0x423f4])
4XENATIVESTACK _ZL26gpProtectedRunCallInMethodPv+0x4c
(0x090000000D7C66F0 [libj9vm29.so+0x16f0])
4XENATIVESTACK signalProtectAndRunGlue+0x28 (0x090000000D822A8C
[libj9vm29.so+0x5da8c])
4XENATIVESTACK omrsig_protect+0x4fc (0x090000000D555760
[libj9prt29.so+0x5f760])
4XENATIVESTACK gpProtectAndRun+0xf0 (0x090000000D8227F4
[libj9vm29.so+0x5d7f4])
4XENATIVESTACK gpCheckCallin+0x118 (0x090000000D7C665C
[libj9vm29.so+0x165c])
4XENATIVESTACK callStaticVoidMethod+0x44 (0x090000000D8E9728
[libj9vm29.so+0x124728])
4XENATIVESTACK JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
4XENATIVESTACK ThreadJavaMain+0xc (0x000001000000DCB0
[java+0xdcb0])
4XENATIVESTACK (0x0900000000089214 [libpthreads.a+0x214]){code}
I'm not entirely sure how everything is linked together, but it seems to be
part of a larger race condition, which eventually leaves the Karaf bundles in a
hung state. The issue can be easily reproduced in server mode.
Use the {{/bin/start}} script, monitor the status with {{/bin/status}}, and
once the server has started completely, run the {{/bin/stop}} script; repeat
this in a loop.
After some iterations the system hangs. I managed to capture thread and heap
dumps while it was in that state, which gave me the insights I’ve shared here.
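The reproduction steps above could be automated roughly as follows. This is only a sketch under several assumptions: that each iteration is a full start, wait, stop cycle; that {{bin/status}} exits with 0 while Karaf is running and non-zero otherwise; and that a hang shows up as the status never changing within the polling budget. {{CommandRunner}} and {{StopLoopRepro}} are hypothetical names, and the runner is injected so the loop can be exercised without a Karaf installation.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Runs one command and returns its exit code. */
interface CommandRunner {
    int run(List<String> command) throws IOException, InterruptedException;
}

final class StopLoopRepro {
    /** Real runner: starts the process, inherits stdout/stderr, waits for it to exit. */
    static int exec(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    /**
     * Repeats: start, poll status until running, stop, poll status until not running.
     *
     * @return 0 if every cycle completed, otherwise the 1-based iteration that stalled
     */
    static int cycle(Path karafHome, int iterations, int maxPolls, long pollMillis,
                     CommandRunner runner) throws IOException, InterruptedException {
        Path bin = karafHome.resolve("bin");
        for (int i = 1; i <= iterations; i++) {
            runner.run(List.of(bin.resolve("start").toString()));
            if (!pollStatus(bin, runner, true, maxPolls, pollMillis)) {
                return i; // never reported running
            }
            runner.run(List.of(bin.resolve("stop").toString()));
            if (!pollStatus(bin, runner, false, maxPolls, pollMillis)) {
                return i; // still reported running after stop: candidate hang
            }
        }
        return 0;
    }

    /** Polls bin/status until it matches wantRunning; assumes exit code 0 means running. */
    private static boolean pollStatus(Path bin, CommandRunner runner, boolean wantRunning,
                                      int maxPolls, long pollMillis)
            throws IOException, InterruptedException {
        for (int p = 0; p < maxPolls; p++) {
            boolean running = runner.run(List.of(bin.resolve("status").toString())) == 0;
            if (running == wantRunning) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        return false;
    }
}
```

For a real run, something like {{StopLoopRepro.cycle(Path.of("/path/to/karaf"), 100, 60, 1000L, StopLoopRepro::exec)}} would apply, with the path and limits adjusted to the installation; a non-zero result marks the iteration at which to capture the thread and heap dumps.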
Thanks
> Apache karaf got into stuck state
> ---------------------------------
>
> Key: KARAF-7948
> URL: https://issues.apache.org/jira/browse/KARAF-7948
> Project: Karaf
> Issue Type: Bug
> Components: karaf
> Affects Versions: 4.2.16
> Reporter: Vineeth
> Assignee: Jean-Baptiste Onofré
> Priority: Major
>
> We have a customer case where Apache Karaf (agent) is installed as a *systemd*
> [service|https://github.com/apache/karaf/blob/main/assemblies/features/base/src/main/resources/resources/bin/contrib/karaf-service-template.systemd].
> When the virtual machine (VM) is restarted, where Apache Karaf is set up as
> a service, the Karaf process sometimes enters a {*}stuck state{*}. Although
> the PID is still running, Karaf becomes completely unresponsive.
> There is no definite pattern — it happens intermittently. We investigated
> multiple heap dumps but couldn't find any clear clues. However, in one
> instance, we observed *two instances* of the {{FeaturesServiceImpl}} class,
> which suggests that the Activator may have initialized
> {{FeaturesServiceImpl}} {*}twice{*}, even though this only occurred once
> during our tests.
> Later, we checked the state of each bundle. We noticed that the
> mvn:org.apache.karaf.features/org.apache.karaf.features.extension/4.2.16
> bundle and the mvn:org.ops4j.pax.logging/pax-logging-log4j2-extra/1.11.15
> bundle were in *state 2* (installed), while all other bundles were in *state
> 32* (active). This points to a possible *race condition* or a low-level
> crash occurring during startup.
> We were able to reproduce the issue by looping a restart of the systemd
> service using {{systemctl restart agent.service}} every 4 minutes. However,
> we have been unable to pinpoint the exact cause of the problem.
> Please review this and let us know if you can help us further diagnose the
> issue.
> Details
> Customer uses SUSE Linux on z, installed as an RPM package using zypper.
> Apache Karaf version: 4.2.16
> Log: No log is written after startup; nothing is written at all.
> We can share more details if needed.