lucabro81 opened a new issue, #193:
URL: https://github.com/apache/openserverless/issues/193

   ### Ⅰ. Issue Description
   
   OpenServerless installation fails on Ubuntu 24.04 with kernel 6.14 due to a 
JVM cgroup v2 compatibility issue. The Kafka and Controller pods crash with a 
`NullPointerException` in `CgroupV2Subsystem.getInstance`.
   
   
   ### Ⅱ. Describe what happened
   
   Kafka pod crashes immediately on startup with the following exception:
   ```
   Exception in thread "main" java.lang.reflect.InvocationTargetException
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
       at java.base/java.lang.reflect.Method.invoke(Unknown Source)
       at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(Unknown Source)
       at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(Unknown Source)
   Caused by: java.lang.NullPointerException
       at java.base/jdk.internal.platform.cgroupv2.CgroupV2Subsystem.getInstance(Unknown Source)
       at java.base/jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
       at java.base/jdk.internal.platform.CgroupMetrics.getInstance(Unknown Source)
       at java.base/jdk.internal.platform.SystemMetrics.instance(Unknown Source)
       at java.base/jdk.internal.platform.Metrics.systemMetrics(Unknown Source)
       at java.base/jdk.internal.platform.Container.metrics(Unknown Source)
       at jdk.management/com.sun.management.internal.OperatingSystemImpl.<init>(Unknown Source)
       at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl.getOperatingSystemMXBean(Unknown Source)
       at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl$3.nameToMBeanMap(Unknown Source)
       at java.management/sun.management.spi.PlatformMBeanProvider$PlatformComponent.getMBeans(Unknown Source)
       at java.management/java.lang.management.ManagementFactory.getPlatformMXBean(Unknown Source)
       at java.management/java.lang.management.ManagementFactory.getOperatingSystemMXBean(Unknown Source)
       at io.prometheus.jmx.shaded.io.prometheus.client.hotspot.StandardExports.<init>(StandardExports.java:43)
       at io.prometheus.jmx.shaded.io.prometheus.client.hotspot.DefaultExports.register(DefaultExports.java:37)
       at io.prometheus.jmx.shaded.io.prometheus.client.hotspot.DefaultExports.initialize(DefaultExports.java:28)
       at io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:30)
       ... 6 more
   *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
   FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed
   ```
   
   The Controller pod never starts, so the installation hangs indefinitely 
waiting for `pod/controller-0` to become available.
   
   ### Ⅲ. Describe what you expected to happen
   
   Installation should complete successfully with all pods running and healthy, 
allowing user creation and login.
   
   ### Ⅳ. How to reproduce it (as minimally and precisely as possible)
   
   1. Install Ubuntu 24.04 on hardware running the NVIDIA kernel 6.14
   2. Configure `ops` with minimal settings
   3. Run the installation
   4. The installation fails with a timeout waiting for `controller-0`
   
   ### Ⅴ. Anything else we need to know?
   
   **Root Cause Analysis:**
   
   This is a known issue with JVM versions prior to JDK 21 running on Linux 
kernel 6.12+ with cgroup v2. The problem occurs because:
   
   1. Ubuntu 24.04 uses kernel 6.x + systemd 256+ which enforces cgroup v2
   2. The memory cgroup controller is missing from `/proc/cgroups`:
   ```bash
   $ cat /proc/cgroups
   #subsys_name    hierarchy    num_cgroups    enabled
   cpu    0    860    1
   cpuacct    0    860    1
   blkio    0    860    1
   devices    0    860    1
   freezer    0    860    1
   net_cls    0    860    1
   perf_event    0    860    1
   net_prio    0    860    1
   hugetlb    0    860    1
   pids    0    860    1
   rdma    0    860    1
   misc    0    860    1
   dmem    0    860    1
   ```
      Note: No "memory" controller present
   3. JVM's `CgroupV2Subsystem.getInstance()` expects the memory controller and 
crashes with NPE when it's missing
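   
   The check in step 2 can be scripted. The sketch below is a rough helper, not 
   part of `ops`: the function names are hypothetical, and it assumes the cgroup 
   v2 root is mounted at `/sys/fs/cgroup`. It compares `/proc/cgroups` with 
   `cgroup.controllers`, since the memory controller may still be available to 
   cgroup v2 even when `/proc/cgroups` no longer lists it (not verified on the 
   host in this report).
   
   ```bash
   # Hypothetical helpers for inspecting the memory controller.
   
   # /proc/cgroups is a table: controller name in column 1, header starts with '#'.
   in_proc_cgroups() {
       awk -v name="$2" '!/^#/ && $1 == name { found = 1 } END { exit !found }' "$1"
   }
   
   # cgroup.controllers is a single space-separated line of controller names.
   in_cgroup_controllers() {
       tr ' ' '\n' < "$1" | grep -qx "$2"
   }
   
   # Report where (if anywhere) the memory controller is visible.
   # Both paths can be overridden, which makes the function easy to try on copies.
   report_memory_controller() {
       proc_file=${1:-/proc/cgroups}
       v2_file=${2:-/sys/fs/cgroup/cgroup.controllers}
       if in_proc_cgroups "$proc_file" memory; then
           echo "memory listed in $proc_file"
       elif [ -r "$v2_file" ] && in_cgroup_controllers "$v2_file" memory; then
           echo "memory missing from $proc_file but available in $v2_file"
       else
           echo "memory controller not found"
       fi
   }
   ```
   
   Running `report_memory_controller` with no arguments on the affected host 
   should show which of the three cases applies.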
   
   **Temporary Workaround:**
   
   Force cgroup v1 by adding a kernel boot parameter:
   ```bash
   # In /etc/default/grub:
   GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
   # Then: sudo update-grub && sudo reboot
   ```
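   
   After rebooting, it may be worth confirming which hierarchy is actually 
   mounted before re-running the installation. A minimal sketch, assuming the 
   usual mount point `/sys/fs/cgroup` and GNU `stat`; the `cgroup_mode` function 
   name is hypothetical:
   
   ```bash
   # Map the filesystem type of /sys/fs/cgroup to a cgroup mode.
   # cgroup2fs means the unified (v2) hierarchy; tmpfs means v1 (or hybrid).
   cgroup_mode() {
       case "$1" in
           cgroup2fs) echo "v2" ;;
           tmpfs)     echo "v1" ;;
           *)         echo "unknown" ;;
       esac
   }
   
   # Fall back to "unknown" if stat fails or the path is not mounted.
   fs_type=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
   echo "cgroup hierarchy: $(cgroup_mode "$fs_type")"
   ```
   
   If this still reports `v2` after the reboot, the boot parameter was probably 
   not applied (`cat /proc/cmdline` shows what the kernel actually received).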
   
   **Permanent Solution:**
   
   Update Docker images to use JDK 21+ or apply JVM patches for cgroup v2 
compatibility. Similar issues have been fixed in:
   - Elasticsearch 7.17.26+
   - OpenJDK bug tracker: https://bugs.openjdk.org/browse/JDK-8287107
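   
   To see which JDK an image actually ships (the host's Java version does not 
   affect the pods), the first line of `java -version` can be parsed. This is 
   only a sketch: `java_major` is a hypothetical helper, and the namespace and 
   pod name in the usage comment are placeholders to be replaced with the real 
   ones.
   
   ```bash
   # Print the major Java version from the first line of `java -version` output.
   # Handles the legacy scheme (1.8.0_472 -> 8) and the modern one (17.0.9 -> 17).
   java_major() {
       ver=$(printf '%s\n' "$1" | sed -n '1s/.*version "\([^"]*\)".*/\1/p')
       case "$ver" in
           1.*) echo "$ver" | cut -d. -f2 ;;
           *)   echo "$ver" | cut -d. -f1 | sed 's/[^0-9].*//' ;;
       esac
   }
   
   # Usage (NAMESPACE and POD are placeholders, not values from this report;
   # `java -version` writes to stderr, hence the 2>&1):
   #   java_major "$(kubectl exec -n "$NAMESPACE" "$POD" -- java -version 2>&1)"
   ```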
   
   ### Ⅵ. Environment:
   
   - K8S Runtime and version: k3s (installed by ops)
   - OPS CLI version: 0.1.0-2409121919.dev
   - OS: Ubuntu 24.04 LTS
   - Kernel: 6.14.0-1015-nvidia
   - Hardware: Dell Pro Max GB10 (NVIDIA GB10 Grace CPU + Blackwell GPU)
   - Java version on host: OpenJDK 1.8.0_472
   - Cgroup version: v2 (enforced by kernel 6.14)

