[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694 ]

Nicolas Fraison edited comment on FLINK-35489 at 5/30/24 12:07 PM:
-------------------------------------------------------------------

Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong.

The issue we are facing is the JVM crashing after the 
[autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/]
 changes some memory settings:
{code:java}
Starting kubernetes-taskmanager as a console application on host 
flink-kafka-job-apache-right-taskmanager-1-1.
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: 
"result" with message agent load/premain call failed at 
src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart 
failed
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, 
Vv=VM code, C=native code)
V  [libjvm.so+0x78dee4]  jni_FatalError+0x70
V  [libjvm.so+0x88df00]  JvmtiExport::post_vm_initialized()+0x240
V  [libjvm.so+0xc353fc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac
V  [libjvm.so+0x79c05c]  JNI_CreateJavaVM+0x7c
C  [libjli.so+0x3b2c]  JavaMain+0x7c
C  [libjli.so+0x7fdc]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7624]  start_thread+0x184 {code}
Seeing this big increase of heap (from 1.5 GB to more than 3 GB) and the fact 
that memory.managed.size was shrunk to 0b made me think that it was linked to 
missing off-heap memory.

But you are right that jvm-overhead already reserves some memory for the off-heap 
(and we indeed have around 400 MB with that config).

So looking back at the new config, I've identified the issue: the jvm-metaspace 
was shrunk to 22 MB while it was previously set to 256 MB.
I've done a test increasing this parameter and the TM is now able to start.

For the metaspace size computation, I can see the autotuning computing 
METASPACE_MEMORY_USED=1.41521584E8 (about 135 MB), which seems to be an 
appropriate metaspace sizing.

But due to the memBudget management it ends up setting only 22 MB for the 
metaspace ([the remaining memory is first allocated to the heap, then the new 
remainder to the metaspace, and finally to managed 
memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130]), 
as sketched below.
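
To make that ordering concrete, here is a minimal sketch (illustrative variable 
names and a hand-computed budget derived from the tuned config in the issue 
description below, not the actual MemoryTuning implementation):
{code:java}
// Minimal sketch of the allocation order described above. The budget is the
// 4GB pod minus jvm-overhead, network and framework off-heap; variable names
// are illustrative, not taken from MemoryTuning.java.
public final class MemBudgetSketch {

    public static void main(String[] args) {
        long budget = 3_722_894_625L;          // memory left for heap/metaspace/managed
        long desiredHeap = 3_699_934_605L;     // heap target computed by the autotuning
        long desiredMetaspace = 141_521_584L;  // METASPACE_MEMORY_USED observed above

        long heap = Math.min(desiredHeap, budget);
        budget -= heap;                        // heap is served first

        long metaspace = Math.min(desiredMetaspace, budget);
        budget -= metaspace;                   // metaspace only gets the leftover

        long managed = budget;                 // managed memory gets whatever remains

        // prints heap=3699934605 metaspace=22960020 managed=0
        System.out.printf("heap=%d metaspace=%d managed=%d%n", heap, metaspace, managed);
    }
} {code}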

 


> Metaspace size can be too little after autotuning change memory setting
> -----------------------------------------------------------------------
>
>                 Key: FLINK-35489
>                 URL: https://issues.apache.org/jira/browse/FLINK-35489
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: 1.8.0
>            Reporter: Nicolas Fraison
>            Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the below 
> config:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM 
> (the heap being scaled by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config that was computed by the autotuning for a TM running on a 
> 4GB pod:
> {code:java}
>     taskmanager.memory.network.max: 4063232b
>     taskmanager.memory.network.min: 4063232b
>     taskmanager.memory.jvm-overhead.max: 433791712b
>     taskmanager.memory.task.heap.size: 3699934605b
>     taskmanager.memory.framework.off-heap.size: 134217728b
>     taskmanager.memory.jvm-metaspace.size: 22960020b
>     taskmanager.memory.framework.heap.size: "0 bytes"
>     taskmanager.memory.flink.size: 3838215565b
>     taskmanager.memory.managed.size: 0b {code}
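> As a sanity check (hand arithmetic from the values above, not operator output), 
> these components sum almost exactly to the 4GB pod, with the heap target 
> absorbing nearly the whole budget:
> {code:java}
> // total process memory check:
> //   3838215565 (flink.size) + 22960020 (metaspace) + 433791712 (overhead)
> //   = 4294967297 bytes, i.e. ~4GiB
> // flink.size breakdown, showing the heap dominates:
> //   4063232 (network) + 3699934605 (task heap) + 134217728 (framework off-heap)
> //   + 0 (framework heap) + 0 (managed) = 3838215565 bytes {code}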
> This has led to issues starting the TM because we rely on a javaagent that 
> performs some memory allocation outside of the JVM (relying on some C 
> bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have 
> helped for that particular event, but this can lead to other issues as it 
> could lead to too small a heap size being computed.
> It would be interesting to be able to set a min memory.managed.size to be 
> taken into account by the autotuning.
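> For example, a hypothetical option like the below (illustrative name, not an 
> existing config key) could enforce such a floor:
> {code:java}
> # hypothetical, for illustration only: minimum managed memory kept by autotuning
> job.autoscaler.memory.tuning.managed.min: 256mb {code}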
> What do you think about this? Do you think that some other specific config 
> should have been applied to avoid this issue?
>  
> Edit: see this comment, which identifies the metaspace issue: 
> https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694


