[
https://issues.apache.org/jira/browse/FLINK-21552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293477#comment-17293477
]
Dian Fu commented on FLINK-21552:
---------------------------------
Thanks [~xintongsong] for the analysis. Agree with you. I think we should
handle properly the case when exception was thrown in initializer.apply().
cc [~sewen]
> The managed memory was not released if exception was thrown in
> createPythonExecutionEnvironment
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-21552
> URL: https://issues.apache.org/jira/browse/FLINK-21552
> Project: Flink
> Issue Type: Bug
> Components: API / Python, Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Xintong Song
> Priority: Major
> Fix For: 1.13.0, 1.12.3
>
>
> If there is exception thrown in
> [createPythonExecutionEnvironment|https://github.com/apache/flink/blob/3796e59f79a90bd8ad5e6fc37458e2d6cce23139/flink-python/src/main/java/org/apache/flink/streaming/api/runners/python/beam/BeamPythonFunctionRunner.java#L248],
> the job will failed with the following exception:
> {code:java}
> org.apache.flink.runtime.memory.MemoryAllocationException: Could not created
> the shared memory resource of size 611948962. Not enough memory left to
> reserve from the slot's managed memory.
> at
> org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:536)
> at
> org.apache.flink.runtime.memory.SharedResources.createResource(SharedResources.java:126)
> at
> org.apache.flink.runtime.memory.SharedResources.getOrAllocateSharedResource(SharedResources.java:72)
> at
> org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:555)
> at
> org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:250)
> at
> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:113)
> at
> org.apache.flink.table.runtime.operators.python.AbstractStatelessFunctionOperator.open(AbstractStatelessFunctionOperator.java:116)
> at
> org.apache.flink.table.runtime.operators.python.scalar.AbstractPythonScalarFunctionOperator.open(AbstractPythonScalarFunctionOperator.java:88)
> at
> org.apache.flink.table.runtime.operators.python.scalar.AbstractRowDataPythonScalarFunctionOperator.open(AbstractRowDataPythonScalarFunctionOperator.java:70)
> at
> org.apache.flink.table.runtime.operators.python.scalar.RowDataPythonScalarFunctionOperator.open(RowDataPythonScalarFunctionOperator.java:59)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:428)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:543)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:533)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
> at java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could
> not allocate 611948962 bytes, only 0 bytes are remaining. This usually
> indicates that you are requesting more memory than you have reserved.
> However, when running an old JVM version it can also be caused by slow
> garbage collection. Try to upgrade to Java 8u72 or higher if running on an
> old Java version.
> at
> org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:170)
> at
> org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:84)
> at
> org.apache.flink.runtime.memory.MemoryManager.reserveMemory(MemoryManager.java:423)
> at
> org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:534)
> ... 17 more
> {code}
> The reason is that the reserved managed memory was not added back to the
> MemoryManager when Job failed because of exceptions thrown in
> createPythonExecutionEnvironment. This causes that there is no managed memory
> to allocate during failover.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)