[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2021-04-23 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330183#comment-17330183
 ] 

Till Rohrmann commented on FLINK-13958:
---

Might have been partially solved via FLINK-16408.

> Job class loader may not be reused after batch job recovery
> ---
>
> Key: FLINK-13958
> URL: https://issues.apache.org/jira/browse/FLINK-13958
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.9.0
> Reporter: David Morávek
> Priority: Major
> Labels: stale-major
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]
> 1) We have a per-job Flink cluster.
> 2) We use BATCH execution mode with the region failover strategy.
> Point 1) should imply a single user code class loader per task manager (because 
> there is only a single pipeline, which reuses the class loader cached in 
> BlobLibraryCacheManager). We need this property because we have UDFs that 
> access C libraries using JNI (I think this may be a fairly common use case when 
> dealing with legacy code). [JDK 
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
>  make sure that a given library can be loaded by only a single class loader 
> per JVM.
> When region recovery is triggered, vertices that need to recover are first reset 
> back to the CREATED state and then rescheduled. If all tasks in a task 
> manager are reset, this results in the [cached class loader being 
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
>  This unfortunately causes a job failure, because we try to reload a native 
> library in a newly created class loader.
> I believe the correct approach would be not to release the cached class loader if 
> the job is recovering, even though there are no tasks currently registered 
> with the TM.
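
The approach proposed in the description (keep the cached class loader while the job is recovering, even with no tasks registered) could be sketched roughly as below. This is only an illustration under stated assumptions: `RecoveryAwareLoaderCache`, `register`, `unregister`, `setRecovering` and `isCached` are hypothetical names, not Flink's actual BlobLibraryCacheManager API, and how the task manager would learn that a job is recovering is left open.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch, not Flink code: a reference-counted cache that keeps
// an entry alive while its job is flagged as recovering.
final class RecoveryAwareLoaderCache<K, L> {

    private static final class Entry<L> {
        final L loader;
        int refs;           // number of currently registered tasks
        boolean recovering; // assumption: set by whoever drives recovery

        Entry(L loader) {
            this.loader = loader;
        }
    }

    private final Map<K, Entry<L>> entries = new HashMap<>();

    // Registers one task for the job and returns the cached loader,
    // creating it only if no entry exists yet.
    public synchronized L register(K jobId, Supplier<L> factory) {
        Entry<L> entry = entries.computeIfAbsent(jobId, k -> new Entry<>(factory.get()));
        entry.refs++;
        return entry.loader;
    }

    // Unregisters one task. The entry is released only when no tasks remain
    // AND the job is not recovering.
    public synchronized void unregister(K jobId) {
        Entry<L> entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        if (entry.refs > 0) {
            entry.refs--;
        }
        if (entry.refs == 0 && !entry.recovering) {
            entries.remove(jobId);
        }
    }

    // Marks the job as recovering or recovered. Ending recovery with no
    // registered tasks releases the entry.
    public synchronized void setRecovering(K jobId, boolean recovering) {
        Entry<L> entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        entry.recovering = recovering;
        if (!recovering && entry.refs == 0) {
            entries.remove(jobId);
        }
    }

    public synchronized boolean isCached(K jobId) {
        return entries.containsKey(jobId);
    }
}
```

With this shape, a region failover that resets every task on a task manager would drop the reference count to zero without discarding the loader, so a rescheduled task would get the same class loader back and would not try to load the native library a second time.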



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328301#comment-17328301
 ] 

Flink Jira Bot commented on FLINK-13958:


This major issue is unassigned, and neither it nor any of its sub-tasks has 
been updated for 30 days, so it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925516#comment-16925516
 ] 

Stephan Ewen commented on FLINK-13958:
--

We had a similar problem initially with RocksDB.

The strange thing is that the JVM can load a library once under a given file 
name only, but multiple times under different file names. So in RocksDB, we 
rename the library file to something random and then it can be linked multiple 
times.
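
A minimal sketch of the renaming idea, for illustration only: the class and method names (`UniqueNativeLibraryLoader`, `copyToUniqueName`, `loadUnderUniqueName`) are made up here and this is not the actual RocksDB loading code. It copies the library file to a randomly named file and loads that copy with `System.load`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

// Hypothetical helper, not taken from Flink or RocksDB.
final class UniqueNativeLibraryLoader {

    private UniqueNativeLibraryLoader() {}

    // Copies the library into tempDir under a random, unique file name and
    // returns the path of the copy.
    public static Path copyToUniqueName(Path library, Path tempDir) throws IOException {
        String originalName = library.getFileName().toString();
        Path copy = tempDir.resolve(UUID.randomUUID() + "-" + originalName);
        return Files.copy(library, copy, StandardCopyOption.REPLACE_EXISTING);
    }

    // Loads the library from a uniquely named copy, so the JVM sees a file
    // name it has not loaded before.
    public static void loadUnderUniqueName(Path library, Path tempDir) throws IOException {
        Path copy = copyToUniqueName(library, tempDir);
        // System.load requires an absolute path. The library is associated
        // with the class loader of the calling class, so this helper would
        // have to be loaded by the user code class loader itself.
        System.load(copy.toAbsolutePath().toString());
    }
}
```

Each copy stays on disk until it is deleted, so a real implementation would also need some cleanup of the temporary files.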



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923320#comment-16923320
 ] 

Till Rohrmann commented on FLINK-13958:
---

I think you are describing a valid problem here. Unfortunately, I don't have 
a good idea for a general solution at the moment. 

For the per-job mode, it could mean not creating a new user code class loader. 
There have been ideas to bind the user code class loader to the lifecycle of a 
slot: as long as the slot is still allocated to a {{JobMaster}}, the 
system should not free the class loader. However, this would not solve all 
problems either, because the {{TaskExecutor}} could lose its connection to the 
{{JobMaster}}, which causes the slot to be freed. After reconnecting to the 
{{JobMaster}}, it would then create a new class loader.

For the session mode, I think it is super tricky, because the system could try to 
deploy tasks belonging to two jobs to the same {{TaskExecutor}}, both of which 
try to load the same C library.



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-05 Thread David Moravek (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923264#comment-16923264
 ] 

David Moravek commented on FLINK-13958:
---

Hi Till, we're using attached mode on Yarn. We'll try the detached mode instead 
and let you know. Anyway, do you think this is an issue worth fixing in 
general? If so, what would be the correct approach?



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923160#comment-16923160
 ] 

Till Rohrmann commented on FLINK-13958:
---

Thanks for reporting this issue [~davidmoravek]. I think this issue should also 
arise with any other restart settings, and also with streaming, if I'm not 
mistaken.

A quick question concerning the per-job mode: are you using the per-job mode on 
Yarn? If yes, do you submit the job in detached or attached mode? If it is 
the latter, then Flink actually deploys a session cluster underneath, for 
legacy reasons. I'm asking because at the moment, the per-job mode 
(submitting a job in detached mode on Yarn or using the container per-job mode) 
should place all dependencies on the system class path (this has other problems, 
as it does not support child-first class loading at the moment).



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-04 Thread David Moravek (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922554#comment-16922554
 ] 

David Moravek commented on FLINK-13958:
---

[~1u0] It's an unrelated issue. The behavior you are describing is expected when 
you submit multiple jobs to the same cluster. The only option in that case is to 
work around it by using the system class loader (it's technically not a workaround).



[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

2019-09-04 Thread Alex (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922545#comment-16922545
 ] 

Alex commented on FLINK-13958:
--

I think this has the same root cause as FLINK-11402: specifically, that we 
cannot load a native library more than once in the same JVM process.
