[ https://issues.apache.org/jira/browse/BEAM-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578840#comment-16578840 ]

Ankur Goenka edited comment on BEAM-5110 at 8/13/18 7:56 PM:
-------------------------------------------------------------

I agree. We should provide control over the number of SDKHarness container 
instances, especially since the Python container can only utilize a single 
core at a time.

Based on the discussion, we can break the whole container management down 
into two parts:
 # How to start and manage the SDKHarness (using Kubernetes, a process-based 
approach, a singleton factory, etc.).
 # Configuration for managing the SDKHarness (number of containers, type of 
containers, etc.).

This bug tries to address the first part using a singleton factory, which 
apparently is not behaving as expected in the current code base. We want to 
explore a more robust way of managing containers, such as Kubernetes, but 
that will take some time to add and also imposes more infrastructure 
requirements, so the singleton container seems useful for now.

Shall we track the second part separately as "Multiple SDKHarness with 
singleton container manager"?



> Reconcile Flink JVM singleton management with deployment
> --------------------------------------------------------
>
>                 Key: BEAM-5110
>                 URL: https://issues.apache.org/jira/browse/BEAM-5110
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Ben Sidhom
>            Assignee: Ben Sidhom
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [~angoenka] noticed through debugging that multiple instances of 
> BatchFlinkExecutableStageContext.BatchFactory are loaded for a given job when 
> executing in standalone cluster mode. This context factory is responsible for 
> maintaining singleton state across a TaskManager (JVM) in order to share SDK 
> Environments across workers in a given job. The multiple-loading breaks 
> singleton semantics and results in an indeterminate number of Environments 
> being created.
> It turns out that the [Flink classloading 
> mechanism|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/debugging_classloading.html]
>  is determined by deployment mode. Note that "user code" as referenced by 
> this link is actually the Flink job server jar. Actual end-user code lives 
> inside of the SDK Environment and uploaded artifacts.
> In order to maintain singletons without resorting to IPC (for example, using 
> file locks and/or additional gRPC servers), we need to force non-dynamic 
> classloading. For example, this happens when jobs are submitted to YARN for 
> one-off deployments via `flink run`. However, connecting to an existing 
> (Flink standalone) deployment results in dynamic classloading.
> We should investigate this behavior and either document (and attempt to 
> enforce) deployment modes that are consistent with our requirements, or (if 
> possible) create a custom classloader that enforces singleton loading.
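The multiple-loading failure mode described above can be reproduced with a small standalone sketch (class names are hypothetical; this is not Flink's actual classloading code). Two isolated `URLClassLoader`s each define their own copy of a class, so a `static` "singleton" field exists once per classloader rather than once per JVM:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Demonstrates why static singletons break under dynamic classloading:
// each isolated classloader defines its own copy of the class, and
// therefore its own copy of every static field.
public class SingletonLoadingDemo {
    // Intended to be one instance per JVM; in fact, one per classloader.
    public static final Object INSTANCE = new Object();

    // Returns true only if both loaders observe the same singleton instance.
    public static boolean sharedAcrossLoaders() throws Exception {
        // Where this class was loaded from (a directory or jar on disk).
        URL location = SingletonLoadingDemo.class
                .getProtectionDomain().getCodeSource().getLocation();
        // Two isolated loaders (parent = null), analogous to Flink creating a
        // fresh user-code classloader per job submission in standalone mode.
        try (URLClassLoader a = new URLClassLoader(new URL[] {location}, null);
             URLClassLoader b = new URLClassLoader(new URL[] {location}, null)) {
            Object ia = a.loadClass("SingletonLoadingDemo").getField("INSTANCE").get(null);
            Object ib = b.loadClass("SingletonLoadingDemo").getField("INSTANCE").get(null);
            return ia == ib;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sharedAcrossLoaders());
    }
}
```

With non-dynamic classloading (a single classloader defining the class once), the comparison would instead be true, which is why forcing that mode makes the singleton factory safe.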



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
