[ 
https://issues.apache.org/jira/browse/FLINK-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880502#comment-16880502
 ] 

Zhenqiu Huang commented on FLINK-13132:
---------------------------------------

[~maguowei] [~till.rohrmann]
The reason I want to propose this change is that we manage 1000+ 
production jobs for our customers. As a platform, our SLA requires us to 
restart all of the jobs within 5 minutes, whether the cause is cluster 
maintenance or a transient infrastructure failure.

Besides this, we are moving to a hybrid cloud architecture, so we need to run 
jobs on YARN and on another internal cluster management scheduler that sits on 
top of Mesos and, in the future, k8s. Currently, our customers upload jars to a 
storage management layer that spans multiple storage systems (internal and 
cloud), so that the jars can be accessed securely and efficiently in both 
environments. The cost we see in downloading the jar into the service and 
generating the JobGraph on the client side is that we need to start a process 
for each new job within the service. According to our profiling, this does not 
scale well: a large number of service instances are needed for the worst case, 
even though the regular load is only 10 QPS. Thus, we want to further optimize 
job submission by pushing job graph generation onto the ClusterEntrypoint. This 
way, the average job submission time will be reduced while using far fewer 
resources.

> Allow ClusterEntrypoints use user main method to generate job graph
> -------------------------------------------------------------------
>
>                 Key: FLINK-13132
>                 URL: https://issues.apache.org/jira/browse/FLINK-13132
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Minor
>
> We are building a service that can transparently deploy a job to different 
> cluster management systems, such as YARN and another internal system. It is 
> very costly to download the jar and generate the JobGraph on the client side. 
> Thus, I want to propose an improvement that makes the YARN entrypoints 
> configurable to use either FileJobGraphRetriever or 
> ClassPathJobGraphRetriever. This is actually a long-standing TODO in 
> AbstractYarnClusterDescriptor at line 834.
> https://github.com/apache/flink/blob/21468e0050dc5f97de5cfe39885e0d3fd648e399/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L834



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
