[ https://issues.apache.org/jira/browse/FLINK-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884819#comment-16884819 ]
Zhenqiu Huang commented on FLINK-13132:
---------------------------------------

[~fly_in_gis] Our main concern about the cost is pipeline downtime. In our current design, downloads always come from the nearest storage, such as HDFS on-premises and S3/GCS in the cloud. But we think that is not enough to guarantee our SLA in the worst case. When a job is started from the service, the end-to-end latency includes downloading the jars, starting another process, starting the session client, uploading remote resources, starting the job cluster, submitting the job graph to start the job, and so on. This usually takes 1-2 minutes at low QPS. If a burst of requests (1000 requests) arrives because of some unexpected issue, some of the redeployment requests will be much slower due to resource contention at each stage of job submission.

The optimization we want is to skip some of these steps (such as uploading remote resources and generating the job graph) on the service side, and to move job graph compilation into the ClusterEntrypoints. This way the jar download can be skipped, and job graph generation is parallelized across jobs, each running right after its cluster starts, so that even in the worst case we can guarantee our downtime SLA.

> Allow ClusterEntrypoints use user main method to generate job graph
> -------------------------------------------------------------------
>
>                 Key: FLINK-13132
>                 URL: https://issues.apache.org/jira/browse/FLINK-13132
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.8.0, 1.8.1
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Minor
>
> We are building a service that can transparently deploy a job to different
> cluster management systems, such as Yarn and another internal system. It is
> very costly to download the jar and generate the JobGraph on the client side.
> Thus, I want to propose an improvement that makes the Yarn entrypoints
> configurable to use either FileJobGraphRetriever or ClassPathJobGraphRetriever.
> It is actually a long-standing TODO in AbstractYarnClusterDescriptor at line 834.
> https://github.com/apache/flink/blob/21468e0050dc5f97de5cfe39885e0d3fd648e399/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L834



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)