[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943801#comment-16943801 ]

Kenneth Knowles commented on BEAM-8183:

OK, even after browsing the thread (I don't have time to pick it all up) I don't quite understand the context. Ankur clarified what I meant to describe. All of the above forms of flexibility are in use and important. I would guess multi-pipeline is the least important, since that is a pretty trivial convenience.

> Optionally bundle multiple pipelines into a single Flink jar
>
> Key: BEAM-8183
> URL: https://issues.apache.org/jira/browse/BEAM-8183
> Project: Beam
> Issue Type: New Feature
> Components: runner-flink
> Reporter: Kyle Weaver
> Assignee: Kyle Weaver
> Priority: Major
> Labels: portability-flink
>
> https://github.com/apache/beam/pull/9331#issuecomment-526734851
> "With Flink you can bundle multiple entry points into the same jar file and
> specify which one to use with optional flags. It may be desirable to allow
> inclusion of multiple pipelines for this tool also, although that would
> require a different workflow. Absent this option, it becomes quite convoluted
> for users that need the flexibility to choose which pipeline to launch at
> submission time."

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943638#comment-16943638 ]

Thomas Weise commented on BEAM-8183:

BEAM-8115 is indeed orthogonal, and applicable for the cases where parameterization can be solved without a different execution path in the driver program. For the remaining cases, bundling multiple protos could be the solution.
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943213#comment-16943213 ]

Kyle Weaver commented on BEAM-8183:

For pipeline options, I filed https://issues.apache.org/jira/browse/BEAM-8115, which should be mostly orthogonal to the issue of multiple pipelines.

Also, if we do decide to support multi-pipeline jars, we might just do it in Java, because it will require some Java changes anyway, and Java's jar utility libraries are a lot less clunky than shell.
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943201#comment-16943201 ]

Ankur Goenka commented on BEAM-8183:

I agree, changing a bit of configuration in the proto will serve a lot of use cases. A few examples would be the input/output data files etc.

{quote}
You are correct that the Python entry point / driver program would need to be (re)executed for a fully generic solution. But that's not necessary for the majority of use cases. Those are artifact + configuration. If there is a way to parameterize configuration values in the proto, we can address that majority of use cases with a single job jar artifact.
{quote}

Will [value_provider|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/value_provider.py] help in this case? Dataflow templates use this. Also, we can enhance the driver class to swap the actual option values in the options proto with parameters provided at submission time.

{quote}
But beyond that we also have (in our infrastructure) the use case of multiple entry points that the user can pick at submit time.
{quote}

That's a valid use case. I can't imagine a good way to model it in Beam, as all the Beam notions are built around a single pipeline at a time. Would a shell script capable of merging the jars for the different pipelines work? I think a pipeline Docker container could resolve a lot of these issues, as it would be capable of running the submission code in a consistent manner based on the arguments provided.
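The value_provider idea above can be illustrated with a minimal stand-in sketch. This is plain Python showing the pattern (an option placeholder baked into the artifact at build time, resolved with a concrete value only at submission time), not the actual apache_beam ValueProvider API; the class and names here are illustrative.

```python
# Stand-in sketch of the ValueProvider pattern (illustrative only, not the
# apache_beam API): the build artifact embeds a placeholder, and the driver
# swaps in the environment-specific value at submission time.

class DeferredOption:
    """Placeholder for a pipeline option whose value arrives at submit time."""

    def __init__(self, name, default=None):
        self.name = name
        self._value = default
        self._resolved = default is not None

    def resolve(self, value):
        """Called by the driver at submission time with the concrete value."""
        self._value = value
        self._resolved = True

    def get(self):
        if not self._resolved:
            raise RuntimeError(f"option {self.name!r} not resolved yet")
        return self._value


# Build time: the jar/proto embeds only the placeholder.
input_path = DeferredOption("input")

# Submission time: the driver resolves it for the target environment.
input_path.resolve("gs://staging-bucket/events")
print(input_path.get())
```

The same artifact can then be submitted to dev/staging/prod with different resolved values, which is the single-job-jar scenario discussed above.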
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943170#comment-16943170 ]

Thomas Weise commented on BEAM-8183:

{quote}
The major issue to me seems to be that we need to execute pipeline construction code which is environment dependent. To generate new pipelines for an environment, we need to execute the pipeline submission code in that environment. And this is where I see a problem. Python pipelines have to execute user code in python using python sdk to construct the pipeline.
{quote}

You are correct that the Python entry point / driver program would need to be (re)executed for a fully generic solution. But that's not necessary for the majority of use cases. Those are artifact + configuration. If there is a way to parameterize configuration values in the proto, we can address that majority of use cases with a single job jar artifact.

My fallback for the exception path would be to generate multiple protos into a single jar, which is why I'm interested in this capability. So that jar would contain "mypipeline_staging" and "mypipeline_production", and the deployment would select the pipeline via its configuration (a parameter to the Flink entry point). Similar would work for Spark.

But beyond that we also have (in our infrastructure) the use case of multiple entry points that the user can pick at submit time.
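The fallback described above (several materialized protos in one jar, chosen at launch by a parameter) can be sketched roughly as follows. A real implementation would live in the Java Flink entry point; this Python sketch uses the fact that a jar is just a zip archive, and the `BEAM-PIPELINE/<name>.pb` entry layout is a hypothetical convention, not an existing Beam one.

```python
# Hypothetical sketch: one jar bundles several materialized pipeline protos,
# and a launch-time parameter selects which one to run.
import io
import zipfile


def read_bundled_pipeline(jar_path, pipeline_name):
    """Return the bytes of the BEAM-PIPELINE/<name>.pb entry from the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return jar.read(f"BEAM-PIPELINE/{pipeline_name}.pb")


# Demo with an in-memory "jar" holding a staging and a production proto.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("BEAM-PIPELINE/mypipeline_staging.pb", b"<staging proto>")
    jar.writestr("BEAM-PIPELINE/mypipeline_production.pb", b"<production proto>")

# The deployment passes e.g. "mypipeline_production" as an entry-point flag.
print(read_bundled_pipeline(buf, "mypipeline_production"))
```

The entry point would then deserialize the selected proto and hand it to the runner, so one build artifact serves both staging and production.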
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943161#comment-16943161 ]

Ankur Goenka commented on BEAM-8183:

I see. Thanks for explaining the use case. I think hardcoded pipeline options are definitely a limitation as of now. We can see how we can use Beam's ValueProvider to give dynamic arguments. We can also think of overwriting the pipeline options when submitting the jar to Flink.

{quote}
Running the same pipeline in different environments with different parameters is a common need. Virtually everyone has dev/staging/prod or whatever their environments are and they want to use the same build artifact. That normally requires some amount of parameterization.
{quote}

I don't really have a good solution for the dev/staging/prod use case. This is not going to be solved by a jar with multiple pipelines (as each pipeline will have a static set of pipeline options) but by a jar creating dynamic pipelines (as the pipeline changes based on the pipeline options and environment).

The major issue to me seems to be that we need to execute pipeline construction code which is environment dependent. To generate new pipelines for an environment, we need to execute the pipeline submission code in that environment. And this is where I see a problem: Python pipelines have to execute user code in Python using the Python SDK to construct the pipeline. Considering this jar as the artifact would not be ideal across environments, as the actual SDK/libs etc. can differ between environments. From an environment point of view, a Docker container capable of submitting the pipeline should be the artifact, as it has all the dependencies bundled in it and is capable of executing code with consistent dependencies. And if we don't want consistent dependencies across environments, then the pipeline code should be considered the artifact, as it can work with different dependencies.
For context, in Dataflow we pack multiple pipelines into a single jar for Java, and for Python we generate a separate par for each pipeline (we do publish them as a single mpm). Further, this does not materialize the pipeline but creates an executable which is later used in an environment having the right SDK installed. The submission process just runs "python test_pipeline.par --runner=DataflowRunner --apiary=testapiary", which goes through the Dataflow job submission API and is submitted as a regular Dataflow job. This is similar to the Docker model, just that instead of Docker we use a par file and execute it using Python/Java.

{quote}
The other use case is bundling multiple pipelines into the same container and selecting which to run at launch time.
{quote}

This will save some space at the time of deployment, specifically the job server jar and the staged pipeline artifacts if they are shared. We don't really introspect the staged artifacts, so we don't know what can and can't be shared across pipelines. I think a better approach would be to just write a separate script to merge multiple pipeline jars (jars with a single pipeline each) and replace the main class with one that uses the name of the pipeline to pick the right proto. The script can be infrastructure aware and can make the appropriate lib changes. Beam does not have a notion of multiple pipelines in any sense, so it will be interesting to see how we model this if we decide to introduce it in Beam.

Note: as the pipelines are materialized, they will still not work across environments. Please let me know if you have any ideas for solving this.
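The merge-script idea above can be sketched as follows: combine several single-pipeline jars into one, keep only the first copy of shared entries (job server classes, shared staged artifacts), and rename each jar's pipeline proto so an entry point can pick one by name at launch. The `pipeline.pb` and `pipelines/<name>.pb` entry names are assumptions for illustration, and deduplication here is by entry name only, matching the caveat that staged artifacts are not introspected.

```python
# Hypothetical merge of single-pipeline jars into one multi-pipeline jar.
import io
import zipfile


def merge_pipeline_jars(named_jars, out):
    """named_jars: {pipeline_name: readable jar file}; out: writable file."""
    seen = set()
    with zipfile.ZipFile(out, "w") as merged:
        for name, jar_file in named_jars.items():
            with zipfile.ZipFile(jar_file) as jar:
                for entry in jar.namelist():
                    data = jar.read(entry)
                    if entry == "pipeline.pb":
                        # Each jar's proto gets a unique, selectable name.
                        merged.writestr(f"pipelines/{name}.pb", data)
                    elif entry not in seen:
                        # Shared entries (job server, libs) are kept once.
                        seen.add(entry)
                        merged.writestr(entry, data)


def make_jar(proto):
    """Build an in-memory single-pipeline jar for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("pipeline.pb", proto)
        jar.writestr("lib/jobserver.class", b"<shared>")
    return buf


out = io.BytesIO()
merge_pipeline_jars({"wordcount": make_jar(b"<wc>"), "etl": make_jar(b"<etl>")}, out)
with zipfile.ZipFile(out) as merged:
    print(sorted(merged.namelist()))
```

A replacement main class would then take the pipeline name as an argument and load `pipelines/<name>.pb`, analogous to Flink's own multiple-entry-point selection.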
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943015#comment-16943015 ]

Thomas Weise commented on BEAM-8183:

For context: https://lists.apache.org/thread.html/2122928a0a5f678d475ec15af538eb7303f73557870af174b1fdef7e@%3Cdev.beam.apache.org%3E

Running the same pipeline in different environments with different parameters is a common need. Virtually everyone has dev/staging/prod or whatever their environments are, and they want to use the same build artifact. That normally requires some amount of parameterization. The other use case is bundling multiple pipelines into the same container and selecting which to run at launch time.

I was surprised about the question given the prior discussion, even more so considering that Beam already has the concept of user options. The approach of generating the jar file currently is equivalent to hard coding all pipeline options and asking the user to recompile. Yes, we could generate a new jar file for every option or environment, but please note that it bloats the container images (the job server is > 100 MB). We could also create separate Docker images, but now we are in the GB range.
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942974#comment-16942974 ]

Kenneth Knowles commented on BEAM-8183:

I may not understand the issue here, but I can share that for internal Dataflow testing we definitely make a huge jar with all the test pipelines in it.
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941330#comment-16941330 ]

Ankur Goenka commented on BEAM-8183:

Flink has a neat feature of picking the pipeline on the fly. I don't think this is a very common use case with Beam, though. Given that the Beam job submission API works on a single pipeline at a time, it would be a very convoluted workflow to introduce multiple pipelines in a single jar.

Would it be easier to just store the pipelines as separate jars in global storage (HDFS etc.) and pass the right jar at the time of pipeline submission? In case the submission happens through a service, would it be easier and less error prone to just keep these jars separately on the service and submit the right jar to Flink based on the parameter?
[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar
[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939795#comment-16939795 ]

Thomas Weise commented on BEAM-8183:

Hi Kyle, I have set up my Flink image build to use the jar file runner, but currently it can only bundle one pipeline into the job jar. It would need support for multiple Python entry points / different configurations. Just wanted to check how soon you are planning to work on this?