[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-03 Thread Kenneth Knowles (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943801#comment-16943801 ]

Kenneth Knowles commented on BEAM-8183:
---

OK, even after browsing the thread (I don't have time to pick it all up) I 
don't quite understand the context. Ankur clarified what I meant to describe. 
All of the above forms of flexibility are in use and important. I would guess 
multi-pipeline is the least important since that is a pretty trivial 
convenience.

> Optionally bundle multiple pipelines into a single Flink jar
> 
>
> Key: BEAM-8183
> URL: https://issues.apache.org/jira/browse/BEAM-8183
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Labels: portability-flink
>
> [https://github.com/apache/beam/pull/9331#issuecomment-526734851]
> "With Flink you can bundle multiple entry points into the same jar file and 
> specify which one to use with optional flags. It may be desirable to allow 
> inclusion of multiple pipelines for this tool also, although that would 
> require a different workflow. Absent this option, it becomes quite convoluted 
> for users that need the flexibility to choose which pipeline to launch at 
> submission time."



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-03 Thread Thomas Weise (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943638#comment-16943638 ]

Thomas Weise commented on BEAM-8183:


BEAM-8115 is indeed orthogonal and applicable for the cases where 
parameterization can be solved without a different execution path in the driver 
program. For the remaining cases, bundling multiple protos could be the solution.



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Kyle Weaver (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943213#comment-16943213 ]

Kyle Weaver commented on BEAM-8183:
---

For pipeline options, I filed https://issues.apache.org/jira/browse/BEAM-8115, 
which should be mostly orthogonal to the issue of multiple pipelines.

Also, if we do decide to support multi-pipeline jars, we might just do it in 
Java, because it will require some Java changes anyway, and Java's jar utility 
libraries are a lot less clunky than shell.



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Ankur Goenka (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943201#comment-16943201 ]

Ankur Goenka commented on BEAM-8183:


I agree, changing a bit of configuration in the proto will serve a lot of use 
cases, for example the input/output data files.
{quote} 
 You are correct that the Python entry point / driver program would need to be 
(re)executed for a fully generic solution. But that's not necessary for the 
majority of use cases. Those are artifact + configuration. If there is a way to 
parameterize configuration values in the proto, we can address that majority of 
use cases with a single job jar artifact.
{quote}
Will 
[value_provider|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/value_provider.py]
 help in this case? Dataflow templates use this.

Also, we can enhance the driver class to swap the actual option values in the 
options proto with parameters provided at submission time.
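As a rough illustration of what that swap could look like (this is a sketch, not an actual Beam API; the function name and the dict stand-in for the options proto are hypothetical), the driver could overlay submission-time `--key=value` arguments on the options baked into the jar at build time:

```python
def override_pipeline_options(baked_options, cli_args):
    """Overlay submission-time --key=value args on the options that were
    baked into the jar at build time. Returns a new merged dict."""
    overrides = {}
    for arg in cli_args:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            overrides[key] = value
    merged = dict(baked_options)  # baked defaults first
    merged.update(overrides)      # submission-time values win
    return merged

# Example: defaults baked at build time, overridden at submit time.
baked = {"input": "gs://bucket/dev/input", "runner": "FlinkRunner"}
merged = override_pipeline_options(baked, ["--input=gs://bucket/prod/input"])
```

In a real implementation the merged values would be written back into the pipeline options proto before the job is submitted.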
{quote} 
 But beyond that we also have (in our infrastructure) the use case of multiple 
entry points that the user can pick at submit time.
  
{quote}
 
That's a valid use case. I can't imagine a good way to model it in Beam, as all 
the Beam notions are built around a single pipeline at a time. Would a shell 
script capable of merging the jars for the different pipelines work?

I think a pipeline Docker image can resolve a lot of these issues, as it would 
be capable of running the submission code in a consistent manner based on the 
arguments provided.



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Thomas Weise (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943170#comment-16943170 ]

Thomas Weise commented on BEAM-8183:


{quote}The major issue to me seems to be that we need to execute pipeline 
construction code which is environment dependent. To generate new pipelines for 
an environment, we need to execute the pipeline submission code in that 
environment. And this is where I see a problem. Python pipelines have to 
execute user code in python using python sdk to construct the pipeline.
{quote}
You are correct that the Python entry point / driver program would need to be 
(re)executed for a fully generic solution. But that's not necessary for the 
majority of use cases. Those are artifact + configuration. If there is a way to 
parameterize configuration values in the proto, we can address that majority of 
use cases with a single job jar artifact.

My fallback for the exception path would be to generate multiple protos into a 
single jar, which is why I'm interested in this capability. So that jar would 
contain "mypipeline_staging" and "mypipeline_production" and the deployment 
would select the pipeline via its configuration (parameter to the Flink entry 
point). Similar would work for Spark.
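A minimal sketch of that selection step (the entry path `pipelines/<name>.pb` inside the jar is a hypothetical layout, not an existing Beam convention), reading the chosen pipeline proto out of the job jar by name:

```python
import zipfile

def load_pipeline_proto(jar_path, pipeline_name):
    """Read the serialized pipeline proto stored under the given name
    inside the job jar (e.g. 'mypipeline_staging')."""
    with zipfile.ZipFile(jar_path) as jar:
        return jar.read(f"pipelines/{pipeline_name}.pb")
```

The deployment would pass the pipeline name as a parameter to the Flink entry point, which then deserializes and runs the selected proto.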

But beyond that we also have (in our infrastructure) the use case of multiple 
entry points that the user can pick at submit time.



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Ankur Goenka (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943161#comment-16943161 ]

Ankur Goenka commented on BEAM-8183:


I see. Thanks for explaining the use case.

I think hardcoded pipeline options are definitely a limitation as of now. We 
can look at how we can use Beam's ValueProvider to supply dynamic arguments. We 
can also think about overwriting the pipeline options when submitting the jar to Flink.
{quote}Running the same pipeline in different environments with different 
parameters is a common need. Virtually everyone has dev/staging/prod or 
whatever their environments are and they want to use the same build artifact. 
That normally requires some amount of parameterization.
{quote}
I don't really have a good solution for the dev/staging/prod use case. It is not 
going to be solved by a jar with multiple pipelines (as each pipeline will have a 
static set of pipeline options), but by a jar creating dynamic pipelines (as the 
pipeline changes based on the pipeline options and environment).

The major issue to me seems to be that we need to execute pipeline construction 
code which is environment dependent. To generate new pipelines for an 
environment, we need to execute the pipeline submission code in that 
environment, and this is where I see a problem: Python pipelines have to 
execute user code in Python, using the Python SDK, to construct the pipeline.

Considering this jar as the artifact would not be ideal across different 
environments, as the actual SDK/libs etc. can differ between environments. From 
an environment point of view, a Docker container capable of submitting the 
pipeline should be the artifact, as it has all the dependencies bundled in it and 
is capable of executing code with consistent dependencies. And if we don't want 
consistent dependencies across environments, then the pipeline code should be 
considered the artifact, as it can work with different dependencies.

 

For context, in Dataflow we pack multiple pipelines into a single jar for Java, 
and for Python we generate a separate par file for each pipeline (we do publish 
them as a single mpm). Further, this does not materialize the pipeline but 
creates an executable, which is later used in an environment having the right 
SDK installed. The submission process just runs "python test_pipeline.par 
--runner=DataflowRunner --apiary=testapiary", which goes through the Dataflow 
job submission API and is submitted as a regular Dataflow job.

This is similar to the Docker model, except that instead of Docker we use a par 
file and execute it using Python/Java.
{quote}The other use case is bundling multiple pipelines into the same 
container and select which to run at launch time.
{quote}
This will save some space at deployment time, specifically the job server jar 
and the staged pipeline artifacts if they are shared. We don't really introspect 
the staged artifacts, so we don't know what can and can't be shared across 
pipelines. I think a better approach would be to just write a separate script to 
merge multiple pipeline jars (jars with a single pipeline each) and replace the 
main class to consider the name of the pipeline to pick the right proto. The 
script can be infrastructure-aware and can make the appropriate lib changes. 
Beam does not have a notion of multiple pipelines in any sense, so it will be 
interesting to see how we model this if we decide to introduce it in Beam.
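A minimal sketch of such a merge script (the per-jar entry name `pipeline.pb` and the combined layout `pipelines/<name>.pb` are hypothetical assumptions; it only copies the proto entries and leaves the main-class replacement aside):

```python
import os
import zipfile

def merge_pipeline_jars(single_jars, out_jar):
    """Copy the pipeline proto entry from each single-pipeline jar into
    one combined jar, keyed by the source jar's base name."""
    with zipfile.ZipFile(out_jar, "w") as out:
        for jar_path in single_jars:
            name = os.path.splitext(os.path.basename(jar_path))[0]
            with zipfile.ZipFile(jar_path) as src:
                # Assumes each single-pipeline jar stores its proto
                # at a fixed path; this entry name is hypothetical.
                proto = src.read("pipeline.pb")
            out.writestr(f"pipelines/{name}.pb", proto)
```

A real script would also have to merge the Java class entries and rewrite the manifest's main class, which is where the infrastructure-specific logic would live.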

Note: As the pipelines are materialized, they will still not work across 
environments.

 

Please let me know if you have any ideas for solving this.

 



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Thomas Weise (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943015#comment-16943015 ]

Thomas Weise commented on BEAM-8183:


For context: 
[https://lists.apache.org/thread.html/2122928a0a5f678d475ec15af538eb7303f73557870af174b1fdef7e@%3Cdev.beam.apache.org%3E]

Running the same pipeline in different environments with different parameters 
is a common need. Virtually everyone has dev/staging/prod or whatever their 
environments are and they want to use the same build artifact. That normally 
requires some amount of parameterization.

The other use case is bundling multiple pipelines into the same container and 
select which to run at launch time.

I was surprised by the question given the prior discussion, even more so 
considering that Beam already has the concept of user options. The approach of 
generating the jar file currently is equivalent to hard-coding all pipeline 
options and asking the user to recompile.

Yes, we could generate a new jar file for every option or environment, but 
please note that it bloats the container images (the job server is > 100MB). We 
could also create separate Docker images, but then we are in the GB range.

 



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-10-02 Thread Kenneth Knowles (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942974#comment-16942974 ]

Kenneth Knowles commented on BEAM-8183:
---

I may not understand the issue here, but I can share that for internal Dataflow 
testing we definitely make a huge jar with all the test pipelines in it.



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-09-30 Thread Ankur Goenka (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941330#comment-16941330 ]

Ankur Goenka commented on BEAM-8183:


Flink has a neat feature of picking the pipeline on the fly.

I don't think this is a very common use case with Beam, though.

Given that the Beam job submission API works on a single pipeline at a time, it 
would be a very convoluted workflow to introduce multiple pipelines in a single jar.

Would it be easier to just store the pipelines as separate jars in global 
storage (HDFS etc.) and pass the right jar at pipeline submission time?

In case the submission happens through a service, would it be easier and less 
error-prone to just keep these jars separately on the service and submit the 
right jar to Flink based on the parameter?



[jira] [Commented] (BEAM-8183) Optionally bundle multiple pipelines into a single Flink jar

2019-09-27 Thread Thomas Weise (Jira)


[ https://issues.apache.org/jira/browse/BEAM-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939795#comment-16939795 ]

Thomas Weise commented on BEAM-8183:


Hi Kyle, I have set up my Flink image build to use the jar file runner, but 
currently it can only bundle one pipeline into the job jar. It would need 
support for multiple Python entry points / different configurations. Just wanted 
to check how soon you are planning to work on this?

 
