[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2017-03-09 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903135#comment-15903135
 ] 

Eugene Kirpichov commented on BEAM-68:
--

I think this needs a discussion on the Beam dev@ mailing list. We should have a 
general approach to annotating pipeline elements with runner-specific 
information. I don't think "annotate the pipeline and let runners ignore it" is 
a good approach; the main reason being that this would violate the abstraction 
boundary, where a pipeline is first constructed in a runner-agnostic way (in 
fact while the runner is not even available), and then run. E.g. the set of all 
possible runner-specific annotations is not known in advance: while "step 
parallelism limit" seems relatively generic, suppose if say Apex allowed you to 
set an "Apex frobnication level" parameter on a step - it would look pretty 
weird if the Beam pipeline API had a withApexFrobnicationLevel method on a 
ParDo, and would introduce an illegal dependency from Beam to Apex.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2017-03-06 Thread Xu Mingmin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898151#comment-15898151
 ] 

Xu Mingmin commented on BEAM-68:


This's required by some runners. With this parameter, runners, like Flink/Storm 
can leverage it, and those, like Dataflow can ignore it.
I'm not sure about the existing implementation of Flink runner, seems like set 
in job level, meaning same parallelism for each step.

FYI Flink parallel 
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/parallel.html 
Storm parallel 
http://storm.apache.org/releases/1.0.1/Understanding-the-parallelism-of-a-Storm-topology.html

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2017-03-05 Thread Xu Mingmin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896589#comment-15896589
 ] 

Xu Mingmin commented on BEAM-68:


Notice this task when tuning a Beam job on Flink.
Would like to bring another perspective, that users want to have more control 
on the parallelism of a data pipeline, to allocate more resource for the busy 
steps, and less for the costless. A fixed parallelism could have performance 
bottleneck, several use cases like:
1. source from a Kafka topic, the parallelism could not be larger then topic 
partition number; similar for other splittable IOs?
2. fewer grouped keys than parallelism;
3. process on a small portion from large input;
4. +1 for case2, to address quota limitation on external dependencies;
  

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)