[ https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219376#comment-15219376 ]
Daniel Halperin commented on BEAM-68: ------------------------------------- Okay, I think I'm partially wrong. KV<K, Iterable<V>> -> ParDo(process all elements in a single DoFn with per-K startBundle/endBundle/etc) is doable as a solution to BEAM-92. -It won't of course work with empty K, so you can't in fact guarantee numShards is matched. -It won't scale. -It overly restricts implementation. but I think it works, in essence, without a model change. Would you prefer to dupe 169 against 92? I don't see a need for more bug bloat here tho. Have suggested edits to the text of either bug that will fix? > Support for limiting parallelism of a step > ------------------------------------------ > > Key: BEAM-68 > URL: https://issues.apache.org/jira/browse/BEAM-68 > Project: Beam > Issue Type: New Feature > Components: beam-model > Reporter: Daniel Halperin > > Users may want to limit the parallelism of a step. Two classic uses cases are: > - User wants to produce at most k files, so sets > TextIO.Write.withNumShards(k). > - External API only supports k QPS, so user sets a limit of k/(expected > QPS/step) on the ParDo that makes the API call. > Unfortunately, there is no way to do this effectively within the Beam model. > A GroupByKey with exactly k keys will guarantee that only k elements are > produced, but runners are free to break fusion in ways that each element may > be processed in parallel later. > To implement this functionaltiy, I believe we need to add this support to the > Beam Model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)