On Mon, Oct 10, 2016 at 12:57 PM, Amit Sela <[email protected]> wrote:
>> > So this is basically a "FanOut" transformation which will depend on the
>> > available resources of the runner (and the uniqueness of the assigned
>> > keys) ?
>> >
>> > Would we want to Redistribute into a user-defined number of bundles (>
>> > current) ?
>>
>> I don't think there's any advantage to letting the user specify a
>> number here; the data is spread out among as many machines as are
>> handling the shuffling (for N elements, there are ~N unique keys,
>> which gets partitioned by the system to the M workers).
>>
>> > How about "FanIn" ?
>>
>> Could you clarify what you would hope to use this for?
>>
> Well, what if for some reason I would want to limit parallelism for a step
> in the Pipeline ? like calling an external service without "DDoS"ing it ?

I think this is something that is more difficult to enforce without
runner-specific support. For example, if one writes

    input.apply(Redistribute(N)).apply(ParDo(...))

one is assuming that fusion takes place such that the subsequent ParDo
doesn't happen to get processed by more shards than expected. It's
also much simpler to spread the elements out among 2^64 keys than to
spread them out among a small number N of keys, and choosing exactly N
keys isn't necessarily the best way to enforce parallelism constraints
(as this would likely introduce stragglers). One typically wants to
reduce parallelism over a portion (interval?) of a pipeline, whereas
redistribution operates at a single point in the pipeline.
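
To make the key-space point concrete, here is a toy Python sketch. It is
not Beam code and does not claim to model how any particular runner
shuffles: it assumes each element gets a uniformly random key and that a
key is routed to worker `key % M`. The function name `worker_loads` and
all the numbers are made up for illustration.

```python
import random
from collections import Counter


def worker_loads(num_elements, num_workers, key_space, seed=0):
    """Return the number of elements each worker receives.

    Toy model (an assumption, not any runner's actual behavior): every
    element gets a uniformly random key in [0, key_space), and a key is
    routed to worker `key % num_workers`.
    """
    rng = random.Random(seed)
    loads = Counter()
    for _ in range(num_elements):
        key = rng.randrange(key_space)
        loads[key % num_workers] += 1
    return [loads[w] for w in range(num_workers)]


# Wide key space: ~N unique keys for N elements, spread over M workers.
wide = worker_loads(num_elements=10_000, num_workers=4, key_space=2**64)

# Exactly 5 keys on 4 workers: keys 0 and 4 both land on worker 0.
narrow = worker_loads(num_elements=10_000, num_workers=4, key_space=5)

print("wide  :", wide)
print("narrow:", narrow)
```

In this model the wide case comes out close to even, while the narrow
case leaves worker 0 with roughly twice the work of the others, which is
the straggler effect. It also says nothing about what happens after the
shuffle, where the runner is still free to split or fuse work.
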

I agree that being able to limit parallelism (possibly dynamically,
based on pushback from an external service or on noticing that
throughput no longer scales linearly) would be a useful feature to
have, but that's a bit out of scope here.