Hi Matthias,
when the shuffle happens is not defined by the Beam model so it depends on
the runner. You are right, though, that a runner can optimise execution
when you specify a CombineFn. In that case a runner can choose to combine
elements before shuffling to reduce the amount of data that we have to
shuffle across the network. With a SerializableFunction that's not possible
because we don't have an intermediate accumulation type as we have with a
CombineFn. Therefore, the runner has to ship all elements for a given key
to one machine to apply the SerializableFunction.

Regarding your second question, could you maybe send a code snipped? That
would allow us to have a look and give a good answer.

Cheers,
Aljoscha

On Tue, 22 Nov 2016 at 12:32 Matthias Baetens <
[email protected]> wrote:

> Hi there,
>
> I had some questions about the internal working of these two concepts and
> where I could find more info on this (so I might be able similar problems
> in the future myself. Here we go:
>
> + When doing a GroupByKey, when does the shuffling actually take place?
> Could it be the behaviour is not the same when using a CombineFn to
> aggregate compared to when using a Serializablefunction? (I have a feeling
> in the first case not all the keys get shuffled to one machine, while it
> does for the second).
>
> + When using Accumulators in a CombineFn, what are the actual internals?
> Is there any docs on this? The problem I run into is that, when I try
> adding elements to an ArrayList and then merge ArrayList, the output is an
> empty list. The problem could probably be solved by using a
> Serializablefunction to Combine everything at once, but you might loose the
> advantages of parallellisation in that case (~ above).
>
> Thanks a lot :)
>
> Best,
>
> Matthias
> --
>
> *Matthias Baetens*
>
>
> *datatonic | data power unleashed*
> office +44 203 668 3680 <+44%2020%203668%203680>  |  mobile +44 74 918
> 20646
>
> Level24 | 1 Canada Square | Canary Wharf | E14 5AB London
>

Reply via email to