Union is just combining data from multiple sources into a single dataset.
That’s it. No memory, no disk involved.
In you case you have
input1.union(input2).groupBy(1).reduce(…)
This will translate into:
input1 -> repartition ->
read-both-inputs -> sort -> reduce
input2 -> repartition ->
So, in your case not even additional network transfer is involved, because both
data sets would need to be partitioned for the reduce anyway.
Note, union in Flink has SQL union-all semantics, i.e., there is not removal of
duplicates.
Cheers, Fabian
From: Flavio Pompermaier
Sent: Monday, 22. December, 2014 14:32
To: [email protected]
Ok thanks Fabian. I'd like just to know the internals of the union of multiple
datasets (partitioning, distribution among server, memory/disk, etc..). Do you
have any ref to this?
Thanks in advance,
Flavio
On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <[email protected]> wrote:
Follow the first approach.
Joins are expensive, union comes for free.
Best, Fabian
2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <[email protected]>:
Hi guys,
In my use case I have multiple Datasets with the same structure (e.g. Tuple3)
and I want to produce an output Dataset containing all Tuple3 grouped by the
first field (0).
I can obtain the same results performing a union of all datasets and then a
group by (simplest implementation) or join all of them pairwise
(((A->B)->C)->D)..) or I don't know if there is any other solution. When should
I use the first or the second approach? Could you help me in figuring out the
internals of the two approaches? I always have some fear when using multiple
joins when I don't know exactly their size..
Best,
Flavio